
ARAUS: A Large-Scale Dataset and
Baseline Models of Affective Responses to
Augmented Urban Soundscapes

Kenneth Ooi,  Zhen-Ting Ong, Karn N. Watcharasupat,  Bhan Lam, 
Joo Young Hong, and Woon-Seng Gan
K. Ooi, Z.-T. Ong, B. Lam, and W.-S. Gan are with the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. Emails: {wooi002, ztong, bhanlam, ewsgan}@ntu.edu.sg. K. N. Watcharasupat was with the School of Electrical and Electronic Engineering, NTU, Singapore. She is currently with the Center for Music Technology, Georgia Institute of Technology, Atlanta, GA, USA. Email: kwatcharasupat@gatech.edu. J. Y. Hong is with the Department of Architectural Engineering, Chungnam National University, Daejeon, Republic of Korea. Email: jyhong@cnu.ac.kr. The research protocols used in this research were approved by the NTU Institutional Review Board (Ref. IRB-2020-08-035). The ARAUS dataset (including rejected data) is publicly available in the NTU research data repository DR-NTU (Data) at https://doi.org/10.21979/N9/9OTEVX. Replication code for analysis and baseline models in this paper is available at https://github.com/ntudsp/araus-dataset-baseline-models. ©2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Manuscript received Jul 3, 2022; revised Dec 22, 2022; accepted Feb 9, 2023.
Abstract

Choosing optimal maskers for existing soundscapes to effect a desired perceptual change via soundscape augmentation is non-trivial due to extensive varieties of maskers and a dearth of benchmark datasets with which to compare and develop soundscape augmentation models. To address this problem, we make publicly available the ARAUS (Affective Responses to Augmented Urban Soundscapes) dataset, which comprises a five-fold cross-validation set and independent test set totaling 25,440 unique subjective perceptual responses to augmented soundscapes presented as audio-visual stimuli. Each augmented soundscape is made by digitally adding “maskers” (bird, water, wind, traffic, construction, or silence) to urban soundscape recordings at fixed soundscape-to-masker ratios. Responses were then collected by asking participants to rate how pleasant, annoying, eventful, uneventful, vibrant, monotonous, chaotic, calm, and appropriate each augmented soundscape was, in accordance with ISO/TS 12913-2:2018. Participants also provided relevant demographic information and completed standard psychological questionnaires. We perform exploratory and statistical analysis of the responses obtained to verify internal consistency and agreement with known results in the literature. Finally, we demonstrate the benchmarking capability of the dataset by training and comparing four baseline models for urban soundscape pleasantness: a low-parameter regression model, a high-parameter convolutional neural network, and two attention-based networks in the literature.

Index Terms:
Soundscape, dataset, regression, deep neural network, soundscape augmentation, auditory masking

1 Introduction

Soundscape augmentation involves the addition of sounds, typically referred to as “maskers”, to existing acoustic environments in an effort to change the perception that real or hypothetical listeners may have towards them. This perception-based approach to noise mitigation was developed upon findings that indicators based purely on the sound pressure level are inefficient or ineffective at influencing the perception of soundscapes [1, 2, 3, 4]. For instance, based on a synthesis of studies in the literature, [5] found that at the same day-night level above 52 dB, the percentage of people rating aircraft noise as “highly annoying” significantly exceeded that for road traffic noise, and the percentage for road traffic noise in turn significantly exceeded that for rail noise. This is further backed by a numerical study conducted in [6].

Changes in perception have ordinarily been measured using subjective ratings of descriptors such as “soundscape quality” [7], “preference” [8], “perceived loudness” [9], “vibrancy” [10], and other adjectives described in the ISO/TS 12913-3:2019 circumplex model of soundscape perception [11]. These ratings, when aggregated over multiple human participants or soundscapes, typically form a set of indicators that soundscape practitioners can use to guide interventions or analyses of a given soundscape.

However, a perennial concern lies in choosing “optimal” or “appropriate” maskers to optimize the value of a given perceptual indicator, given the large varieties of possible maskers. This is important in real-life applications of soundscape augmentation systems, such as an automatic masker system that plays back maskers in real time in time-varying urban acoustic environments [12, 13].

Most prior studies have primarily addressed this concern via expert-guided or post hoc analysis [14, 15], but the number of maskers and masker types under consideration at a time tends to be relatively small, on the order of tens. This limits the choice of possible maskers for different scenarios and the generalizability of conclusions of soundscape augmentation studies to more diverse real-world scenarios.

Hence, a dataset of subjective perceptual responses to a broad variety of augmented urban soundscapes, comprising a large set of urban soundscapes and a correspondingly diverse set of maskers, would pave the way to systematic investigations of suitable maskers and masker types for urban soundscapes in general, as well as the development of better models for soundscape augmentation. Such a dataset could also alternatively be used for soundscape studies desiring a representative sample of soundscapes from a perceptual space or with given characteristics, such as those performed by [16] for pleasantness, [17] for soundscape quality, and [18] for validating translations of perceptual attributes.

However, to the best of our knowledge, there is currently no common, consolidated benchmark dataset for fair cross-comparison of prediction models in the soundscape literature. According to a systematic review of prediction models by [19], the largest datasets used by prediction models for soundscape perception in the literature also appear to individually number from the thousands to about ten thousand samples. As such, this work aims to craft a large-scale, public dataset of affective responses to soundscapes which can serve as a benchmark for designing models to guide masker choice in soundscape augmentation.

Our proposed Affective Responses to Augmented Urban Soundscapes (ARAUS) dataset strives to achieve this by using urban soundscapes covering a wide range of acoustic environments and maskers covering several sound types previously investigated in the soundscape literature. Additionally, the data collection protocol is compliant with the ISO/TS 12913-2:2018 [20] standard and is designed to be replicable with minimal specialized equipment, ensuring that this dataset can be extended by as many research groups as possible.

Due to the scope and nature of the proposed dataset, the collected data can also secondarily serve to empirically verify findings from existing soundscape literature. Additionally, the size of the dataset allows for relatively larger machine-learning models to be investigated along with traditional low-parameter models.

This article is organized as follows. Section 2 gives an overview of related soundscape datasets in the literature. Section 3 describes the stimuli generation and data collection method used. Section 4 describes exploratory and confirmatory data analyses of the collected responses. Section 5 presents benchmark models for prediction of perceptual responses to augmented soundscapes. Section 6 discusses the limitations and future directions for this work. Finally, Section 7 summarizes the findings of this work and makes some concluding remarks.

2 Related Datasets

Large-scale audio datasets have been made publicly available as unlabeled and/or labeled data for several “objective” acoustic tasks such as acoustic scene classification [21, 22], sound event localization and detection [23], and anomaly detection [24]. However, affective audio datasets tend to be smaller and rarely reach the scale of those developed for objective tasks. In particular, to the best of our knowledge, there is no publicly available affective soundscape dataset at the scale of the proposed ARAUS dataset.

In this section, a selection of large-scale and affective audio datasets is reviewed. A brief overview of key details for the datasets discussed is shown in Table I.

2.1 Large-scale Audio Datasets

The largest curated audio dataset is arguably AudioSet, which at the time of its initial release contained 1.7M 10-second segments of audio from YouTube videos organized in a systematic ontology for audio tagging [25]. Strong labels have subsequently been provided for a subset of the audio [26], and at the time of writing, AudioSet has grown to about 2.1M segments, with 120K being strongly labeled [27]. However, the publicly available version of the data constituting AudioSet is in the form of 128-dimensional VGGish features [28] due to potential copyright issues related to the use of raw audio from YouTube videos, which limits its use as a public dataset.

Alternatives to AudioSet include UrbanSound8K [29], ESC-50 [30], and FSD50K [31], which contain labelled Creative Commons-licensed tracks from Freesound [32] and serve as publicly-available benchmark datasets for weak audio classification. However, the nature of the stimuli in these datasets tends to be monophonic or same-class polyphonic. While this is useful in reducing the complexity and noise in inputs to train robust sound event classification models, the full complexities of real-life acoustic environments necessary for soundscape research are rarely represented in these datasets. Individual monophonic stimuli may be used to compose synthetic soundscapes like in the URBAN-SED dataset [33], which used the tracks in UrbanSound8K, but the focus of UrbanSound8K on just 10 possible event classes may be insufficient to emulate the variety of sound sources possible in real-life urban environments.

Therefore, multiple efforts have been made to record real-life acoustic environments, which are generally polyphonic in nature, and provide them as publicly available datasets for use in sound and soundscape research. These include Urban Soundscapes of the World (USotW) [34]; EigenScape [35]; SONYC Urban Sound Tagging (SONYC-UST) [36] and SONYC-UST-V2 [37]; TUT Acoustic Scenes [38, 39] and TAU Urban Acoustic Scenes (TAU-UAS) [40, 41, 42]; Singapore Polyphonic Urban Audio (SINGA:PURA) [43]; Ambisonics Recordings of Typical Environments (ARTE) [44]; and Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) [45].

However, the aforementioned datasets contain no corresponding “subjective” labels concerning the affective perception of the recorded environments. This limits their use as datasets for soundscape augmentation, because knowing how individual soundscapes are perceived by humans is crucial to analyzing and modeling their perception. In addition, with the exception of USotW, the recordings were not compliant with ISO/TS 12913-2:2018 [20], either because they preceded the publication of the standard or because they were made for a different purpose. Hence, they may not be immediately suitable for use under the ISO 12913 paradigm, which the ARAUS dataset was designed under.

2.2 Affective Sound Datasets

Nonetheless, audio datasets with labels specific to perceptual indicators of the stimuli also exist. For example, the International Affective Digitized Sounds (IADS) dataset [46] comprises 167 individual stimuli, for which labels for the discrete emotional categories they elicit were provided by [47], and an expanded version (IADS-E) was provided by [48]. The largely monophonic stimuli in IADS, however, suffer from the same drawbacks as the datasets based on Freesound. As a workaround, the Emo-Soundscapes dataset [49] used individual clips from Freesound to synthetically generate 1213 soundscapes, each 6 s long, and obtained corresponding valence–arousal labels based on the Self-Assessment Manikin (SAM) [50] from participants on the CrowdFlower platform.

TABLE I: Selection of datasets related to the ARAUS dataset. Datasets with multiple versions only have their latest or most complete version listed.
Dataset Year Samples Length Locations/Sources Labels
AudioSet [25] 2017 2.1M 10 s From YouTube Weak sound events
AudioSet Strong [27] 2021 120K 10 s From AudioSet Strong sound events
UrbanSound8K [29] 2014 8.7K ≤4 s From Freesound Weak sound events
ESC-50 [30] 2015 2.0K 5 s From Freesound Weak sound events
URBAN-SED [33] 2017 10.0K 10 s From UrbanSound8K Strong sound events
FSD50K [31] 2020 51.2K ≤30 s From Freesound Weak sound events
SONYC-UST-V2 [37] 2020 18.5K 10 s Public locations in New York City Weak sound events
SINGA:PURA [43] 2021 6.5K 10 s Public locations in Singapore Strong sound events
STARSS22 [45] 2022 173 ≤300 s Indoor environments in Tampere & Tokyo Strong sound events + direction of arrival
USotW [34] 2017 127 60 s Urban public spaces in 9 cities worldwide Acoustic scenes
EigenScape [35] 2017 64 600 s Public locations in North England Acoustic scenes
ARTE [44] 2019 13 120 s Indoor environments in Sydney Acoustic scenes
TAU-UAS [41] 2019 23.0K 10 s Urban public spaces in 12 European cities Acoustic scenes
IADS-2 [46] 2007 167 6 s From digital recordings SAM
IADS-E [48] 2018 935 6 s From IADS-2, Internet, or composer SAM + basic emotion ratings
Emo-Soundscapes [49] 2017 1.2K 6 s From Freesound SAM
ATHUS [51] 2019 978 ≤79 s Various locations in Athens Subjective soundscape quality
ISD [52] 2021 1.2K 30 s Urban public spaces in 13 European cities Affective + other labels per SSID Protocol [53]
ARAUS [Proposed] 2022 25.4K 30 s USotW, public locations in Singapore (base); Freesound, Xeno-canto (maskers) Affective responses (+ contextual information, see Section 3)

In contrast, datasets with perceptual labels for real-life audio recordings include the Athens Urban Soundscape (ATHUS) dataset [51], which contains 978 crowd-sourced recordings with corresponding labels for subjectively-rated soundscape quality on a five-point Likert scale, as well as the International Soundscape Database (ISD) [52], which as of v0.2.1 consists of 1258 30-second-long recordings in 13 European cities and corresponding perceptual responses collected using the Soundscape Indices (SSID) Protocol [53]. The perceptual responses collected using the SSID Protocol were largely inspired by the Method A questionnaire in ISO/TS 12913-2:2018 [20].

However, the relatively smaller sample sizes of these datasets, as compared to large-scale datasets like AudioSet, may preclude their use in developing high-parameter models, such as deep neural networks, which have shown state-of-the-art performance in various “objective” acoustic tasks. Hence, the ARAUS dataset was designed at a relatively large scale to be amenable to high-parameter models.

3 Data Collection Methodology

Since the primary focus of this study is to create a database of affective responses to augmented soundscapes that is extensive yet extensible, the data collection methodology must necessarily be modular and repeatable to ensure valid analysis of results and enable possible future extensions to the dataset, similar to the philosophy of USotW [34] and ISD [52].

Hence, the ARAUS dataset was designed as a five-fold cross-validation dataset with an independent test set, such that additional folds or data for each fold can be added following the same data collection methodology described in this section. Such a design also allows for each fold to be treated as an independent dataset for training of ensemble models or meta-learning.

To design the cross-validation set, we prepared separate recordings of real-life “base” urban soundscapes and of maskers that could potentially be used to augment those “base” urban soundscapes. The same procedure was used to split the urban soundscape and masker recordings independently into five folds to ensure no data leakage, and combine them within their folds into augmented soundscapes that form the audio-visual stimuli in the ARAUS dataset. The audio-visual stimuli were presented to participants in laboratory conditions and their affective responses to the stimuli were collected. However, responses belonging to participants who responded in an “overly inconsistent” manner were dropped from the dataset (see Section 3.9) and new participants were recruited to replace the dropped responses such that each fold in the five-fold cross-validation set had an equal number of data samples. A summary of the data collection methodology is shown in Fig. 1, and full details of each step are provided in the following subsections.

The test set was designed in the same manner as the cross-validation set, but used no urban soundscapes, maskers, or participants already in the cross-validation set.

Figure 1: Framework of the study methodology.

3.1 Base Urban Soundscapes

For the five-fold cross-validation set, all base urban soundscapes were taken from the Urban Soundscapes of the World (USotW) database [34]. The USotW database contains 127 publicly available recordings of urban soundscapes covering an extensive variety of urban environments from various cities around the world. The urban environments range from parks to busy streets, and the cities include those in Asia, Europe, and North America, which allows the ARAUS dataset to be broad-ranging in its coverage of real-life urban soundscapes. The urban soundscapes in the USotW dataset were also chosen by the USotW team via clustering of locations reported by local experts to be “full of life and exciting”, “chaotic and restless”, “calm and tranquil”, and “lifeless and boring”, which are adjectives spanning the perceptual space generated by the “Pleasantness” and “Eventfulness” axes of the ISO/TS 12913-3:2019 circumplex model [11]. This heightens the suitability of the USotW database for use in the investigation of the affective qualities of soundscapes that the ARAUS dataset aims to enable.

Each 60-second binaural recording in the USotW database was split into two halves of 30 s for the creation of the audio-visual stimuli in the ARAUS dataset. Consequently, the audio-visual stimuli in the ARAUS dataset are all 30 s in length, which is in line with the stimulus length used in several soundscape studies, such as those in [54, 55, 56]. Upon splitting the recordings into two halves, we discarded any half with (1) audible electrical noise (such as that caused by a faulty microphone or loose connection), in order to reflect only accurately-captured real-life soundscapes; (2) measured in-situ $L_{\text{A,eq}}$ values below 52 dB, in order to ensure that reproduction levels were significantly above the noise floor of the laboratory location with the highest noise floor (about 37 dB; see Fig. 3) where the subjective responses were obtained; and/or (3) measured in-situ $L_{\text{A,eq}}$ values above 77 dB, in order to ensure safe listening levels [57] for the participants.

For each half of the binaural recordings that was not discarded, a 0°-azimuth, 0°-elevation field of view (FoV) was cropped out of the corresponding 360° video from the USotW database to form the audio-visual stimulus corresponding to that urban soundscape.

For the test set, six urban soundscapes were recorded at locations within Nanyang Technological University (NTU), Singapore. The recordings were made in a similar manner to those in the USotW database, using equipment in accordance with the SSID Protocol [53]. Post-processing of the test set recordings for use as part of the stimuli for the ARAUS dataset was done in the same manner as that for the five-fold cross-validation set.

In total, this formed a base of 234 urban soundscape recordings for the five-fold cross-validation set of the ARAUS dataset, and an additional base of six soundscapes (not overlapping with the other 234) for the independent test set. Further details on the exact recordings used can be found in Appendix A.

3.2 Maskers

The masker recordings for both the five-fold cross-validation set and the independent test set were derived from source tracks found in the public databases Freesound [32] and Xeno-canto [58]. Both databases host tracks with Creative Commons licenses, with Xeno-canto being a repository of bird calls and Freesound being a more general repository of sound samples and recordings.

The source tracks from both databases that we determined to be relevant to the ARAUS dataset fell into one of the following classes: bird, construction, traffic, water, and wind. Water [59, 60, 61], bird [54, 61, 62], and wind [63] sounds have previously been investigated in soundscape studies as natural-sound maskers. On the other hand, sounds from traffic and construction are ubiquitous noise sources in urban environments [64], and are commonly investigated in the soundscape literature [65, 66, 67]. Therefore, the selection of maskers for the ARAUS dataset covers a variety of urban sounds investigated in the soundscape literature.

The source tracks for maskers corresponding to the “bird” class were first obtained by randomly picking a selection of high-quality tracks of birds on Xeno-canto. Each source track corresponded to bird(s) from a single species, as labeled on Xeno-canto. Additional tracks for the “bird” class and all other classes were obtained via the corresponding search term on Freesound, and picking a selection of “high-quality” tracks containing 30-second sections of sound that corresponded only to that particular masker class, as determined by manual listening. However, the exact number of sources of a single masker class present in a given track was variable, so for instance, a given track of the “bird” class could contain vocalizations from one, two, or more birds, but all tracks of the “bird” class contain only bird vocalizations. Further details on the source tracks used to create the maskers are given in Appendix A.

Each source track was then processed individually to create 30-second single-channel masker recordings. Single-channel recordings of maskers were used because [68] previously found single-channel recordings to be sufficient to replicate the perceived affective quality of soundscapes, which the ARAUS dataset aims to collect responses for. For source tracks that were originally multi-channel, only the first channel was used for consistency. Source tracks originally longer than 30 s were trimmed to 30 s, while those originally shorter than 30 s were either padded with silence or looped. Finally, noise reduction via spectral gating and high-pass filtering was performed for source tracks in the “bird” class to reduce ambient and/or microphone noise in the track. All pre-processing was done manually using Audacity (v2.3.2).
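As a rough illustration of this pre-processing step (excluding the manual noise reduction, which was done in Audacity), the following sketch reduces a source track to a 30-second single-channel masker by taking the first channel and then trimming, looping, or padding to length. The function name and the use of the soundfile library are illustrative assumptions rather than the tooling actually used.

```python
import numpy as np
import soundfile as sf

TARGET_S = 30  # target masker length in seconds

def make_masker(in_path: str, out_path: str, loop: bool = True) -> None:
    """Reduce a source track to a 30-second, single-channel masker recording."""
    audio, fs = sf.read(in_path, always_2d=True)  # shape: (samples, channels)
    mono = audio[:, 0]                            # keep only the first channel
    target_len = TARGET_S * fs
    if len(mono) >= target_len:                   # trim long tracks ...
        mono = mono[:target_len]
    elif loop:                                    # ... loop short tracks ...
        reps = int(np.ceil(target_len / len(mono)))
        mono = np.tile(mono, reps)[:target_len]
    else:                                         # ... or pad them with silence
        mono = np.pad(mono, (0, target_len - len(mono)))
    sf.write(out_path, mono, fs)
```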

In total, this formed a set of 280 masker candidates (56 per fold) that were used to generate the stimuli for the five-fold cross-validation set, and a set of seven maskers for the independent test set. The breakdown of the number of masker recordings by class was 80 bird, 40 construction, 40 traffic, 80 water, and 40 wind for the cross-validation set; and 2 bird, 1 construction, 1 traffic, 1 water, and 2 wind for the independent test set.

3.3 Fold Allocation

After preparing the urban soundscape recordings and maskers as described in Sections 3.1 and 3.2, the tracks were assigned to the five folds of the cross-validation set such that the distributions of psychoacoustic properties of the urban soundscapes and maskers were similar across the five folds. Since psychoacoustic indicators of a given soundscape have non-trivial and non-spurious correlations with corresponding perceptual indicators [69], taking psychoacoustic indicators into account was necessary to minimize distributional shifts across the folds.

The assignment procedure consisted of the following steps, carried out for the urban soundscape recordings and independently for each class of masker tracks.

3.3.1 Track Calibration

Each recording track was calibrated to a pre-defined A-weighted equivalent sound pressure level ($L_{\text{A,eq}}$). For the base urban soundscapes, the in-situ $L_{\text{A,eq}}$ measured at the time of recording was used, while a constant value of 65 dB was used for the maskers, similar to [70].

3.3.2 Acoustic and Psychoacoustic Indicator Computation

For all recordings, summary statistics for acoustic and psychoacoustic indicators, as recommended by ISO/TS 12913-3:2019 [11], were calculated independently for each channel using ArtemiS SUITE (HEAD Acoustics). The indicators comprised sharpness [71], loudness [72], fluctuation strength [73], roughness [73], tonality [74, 75], $L_{\text{A,eq}}$ [76], and the C-weighted equivalent sound pressure level ($L_{\text{C,eq}}$) [76]. Finally, band powers summed over third-octave bands with center frequencies from 5 Hz to 20 kHz were also calculated. Table II shows the summary statistics calculated for each indicator.

With the exception of the tonality in [74], the MATLAB Audio Toolbox (https://www.mathworks.com/products/audio.html) provides standards-compliant implementations of all other psychoacoustic indicators used. Open-source implementations of the psychoacoustic indicators used are either currently available, or have been indicated as planned, as part of the MOSQITO Toolbox [77].

TABLE II: Acoustic and psychoacoustic indicators used as channel-wise summary statistics. The set of summary statistics indicated as “common” comprises the mean, the maximum, the 5% exceedance level, the exceedance level at each decile (10% to 90%), and the 95% exceedance level. Minimum values for sharpness, loudness, fluctuation strength, roughness, and tonality were zero for all stimuli and hence omitted from analysis.
Indicator Unit Summary statistics
Sharpness [71] acum common
Loudness [72] sone common + root mean cube
Fluctuation strength [73] vacil common
Roughness [73] asper common
Tonality [74, 75] tuHMS common
$L_{\text{A,eq}}$ [76] dB common + minimum
$L_{\text{C,eq}}$ [76] dB common + minimum
Spectral powers dB third-octave band-wise sum
(center freq. 5 Hz to 20 kHz)
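To make the “common” summary statistics concrete, the short sketch below computes them from a time series of frame-wise indicator values (e.g., loudness over time). It is an illustrative reimplementation in Python, not the ArtemiS SUITE computation actually used for the dataset.

```python
import numpy as np

def common_statistics(x: np.ndarray) -> dict:
    """Mean, maximum, and exceedance levels (5%, each decile, 95%) of a time
    series of indicator values, i.e. the "common" statistics in Table II."""
    stats = {"mean": float(np.mean(x)), "max": float(np.max(x))}
    for n in (5, *range(10, 100, 10), 95):
        # The N% exceedance level is the value exceeded N% of the time,
        # i.e. the (100 - N)-th percentile of the distribution of values.
        stats[f"L{n}"] = float(np.percentile(x, 100 - n))
    return stats

def root_mean_cube(x: np.ndarray) -> float:
    """Additional statistic used for loudness in Table II."""
    return float(np.cbrt(np.mean(x ** 3)))
```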

3.3.3 Dimensionality Reduction

The summary statistics were then used as input features for a principal component analysis (PCA), which was used to project each recording into a principal component space with enough dimensions to achieve 90% explained variance. Further details on the PCA can be found in Appendix D.

The primary reason for using PCA in the assignment of folds was to remove correlations between multiple variables, which is desirable for noise and dimensionality reduction prior to clustering, as illustrated by [78], which removed correlated variables before clustering soundscape recordings across eight acoustic indices.
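A minimal sketch of this dimensionality-reduction step is shown below, using scikit-learn and assuming a `features` matrix whose rows are recordings and whose columns are the channel-wise summary statistics. Standardizing the features before the PCA is an assumption made here; the exact configuration used for the dataset is detailed in Appendix D.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def project_to_pc_space(features: np.ndarray, explained: float = 0.90) -> np.ndarray:
    """Project recordings (rows of `features`) onto enough principal
    components to reach the requested fraction of explained variance."""
    scaled = StandardScaler().fit_transform(features)     # z-score each statistic
    pca = PCA(n_components=explained, svd_solver="full")  # float: variance target
    coords = pca.fit_transform(scaled)
    print(f"Kept {pca.n_components_} components "
          f"({pca.explained_variance_ratio_.sum():.1%} variance explained)")
    return coords
```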

3.3.4 Clustering and Fold Assignment

With the coordinates of each recording in the principal component space, the recordings were organized into clusters of five using a self-organizing map (SOM) [79].

For each cluster of five recordings, each recording in the cluster was randomly assigned to a distinct fold of the cross-validation set. To prevent data leakage, the assignment was done based on psychoacoustic indicators computed from the original 60-second binaural recordings from the USotW database, such that the 30-second halves in the ARAUS dataset originating from the same original binaural recording were always assigned to the same fold.
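The within-cluster assignment can be sketched as follows, assuming a hypothetical `clusters` list in which each element holds the identifiers of the five recordings grouped together by the SOM (with the two halves of the same USotW recording treated as a single item, per the data-leakage constraint above).

```python
import numpy as np

rng = np.random.default_rng(2022)  # seed shown here only for reproducibility

def assign_folds(clusters: list, n_folds: int = 5) -> dict:
    """Randomly assign each recording in a cluster of five to a distinct fold."""
    fold_of = {}
    for cluster in clusters:
        assert len(cluster) == n_folds, "each SOM cluster holds one item per fold"
        for recording, fold in zip(cluster, rng.permutation(n_folds)):
            fold_of[recording] = int(fold)  # folds numbered 0 to 4
    return fold_of
```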

3.4 Generation of Stimuli

Each stimulus in the ARAUS dataset is an augmented soundscape to be presented as a 30-second audio-visual stimulus to a human participant, and the procedure used to generate each stimulus is shown in Fig. 2.

Figure 2: Illustration of stimulus generation procedure for a single stimulus.

The audio for the augmented soundscapes was made by combining the 30-second binaural recordings of urban soundscapes in Section 3.1 with the 30-second single-channel recordings of maskers obtained in Section 3.2 at various gain levels, via element-wise addition of their respective gain-adjusted time-domain signals. For the purposes of stimulus generation, we also included silence in the set of possible maskers, since the addition of silence simply replicates the condition where no masker is added.

For the five-fold cross-validation set, the urban soundscapes and maskers were chosen randomly from the same fold for combination to prevent data leakage and provide a sample of all possible combinations. Since the fold allocation procedure described in Section 3.3 ensured that the sets of urban soundscapes and maskers used for each fold of the cross-validation set were disjoint, the augmented soundscapes generated from them were disjoint between each fold of the cross-validation set as well. On the other hand, for the independent test set, the urban soundscapes and maskers were exhaustively combined to cover all 48 possible combinations.

Before combining the urban soundscape and masker recordings for the five-fold cross-validation set, the urban soundscape recordings were calibrated to the in-situ $L_{\text{A,eq}}$ levels measured at the time of recording, and the maskers were calibrated to specific soundscape-to-masker ratios (SMRs) with respect to the urban soundscape, chosen randomly from the set {−6, −3, 0, +3, +6} dB. For example, if the urban soundscape recording had an in-situ $L_{\text{A,eq}}$ of 65 dB and an SMR of +3 dB was randomly chosen, then the masker would be calibrated to an $L_{\text{A,eq}}$ of 62 dB. This gave a total of 65,520 possible augmented soundscapes from which we sampled for the ARAUS dataset. All calibration was done via the automated method described by [80]. On the other hand, for the independent test set, a fixed SMR of 0 dB was used for all combinations. This was to limit the number of stimuli presented to each participant in the test set and the effect of listener fatigue, since the participants assigned to the test set were required to rate all combinations exhaustively, as explained in Section 3.7.

Lastly, the single-channel recordings were added to each channel of the binaural tracks, in a manner similar to that in [81], with a fixed stereo panning coefficient of 0.5. This audio was then overlaid onto the 0°-azimuth, 0°-elevation field of view cropped from the 360° video recordings from the USotW database taken at the same time as the binaural urban soundscape recordings to form the audio-visual stimuli.
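The arithmetic of the calibration and mixing steps can be sketched as follows. This is not the exact calibration pipeline used for the dataset (which follows the automated method of [80]), and the panning step of [81] is deliberately omitted and only noted in a comment.

```python
import numpy as np

def smr_gain(soundscape_laeq: float, masker_laeq: float, smr_db: float) -> float:
    """Linear gain that places the masker's L_A,eq SMR dB below the soundscape's.
    E.g. a soundscape at 65 dB with SMR = +3 dB gives a masker target of 62 dB."""
    target_laeq = soundscape_laeq - smr_db
    return 10.0 ** ((target_laeq - masker_laeq) / 20.0)

def mix_augmented_soundscape(
    soundscape: np.ndarray,  # binaural, shape (samples, 2), at in-situ level
    masker: np.ndarray,      # single-channel, shape (samples,)
    gain: float,             # from smr_gain()
) -> np.ndarray:
    """Element-wise addition of the gain-adjusted masker to both channels.
    (The stereo panning of [81], with coefficient 0.5, is omitted here.)"""
    return soundscape + gain * masker[:, np.newaxis]
```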

3.5 Participant Recruitment

Prior ethical approval was obtained from the Institutional Review Board, NTU (Ref. IRB 2020-08-035) before participant recruitment and response collection. Participants were recruited via online messaging channels, posters, and emails.

In total, 642 unique participants were recruited to provide their responses for the ARAUS dataset, of which 37 (5.76%) had their responses rejected. Hence, responses from only 605 participants were included in the final dataset. We rejected responses from participants who (1) failed a hearing test (19 participants, 2.96%); (2) failed more than three out of seven consistency checks described in Section 3.9 (17 participants, 2.65%); or (3) provided the same responses to any item in the Affective Response Questionnaire (ARQ) described in Section 3.7 for all stimuli they were presented, thereby providing no useful information to the dataset (3 participants, 0.47%). Two of the rejected participants failed both the hearing test and more than three out of seven consistency checks.

Participants aged under 30 were considered to have failed the hearing test if they had a mean threshold of hearing above 20 dB, and participants aged 30 and above were considered to have failed if they had a mean threshold of hearing above 30 dB, as measured via pure-tone audiometry using the uHear application on a mobile phone (Apple iPhone 4S) with earbuds (Apple EarPods). The tested frequencies were 0.5, 1, 2, 4, and 6 kHz. These were within the standard ranges used for screening in pure-tone audiometry [82] and previous soundscape research [83]. The higher threshold of 30 dB was applied to participants aged 30 and above to balance the risk of age bias (since age is highly correlated with hearing ability [84]) against the need to ensure that hearing loss did not interfere with the perception of the augmented soundscapes in the ARAUS dataset.

The participants were each assigned a fold such that each fold of the five-fold cross-validation set had responses from 120 participants. The independent test set had responses from 5 participants. The age and gender distributions of the participants are shown in Table III, and further information on participant demographics can be found in Appendix C. Due to the test sites for the data collection process being located within university campuses (as shown in Fig. 3), a majority of the 605 participants whose responses were included in the ARAUS dataset were students (443 participants, 73.2%) who were in the process of obtaining their bachelor’s degree (380 participants, 62.8%). Participants were also relatively young (mean age 26.7 years, standard deviation 10.0 years) compared to those in the ISD (mean age 33.8 years, standard deviation 14.57 years) [85], but slightly older compared to those in the IADS-2 (college students) [46] and IADS-E (mean age 21.32 years, standard deviation 2.38 years) [48] datasets.

TABLE III: Age and gender distribution of participants in the ARAUS dataset, by fold and in entirety. (To safeguard their identities, the small number of participants identifying as neither male nor female (2 participants, 0.33%) were randomly assigned as “male” or “female” in the public release of the ARAUS dataset, each with a probability of 50%.)
Fold
Statistic Test 1 2 3 4 5 All
Sample size 5 120 120 120 120 120 605
# female 4 62 56 69 68 69 328
# male 1 58 64 51 52 51 277
Mean age 22.4 27.4 26.7 26.3 27.3 26.3 26.7
Std. dev. of age 5.5 10.2 9.8 10.4 11.0 8.7 10.0
Minimum age 18 19 18 18 18 18 18
Median age 21 24 24 23 23 24 24
Maximum age 32 63 71 68 65 60 71

3.6 Listening Conditions

Participants were presented with all audio-visual stimuli via closed-back headphones (Beyerdynamic Custom One Pro) powered by an external sound card (SoundBlaster E5). The video was presented via a 23-inch monitor (Philips 236E SoftBlue) and measured 21.5 cm by 12 cm on the screen. Participants sat about 1 m from the monitor, facing it.

Due to the large number of participants involved in the experiment, participants listened to the stimuli in one of four quiet rooms, three located in NTU and one in the Singapore University of Technology and Design (SUTD). Photos of each of the four quiet rooms are shown in Fig. 3.

Figure 3: Test sites at (top left) Academic Media Studio, SUTD, (top right) Media Technology Laboratory, NTU, (bottom left) Demo Room, NTU, (bottom right) Interactive Soundscape Room, NTU. Their noise floors, measured as $L_{\text{A,eq,3-min}}$ values with a B&K Sound Level Meter Type 2240, were 20.6 dB, 26.0 dB, 36.9 dB, and 30.2 dB, respectively.

3.7 Affective Response Questionnaire

For each audio-visual stimulus, participants were instructed to perform their evaluations given the following prompt:

Imagine that you are standing at the location shown in the video, listening to the sound environment playing through the headphones.

After completely experiencing the stimulus at least once, participants were presented with a set of 9 questions from the Method A questionnaire in ISO/TS 12913-2:2018 [20], which we refer to as the “Affective Response Questionnaire” (ARQ). The first 8 questions were related to the perceived affective quality of the sound they heard over the headphones:

To what extent do you agree or disagree that the present surrounding sound environment is {pleasant, eventful, chaotic, vibrant, uneventful, calm, annoying, monotonous}?

The last question was related to the appropriateness of the location depicted in the video they saw on the monitor with respect to the sound they heard over the headphones:

To what extent is the present surrounding sound environment appropriate to the present place?

Participants responded to all questions on a five-point Likert scale via a computerized graphical user interface (GUI) depicted in Fig. 4, and we coded their responses to values in $\{1, 2, 3, 4, 5\}$ to match the scale.

Figure 4: GUI used to administer the ARQ for the ARAUS dataset.

For the cross-validation set, participants responded to the ARQ for 42 unique, randomly-selected stimuli from the fold that they were assigned to. For the independent test set, all participants responded to the ARQ for the same 48 exhaustively-generated stimuli.

In addition to the “main” 42 (48) stimuli shown to each participant in the cross-validation (test) set, we presented each participant with three auxiliary stimuli. Participants were also required to provide responses to the ARQ for these stimuli, but the responses are not part of the ARAUS dataset, because these stimuli were the same for every participant and did not serve the same purpose as the main stimuli. These stimuli were namely:

  (1) A “pre-experiment” stimulus, which was shown as the first stimulus before presenting the main stimuli for the cross-validation/test set.

  (2) An “attention” stimulus, which was identical to the “pre-experiment” stimulus and shown in between two randomly selected “main” stimuli for the cross-validation/test set. For this stimulus, the GUI had special instructions for the participant to choose specific options for the ARQ before they could proceed.

  (3) A “post-experiment” stimulus, which was identical to the “pre-experiment” stimulus and shown as the last stimulus after presenting the main stimuli for the cross-validation/test set. ARQ responses to this stimulus were used for internal consistency checks, as described in Section 3.9.

3.8 Participant Information Questionnaire

Since the ISO 12913-1:2014 definition of “soundscape” mandates that acoustic environments must be perceived in context [86], information specific to the listeners who participated in the study was also necessary to understand the context behind listener perceptions of the augmented soundscapes presented as stimuli. Hence, we administered another questionnaire, which we dub the “Participant Information Questionnaire” (PIQ), to each participant after they had completed the ARQs for the stimuli shown to them. The PIQ consisted of items related to basic demographic information and standard psychological questionnaires.

Items related to basic demographic information consisted of age, gender, spoken languages, citizenship status, education status, occupational status, dwelling type, ethnicity, and length of residence of the participant.

The psychological questionnaires administered were (1) a shortened, 10-item version of the Weinstein Noise Sensitivity Scale (WNSS-10) [87]; (2) a shortened, 10-item version of the Perceived Stress Scale (PSS-10) [88]; (3) the WHO-5 Well-Being Index [89]; and (4) the Positive and Negative Affect Schedule (PANAS) [90].

These questionnaires are relevant because noise sensitivity [91] and stress level [92] play a significant role in the affective perception of soundscapes, the WHO-5 Well-Being Index is used in the Soundscape Indices (SSID) Protocol [53], and PANAS was previously used by [93] as a measure of participants’ mood in a study of the emotional salience of sounds in soundscapes. The PSS-10 has been validated by the authors of the original 14-item questionnaire [94], but the particular shortened version of the Weinstein Noise Sensitivity Scale (WNSS-10) used for the ARAUS dataset has not previously been validated in the literature. Nonetheless, the internal reliability of the WNSS-10 is affirmed by the results of the analysis in Section 4.2.

Full details of the PIQ with individual questions, options, and response coding for each section of the questionnaire can be found in Appendix B.

3.9 Consistency Checks

In order to ensure a baseline level of data quality in the responses to the ARQ, seven consistency checks were performed on each participant’s responses. The consistency checks were designed as single-value metrics, and a participant was considered to have failed a consistency check if the corresponding value of the metric was at least 1. If a participant failed more than three out of seven consistency checks, their responses were dropped from the dataset.

To describe the single-value metrics, we first define $r_{\text{pl}}, r_{\text{ev}}, r_{\text{ch}}, r_{\text{vi}}, r_{\text{ue}}, r_{\text{ca}}, r_{\text{an}}, r_{\text{mo}}, r_{\text{ap}} \in \{1, 2, 3, 4, 5\}$ as the responses to the ARQ in Section 3.7 regarding the extent to which the sound environment in a given stimulus was respectively pleasant, eventful, chaotic, vibrant, uneventful, calm, annoying, monotonous, and appropriate.

The seven consistency checks consist of three types of checks. The first check measures the mean absolute difference (MAD) between ARQ responses on the “pre-experiment” and “post-experiment” stimuli. Since the “pre-experiment” and “post-experiment” stimuli were identical, a perfectly consistent participant would have provided the same responses to both presentations.

Since the affective descriptors “pleasant” and “annoying” lie at opposite ends of the same axis of the ISO/TS 12913-3:2019 circumplex model of soundscape perception [11], a perfectly consistent participant would have provided the same response for “pleasant” and “annoying” if the Likert coding of either were to be reversed. The same considerations apply to the other three pairs of opposite attributes on the circumplex model. Therefore, the next four checks consider the MADs between $r_p$ and $(6 - r_q)$, where $(p, q)$ is a pair of opposite descriptors.

Lastly, the mean squared error (MSE) between $r_{\text{pl}}$ and $(3 + 2P)$, and that between $r_{\text{ev}}$ and $(3 + 2E)$, across all stimuli presented was computed, where

$$P = k^{-1}\left(\sqrt{2}\,r_{\text{pl}} - \sqrt{2}\,r_{\text{an}} + r_{\text{ca}} - r_{\text{ch}} + r_{\text{vi}} - r_{\text{mo}}\right), \quad (1)$$
$$E = k^{-1}\left(\sqrt{2}\,r_{\text{ev}} - \sqrt{2}\,r_{\text{ue}} - r_{\text{ca}} + r_{\text{ch}} + r_{\text{vi}} - r_{\text{mo}}\right), \quad (2)$$

respectively are the normalized values of “ISO Pleasantness” and “ISO Eventfulness” as suggested in [95], and $k = 8 + \sqrt{32}$ is a normalization constant such that $P, E \in [-1, 1]$. Since the affective descriptor “pleasant” theoretically parallels the principal axis of “Pleasantness” in the ISO/TS 12913-3:2019 circumplex model of soundscape perception, a perfectly consistent participant would have provided responses matching in magnitude in both directions. The rescaling of $P$ as $(3 + 2P)$ was necessary to match the range of values of $r_{\text{pl}}$ for a valid comparison. Similar considerations apply for the affective descriptor “eventful” and the principal axis “Eventfulness”.
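To make these checks concrete, the sketch below computes $P$, $E$, and the three types of consistency metrics from a participant’s coded ARQ responses. It is an illustrative reimplementation consistent with Equations (1) and (2), using hypothetical dictionary keys for the nine ARQ attributes, rather than the exact replication code.

```python
import numpy as np

K = 8 + np.sqrt(32)  # normalization constant so that P, E lie in [-1, 1]

# Hypothetical keys: "pl", "ev", "ch", "vi", "ue", "ca", "an", "mo", "ap",
# each holding a coded ARQ response in {1, 2, 3, 4, 5}.

def iso_pleasantness(r: dict) -> float:
    """Equation (1): normalized ISO Pleasantness of a single ARQ response."""
    return (np.sqrt(2) * (r["pl"] - r["an"]) + (r["ca"] - r["ch"]) + (r["vi"] - r["mo"])) / K

def iso_eventfulness(r: dict) -> float:
    """Equation (2): normalized ISO Eventfulness of a single ARQ response."""
    return (np.sqrt(2) * (r["ev"] - r["ue"]) + (r["ch"] - r["ca"]) + (r["vi"] - r["mo"])) / K

def pre_post_mad(pre: dict, post: dict) -> float:
    """Check 1: MAD between responses to the identical pre-/post-experiment stimuli."""
    return float(np.mean([abs(pre[k] - post[k]) for k in pre]))

def opposite_pair_mad(responses: list, p: str, q: str) -> float:
    """Checks 2-5: MAD between r_p and (6 - r_q) for an opposite pair (p, q),
    e.g. ("pl", "an"), over all stimuli presented to a participant."""
    return float(np.mean([abs(r[p] - (6 - r[q])) for r in responses]))

def axis_mse(responses: list, attr: str, axis_fn) -> float:
    """Checks 6-7: MSE between r_attr and the rescaled axis value (3 + 2 * axis),
    with axis_fn being iso_pleasantness (attr "pl") or iso_eventfulness (attr "ev")."""
    return float(np.mean([(r[attr] - (3 + 2 * axis_fn(r))) ** 2 for r in responses]))
```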

4 Data Analysis

After the data collection described in Section 3 had been completed, we performed analyses to verify data quality and empirical consistency with known literature.

Firstly, the ARQ responses were analyzed to compare the effect of different maskers on the normalized ISO Pleasantness $P$ of the augmented soundscapes, and validate that the stimuli in the ARAUS dataset spanned the perceptual space generated by the ISO Pleasantness and ISO Eventfulness axes. Statistical and internal reliability tests were then performed on the PIQ responses and consistency check metrics to ensure that the distribution of data and responses in each fold of the cross-validation set did not significantly differ. This allowed us to assess the degree to which the methodological efforts to minimize domain shift between folds were successful.

4.1 Affective Response Questionnaire

To investigate the effect each masker had on the ISO Pleasantness of the urban soundscapes it was added to, we first used the ARQ responses to compute the ISO Pleasantness values for all the stimuli in the ARAUS dataset using Equation (1). The difference in ISO Pleasantness between each base urban soundscape (i.e., augmented with silence) and the same urban soundscape augmented with each masker was then computed. These differences were averaged across all soundscapes to which the same masker, regardless of the SMR, was added, to obtain the mean change in ISO Pleasantness effected by each masker across different soundscapes. This allowed us to determine which maskers were optimal for augmentation, at least on average in a naive sense across the urban soundscapes in the ARAUS dataset.
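A sketch of this aggregation in pandas could look as follows, assuming a hypothetical responses table with columns `soundscape`, `masker`, and `P` (the ISO Pleasantness computed from each response), where the no-masker condition is labeled `"silence"`.

```python
import pandas as pd

def mean_pleasantness_change(responses: pd.DataFrame) -> pd.Series:
    """Mean change in ISO Pleasantness per masker, relative to the silent
    (unaugmented) condition of the same base soundscape."""
    # Mean P per (soundscape, masker) combination, pooled over SMRs/participants.
    per_combo = responses.groupby(["soundscape", "masker"])["P"].mean()
    # Baseline: the same soundscape augmented with the "silence" masker.
    baseline = per_combo.xs("silence", level="masker")
    delta = per_combo - baseline.reindex(
        per_combo.index.get_level_values("soundscape")).to_numpy()
    # Average the per-soundscape differences over soundscapes for each masker.
    return delta.groupby(level="masker").mean().drop(labels="silence")
```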

Figure 5 shows the mean change in ISO Pleasantness value as a function of the masker used to augment the soundscape, aggregated over soundscapes and SMRs used. Out of the 287 maskers in the ARAUS dataset, mean positive changes in ISO Pleasantness were only observed for maskers belonging to the bird and water classes. However, not all of the bird and water maskers showed mean positive changes, corroborating the findings by [92] that not all natural sounds are necessarily perceived as pleasant. Nevertheless, augmentation with 64 (78.0%) of the bird maskers and 16 (19.5%) of the water maskers, each out of 82, resulted in effective mean positive changes in ISO Pleasantness, which supports findings in [14, 54, 70] where specific bird and water sounds were found to have improved the perceived pleasantness of urban soundscapes.

Figure 5: Mean change in ISO Pleasantness value as a function of each of the 287 (280 cross-validation, 7 test set) maskers used to augment soundscapes in the ARAUS dataset, aggregated over soundscapes and SMRs used.

In contrast, mean negative changes in ISO Pleasantness were observed for all wind, traffic, and construction maskers in the ARAUS dataset. While the decrease in ISO Pleasantness due to the addition of traffic [96] and construction [97] maskers is expected, the decrease in ISO Pleasantness for all wind sounds used as maskers was contrary to the results of narrative interviews reported by [92], in which people tended to perceive wind rustling through trees as pleasant. This could be due to the range of SMRs used for the ARAUS dataset, which ranged only from $-6$ to $+6$ dB. The mean in-situ $L_{A,\text{eq}}$ of the tracks in the USotW database used for the ARAUS dataset was about 65 dB, but natural wind sounds in real-life environments tend to have a mean $L_{A,\text{eq}}$ of about 55 dB or less [98], which means that an SMR of $+10$ dB or higher may have been more appropriate to achieve an increase in perceived pleasantness for the wind maskers. At the SMRs used for the ARAUS dataset, the wind maskers may have been added at overly high levels, causing them to be perceived similarly to the traffic maskers due to their similar spectral characteristics and potentially contributing to the decrease in ISO Pleasantness. Additionally, the laboratory-based nature of the data collection process meant that the wind maskers were heard without the natural movements of air that would be present in an in-situ study, which could have created an artificial or unpleasant situation for the participants.

To investigate how well the ARAUS dataset stimuli and the base urban soundscapes from the USotW database spanned the perceptual space generated by the “Pleasantness–Eventfulness” axes, we first computed the normalized values of $P$ and $E$ according to Equations (1) and (2) for each individual ARQ response in the ARAUS dataset. Then, we plotted a heat map and scatter plot of the responses on the $P$–$E$ axes, as shown in Fig. 6. We can see that both the ARAUS dataset stimuli and the USotW soundscapes covered the positive, neutral, and negative regions of both the ISO Pleasantness and the ISO Eventfulness axes, thereby validating their use in analysis related to the ISO 12913 standard.

Figure 6: Heat map (grayscale boxes) and scatter plot (yellow points) of ARQ responses in the “ISO Pleasantness” and “ISO Eventfulness” axes to (top) ARAUS dataset stimuli, (bottom) USotW soundscapes in ARAUS dataset. The brightness of grayscale boxes denotes the proportion of responses belonging to that region.

4.2 Participant Information Questionnaire

Since items in the PIQ such as the participant’s age [99] and education status [100] can affect the affective perception of soundscapes, the participants in each fold of the cross-validation set of the ARAUS dataset should ideally be drawn from distributions with similar demographics. This was not enforced during the participant recruitment and fold allocation process, so a post hoc analysis of the PIQ responses, via standard statistical tests for equality of distributions, was performed to assess the demographic distributions.

For items in the PIQ coded as categorical variables, $\chi^2$-tests were conducted between the responses obtained in each fold of the cross-validation set (treated as observed frequencies) and those obtained in the entire cross-validation set (treated as expected frequencies). No significant differences were observed, at the 5% significance level, in the distribution of all categorical variables between folds, with the lowest $p$-value being 0.1070 for the participants’ gender.

For items in the PIQ coded as continuous variables, Kruskal-Wallis tests were conducted by treating each fold as an independent group. The Kruskal-Wallis tests showed no significant differences, at the 5% significance level, in the distribution of all continuous variables between folds, with the lowest $p$-value being 0.3090 for the extent to which participants were annoyed by noise over the past 12 months.
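
The sketch below illustrates these two families of tests with SciPy on hypothetical per-fold responses; the counts and ratings are placeholders, not the actual ARAUS PIQ data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Categorical PIQ item (e.g., a demographic category with three levels):
# observed counts in one fold vs. expected counts scaled down from the
# entire cross-validation set (chi-squared goodness-of-fit test).
fold_counts = np.array([40, 55, 25])          # one fold of 120 participants
overall_counts = np.array([210, 260, 130])    # all five folds combined
expected = overall_counts / overall_counts.sum() * fold_counts.sum()
chi2, p_cat = stats.chisquare(f_obs=fold_counts, f_exp=expected)

# Continuous PIQ item (e.g., an annoyance rating): Kruskal-Wallis H-test
# treating each of the five folds as an independent group.
folds = [rng.normal(loc=3.0, scale=1.0, size=120) for _ in range(5)]
h_stat, p_cont = stats.kruskal(*folds)

print(f"chi-squared p = {p_cat:.4f}, Kruskal-Wallis p = {p_cont:.4f}")
```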

Together, these results indicate that the distribution of participants in a given fold did not significantly differ from the participants in any other fold, which reinforces the validity of using the five-fold cross-validation set of the ARAUS dataset to generate models for populations sharing similar characteristics to that of the entirety of the cross-validation set. Full results of the statistical tests and explicit distributions of responses by fold and PIQ item are detailed in Appendix C.

For the psychological questionnaires used in the PIQ (WNSS-10, PSS-10, WHO-5, and PANAS), Cronbach’s $\alpha$ [101] and McDonald’s $\omega$ [102] were computed as standard internal reliability coefficients to independently verify whether their items in aggregate were indeed measuring the same construct. This served as a validation study for the WNSS-10 and a replication study for the other questionnaires.

The internal reliability study results are shown in Table IV. All values of the internal reliability coefficients are above 0.8, which can be considered “high” given the number of items in each questionnaire [103]. Therefore, all items in each questionnaire indeed measured the same construct. In particular, the Cronbach’s $\alpha$ value of 0.835 for the WNSS-10 parallels similar validation studies for different iterations of the original 21-item version of the questionnaire [104], thereby confirming the reliability of the previously unvalidated version used for the ARAUS dataset.

TABLE IV: Reliability metrics for psychological questionnaires in the PIQ. Larger values are desirable, as denoted by the up arrow (↑).
Questionnaire Cronbach’s α (↑) McDonald’s ω (↑)
WNSS-10 0.835 0.837
PSS-10 0.875 0.874
WHO-5 0.854 0.857
PANAS (Positive) 0.886 0.891
PANAS (Negative) 0.891 0.891
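
For reference, Cronbach’s α can be computed directly from a participants-by-items response matrix using its standard formula; the sketch below uses synthetic data and is not tied to the actual ARAUS questionnaire responses.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) score matrix."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Synthetic example: 600 participants answering a 10-item questionnaire whose
# items share a common underlying construct plus independent noise.
rng = np.random.default_rng(0)
construct = rng.normal(size=(600, 1))
responses = construct + 0.8 * rng.normal(size=(600, 10))
print(f"Cronbach's alpha = {cronbach_alpha(responses):.3f}")
```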

4.3 Consistency Checks

In a similar manner to the PIQ, Kruskal-Wallis tests for each of the seven single-value consistency metrics were performed by treating the ARQ responses in each fold as independent groups. Full results are presented in Appendix C.

No significant differences were observed, at the 5% significance level, in the distribution of all continuous variables between folds, with the lowest $p$-value being 0.2545 for the MAD between “vibrant” and reversed “monotonous” ratings. This indicates that the distribution of response consistency in a given fold did not significantly differ from that in any other fold, reinforcing the validity of using the five-fold cross-validation set of the ARAUS dataset to generate generalizable and unbiased models, at least from the perspective of consistency with the ISO 12913-3:2019 circumplex model of soundscape perception.

5 Affective Response Modeling

To illustrate how the ARAUS dataset can be used for systematic and fair benchmarking of models for affective perception, four models were trained to predict the ISO Pleasantness, as defined in Equation (1), of a given augmented soundscape using the ARAUS dataset. The models were a relatively low-parameter regression model, a convolutional neural network (CNN), and two different probabilistic perceptual attribute predictor (PPAP) models [105, 106]. A dummy model that always predicts the mean of the training data labels was additionally used as a naive benchmark.

Using a five-fold cross-validation scheme, each model was trained five times, each time with a different fold of the cross-validation set used as the validation set (5040 samples) and the remaining four folds used as the training set (20,160 samples). For models sensitive to random initialization, results from 10 different seeds per validation fold were recorded, totalling 50 runs. After training, each model was evaluated on the independent test set (48 samples).
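
A minimal sketch of this evaluation protocol is shown below; `train_model` and `evaluate_mse` are hypothetical stand-ins for any of the baseline models, and the fold handling simply follows the five-fold layout described above.

```python
def run_cross_validation(samples_by_fold, test_samples,
                         train_model, evaluate_mse, n_seeds=10):
    """Train on four folds, validate on the held-out fold, then evaluate on
    the independent test set, repeating over seeds for seed-sensitive models."""
    results = []
    for val_fold in range(5):
        train = [s for f in range(5) if f != val_fold for s in samples_by_fold[f]]
        val = samples_by_fold[val_fold]
        for seed in range(n_seeds):
            model = train_model(train, val, seed=seed)
            results.append({
                "val_fold": val_fold,
                "seed": seed,
                "val_mse": evaluate_mse(model, val),
                "test_mse": evaluate_mse(model, test_samples),
            })
    return results
```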

5.1 Models

5.1.1 Regularized Linear Regression

The linear regression models used acoustic and psychoacoustic indicators of an augmented soundscape as input features to predict its ISO Pleasantness. Firstly, the statistics detailed in Section 3.3.2 were computed for each binaural augmented soundscape in the ARAUS dataset, which gave 264 possible input features to be used for regression. The features, without prior dimensionality reduction, were used to train elastic net models [107] with $L_1$ and $L_2$ regularization weights of 0.5 and 0.25, respectively.

Elastic net models are designed for parameter sparsity, since the elastic net loss function causes most weights to be set to zero after training concludes. This indirectly allows the method to automatically choose suitable parameters for regression from an initially larger selection, and makes it more computationally efficient than stepwise regression.
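
The sketch below shows one way to fit such a model with scikit-learn; mapping the L1 and L2 weights of 0.5 and 0.25 onto scikit-learn’s (alpha, l1_ratio) parametrization is our own assumption here, since that library expresses the penalty as alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Synthetic stand-in for the 264 acoustic/psychoacoustic features and the
# normalized ISO Pleasantness labels of the training-set stimuli.
X = rng.normal(size=(1000, 264))
y = rng.uniform(-1.0, 1.0, size=1000)

# With alpha = 1.0 and l1_ratio = 0.5, the L1 term is weighted by 0.5 and the
# squared-L2 term by 0.25, matching the weights reported in the text.
model = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000)
model.fit(X, y)

# Sparsity: most coefficients are driven to exactly zero.
n_nonzero = np.count_nonzero(model.coef_)
print(f"{n_nonzero} of {X.shape[1]} coefficients are non-zero")
```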

5.1.2 Convolutional Neural Network

The CNN models were trained to predict the ISO Pleasantness of an augmented soundscape using its log-mel spectrogram. Firstly, the channel-wise log-mel spectrograms were computed for each binaural augmented soundscape in the ARAUS dataset, with a Hann window length of 4096 samples, 50% overlap between windows, and 64 mel frequency bands from 20 Hz to 20 kHz.
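
A minimal sketch of this feature extraction using librosa is shown below, assuming 30-second binaural stimuli sampled at 44.1 kHz; the exact framing options (e.g., center padding) in the authors’ replication code may differ.

```python
import numpy as np
import librosa

SR = 44100            # assumed sampling rate of the stimuli
N_FFT = 4096          # Hann window length of 4096 samples
HOP = N_FFT // 2      # 50% overlap between windows

def logmel(audio_stereo: np.ndarray) -> np.ndarray:
    """Channel-wise log-mel spectrogram for a (2, n_samples) binaural signal,
    returned as a (frames, 64, 2) tensor."""
    channels = []
    for ch in audio_stereo:
        mel = librosa.feature.melspectrogram(
            y=ch, sr=SR, n_fft=N_FFT, hop_length=HOP, window="hann",
            center=False, n_mels=64, fmin=20.0, fmax=20000.0)
        channels.append(librosa.power_to_db(mel))            # (64, frames)
    return np.stack(channels, axis=-1).transpose(1, 0, 2)    # (frames, 64, 2)

# Example with 30 s of silence as a placeholder for an augmented soundscape.
x = np.zeros((2, 30 * SR))
print(logmel(x).shape)   # approximately (644, 64, 2) with these settings
```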

The log-mel spectrograms (as 644-by-64-by-2 tensors) were then used as input to the CNN models with the model architecture shown in Fig. 7. The architecture is identical to the baseline model architecture used for the acoustic scene classification task (Task 1B) of the DCASE 2020 Challenge [41]. However, due to the slightly larger input dimensions of the ARAUS data, the modified models shown in Fig. 7 contained about 142K parameters, compared to the original 116K in [41], with the only difference being the input dimensions of the final dense layer.

Figure 7: Baseline CNN model architecture, adapted from [41].

The CNN models were trained over 100 epochs with an Adam optimizer [108] with a learning rate of $1\times10^{-4}$ and a batch size of 32. Training was stopped early if the validation set MSE did not improve for 10 consecutive epochs. This process was repeated for 10 runs per validation fold to obtain the mean performance of the models.
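
A sketch of this training configuration in Keras is given below; `build_cnn` is a hypothetical stand-in for the Fig. 7 architecture, which is not reproduced here, and the use of Keras itself is an assumption rather than a statement about the authors’ replication code.

```python
import tensorflow as tf

def train_cnn(build_cnn, train_ds, val_ds):
    """Train an ISO Pleasantness regressor with the hyperparameters described
    in the text: Adam optimizer, learning rate 1e-4, MSE loss, up to 100
    epochs, and early stopping after 10 epochs without validation improvement.
    The batch size of 32 is assumed to be set when building train_ds/val_ds."""
    model = build_cnn(input_shape=(644, 64, 2))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="mse")
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True)
    model.fit(train_ds, validation_data=val_ds,
              epochs=100, callbacks=[early_stop])
    return model
```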

5.1.3 Probabilistic Perceptual Attribute Predictor

The PPAP models were used to model the subjectivity in ARQ responses by outputting predictions for ISO Pleasantness based on probability distributions rather than deterministic values. As described in [105, 106], we trained the PPAP models using the normal distribution $N(\mu, \sigma^2)$, with the loss function being the log-probability of the ground-truth response being observed, given the output distribution parameters $\mu$ and $\sigma$.
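
The sketch below shows a generic probabilistic output head and a Gaussian log-likelihood-based training objective of the kind described here; it is a simplified stand-in, not the PPAP architecture of [105, 106].

```python
import tensorflow as tf

def gaussian_nll(y_true, y_pred):
    """Negative log-likelihood of y_true under N(mu, sigma^2), up to an
    additive constant. The model outputs mu and log(sigma^2) as a
    two-column tensor; y_true has shape (batch, 1)."""
    mu, log_var = y_pred[:, 0], y_pred[:, 1]
    return tf.reduce_mean(
        0.5 * (log_var + tf.square(y_true[:, 0] - mu) * tf.exp(-log_var)))

def probabilistic_head(features: tf.Tensor) -> tf.Tensor:
    """Map a feature vector to the parameters (mu, log sigma^2) of a normal
    distribution over ISO Pleasantness."""
    return tf.keras.layers.Dense(2)(features)
```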

Using the ARAUS dataset responses, we trained two different variations of the PPAP models: one performing augmentation in the time domain (i.e., taking in log-mel spectrograms of the augmented soundscapes as initial input) [105], and one performing augmentation in the feature domain (i.e., taking in log-mel spectrograms of base urban soundscapes and maskers separately as initial inputs, and subsequently performing augmentation on features extracted from them) [106], both with multi-head attention in the feature mapping blocks. The parameters for the computation of the log-mel spectrograms and the training procedure for the PPAP models were similar to those of the CNN models.

5.2 Results and Discussion

TABLE V: Mean squared error values (± standard deviation over 10 runs where applicable) for models described in Section 5. Smaller values are desirable, as denoted by the down arrow (↓).
Mean squared error (↓)
Model Params Train Validation Test
Dummy (mean of labels) 0.1551 0.1553 0.1192
Elastic net 4 0.1351 0.1357 0.0930
CNN based on DCASE 2020 Task 1B baseline [41] 142K 0.1152 ± 0.0020 0.1212 ± 0.0010 0.0865 ± 0.0063
Multi-head attention PPAP, additive time-domain augmentation [105] 123K 0.1098 ± 0.0047 0.1216 ± 0.0016 0.0889 ± 0.0064
Multi-head attention PPAP, additive feature-domain augmentation [106] 515K 0.1086 ± 0.0036 0.1238 ± 0.0022 0.0838 ± 0.0103

The label means for the normalized ISO Pleasantness $P$ were 0.0208, 0.0317, 0.0307, 0.0281, and 0.0225 when folds 1, 2, 3, 4, and 5, respectively, were left out as the validation fold for the derivation of the dummy models. Hence, these were also the exact predictions given by all dummy models regardless of the soundscape presented. Considering that $P \in [-1, 1]$, the difference in label means by fold is also insubstantial, since they are within 0.5% of the full range. In fact, the distributions of the training set labels themselves show no significant difference by fold, as shown in Fig. 8.

Figure 8: Violin plots of distributions of normalized ISO Pleasantness (ground-truth labels) in training set, by fold left out as validation fold.

After training the elastic net models, the weights for all but three input features were set to zero regardless of the fold used as the validation set. Taking the mean of weights for the regression models across the 5-fold cross-validation set, this gave the general regression model

$$\widehat{P} = \begin{bmatrix} +0.3035 \\ -7.8731\cdot 10^{-3} \\ -0.4512\cdot 10^{-3} \\ -1.7326\cdot 10^{-3} \end{bmatrix}^{\mathsf{T}} \begin{bmatrix} 1 \\ N_{\text{max}} \\ M_{10\,\text{kHz}} \\ M_{12.5\,\text{kHz}} \end{bmatrix}, \qquad (3)$$

where $N_{\text{max}}$ is the maximum loudness (in sone), $M_f$ is the power (in dB) at the one-third octave band with center frequency $f$, the prediction for the normalized ISO Pleasantness value $P$ is denoted as $\widehat{P}$, and the regression coefficients are not normalized.

This implies that, of the 264 acoustic and psychoacoustic parameters used as input features to train the regression models, the features that most significantly impacted the ISO Pleasantness were $N_{\text{max}}$, $M_{10\,\text{kHz}}$, and $M_{12.5\,\text{kHz}}$, at least based on the responses in the ARAUS dataset. Incidentally, for the stimuli in the ARAUS dataset, we observed that the ranges of these features were $N_{\text{max}} \in [8.36, 94.1]$, $M_{10\,\text{kHz}} \in [29.62, 85.15]$, and $M_{12.5\,\text{kHz}} \in [32.51, 78.67]$, which meant that the predictions had a range of $\widehat{P} \in [-0.6121, 0.1680]$. The negative coefficients of the features in Equation (3) also suggest that increasing the peak loudness and high-frequency components of an augmented soundscape correlates with a decrease in its perceived pleasantness, corroborating the respective findings by [109] and [110] regarding general acoustic environments.
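
For illustration, Equation (3) can be evaluated directly; the sketch below applies the reported mean coefficients to the two extreme feature combinations observed in the dataset, reproducing the stated prediction range.

```python
import numpy as np

# Mean elastic net coefficients from Equation (3):
# intercept, N_max (sone), M_10kHz (dB), M_12.5kHz (dB).
w = np.array([0.3035, -7.8731e-3, -0.4512e-3, -1.7326e-3])

def predict_pleasantness(n_max: float, m_10k: float, m_12_5k: float) -> float:
    """Predicted normalized ISO Pleasantness from Equation (3)."""
    return float(w @ np.array([1.0, n_max, m_10k, m_12_5k]))

# Extremes of the observed feature ranges in the ARAUS stimuli.
print(predict_pleasantness(94.1, 85.15, 78.67))   # approx. -0.61
print(predict_pleasantness(8.36, 29.62, 32.51))   # approx. +0.17
```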

Lastly, Table V compares the performance of the four baseline models and the dummy model with respect to the MSEs achieved across their training, validation, and test sets in the 5-fold cross-validation scheme used to train them. All baseline models performed better than the dummy model on the training, validation, and test sets, which suggests that the features learned by the baseline models to make ISO Pleasantness predictions were indeed meaningful.

In addition, the CNN and PPAP models performed significantly better than the elastic net models, with the lowest mean training, validation, and test set MSEs of 0.1086, 0.1212, and 0.0838 observed for the PPAP performing feature-domain augmentation [106], the CNN [41], and the PPAP performing feature-domain augmentation [106], respectively. This is expected due to the vast difference in the numbers of parameters used by the CNN and PPAP models (142K to 515K) for prediction, in comparison to the regression models (4 parameters) and the dummy model. Nevertheless, none of the models overfitted to the dataset, as can be seen from the relatively similar values of the training and validation set MSEs across all models, which demonstrates the versatility of the ARAUS dataset for use in training generalizable high-parameter models.

Notably, the test set MSE values are also smaller than those of the validation sets, which could be due to the small size of the test set (48 samples) compared to the validation sets (5040 samples each) making the test set “easier”, from the perspective of the prediction models, than their respective validation sets.

6 Limitations and Future Work

While the ARAUS dataset utilized recordings of in-situ urban soundscapes as part of its stimuli, the laboratory-based data collection process means that models trained using the ARAUS dataset may require confirmation of their ecological validity with follow-up in-situ experiments or soundwalks. The range of SMRs used for the generation of the augmented soundscapes in the ARAUS dataset ($-6$ to $+6$ dB) is also a potential limitation, since real-life or virtual sound sources used for augmentation could in general have relative differences outside of this range. Future data collection related to the ARAUS dataset could therefore consider widening the range of SMRs used to generate stimuli, which can be done by changing input parameter settings in the provided replication code. Not all combinations of urban soundscapes, maskers, and SMRs were exhaustively generated for the current iteration of the ARAUS dataset, so responses could also be collected exhaustively in the future to make the dataset fully comprehensive. Responses could alternatively be collected continuously over time, instead of just once after the presentation of each stimulus, in order to study the consistency of participants’ affective responses over time. The visual content of the stimuli could also be varied, such as by using video recordings captured at completely different locations for the same audio recording, in order to investigate changes in affective responses due purely to changes in the visual content.

Moreover, since the participants in the ARAUS dataset were mostly young university students, the results and analysis obtained may not necessarily translate equivalently to a more general population of people exposed to urban soundscapes. The inclusion of only five participants in the test set, as compared to 120 in each fold of the cross-validation set, is also a primary limitation of the dataset. At the scale of the test set, person-to-person differences in perception could dominate any other factor contributing to perception, thereby causing the test set to serve more as a small focus group of participants rather than a general sample representative of the same population as the cross-validation set. Hence, future extensions to the ARAUS dataset could concentrate on enlarging the test set and obtaining responses from participants from an older demographic, to allow for improved benchmarking of models trained using the dataset.

The models described in this article were also trained using the individual augmented soundscapes in the ARAUS dataset in isolation from each other, so they did not account for the effects of temporal successions of different maskers, and hence different augmented soundscapes. Naïvely applying them to choose a time series of optimal maskers for an extended duration of time could lead to a dissonant succession of maskers that may inadvertently result in an unpleasant augmented soundscape overall, despite the individual maskers being predicted as optimal for individual time windows. Further work on these models could thus look into generalizing them to extended durations of time. The models could also be fine-tuned via transfer learning methods on the affective soundscape datasets in Table I (which possess labels of a different nature), and be compared with other benchmarks to assess the amenability of the ARAUS dataset to transfer learning.

Lastly, perceptual indicators other than those defined in ISO/TS 12913-2:2018, such as perceived loudness or tranquility, could also be used as part of the ARQ to expand its scope and allow the ARAUS dataset to possess greater generalizability.

7 Conclusion

In conclusion, we presented the Affective Responses to Augmented Urban Soundscapes (ARAUS) dataset, which functions as a benchmark dataset for comparisons of prediction models for the affective perception of urban and augmented soundscapes. We first presented a systematic methodology for the collection of data, which can be replicated or extended upon. Subsequently, we analyzed the responses obtained for the ARAUS dataset, and provided benchmark models for predicting the perceptual attribute of ISO Pleasantness. To the best of our knowledge, the ARAUS dataset is currently the largest soundscape dataset with perceptual labels, but is not without its inherent limitations as described in Section 6.

Nonetheless, we hope that the ARAUS dataset becomes a beneficial and enduring resource for the soundscape community, by assisting soundscape researchers in developing more accurate, robust models for soundscape perception.

Acknowledgments

We would like to thank Prof. Chen Jer-Ming, Mr. Tan Yi Xian, and Ms. Cindy Lin for assisting with the administrative arrangements and setup of the test site at the Singapore University of Technology and Design.

This research is supported by the Singapore Ministry of National Development and the National Research Foundation, Prime Minister’s Office under the Cities of Tomorrow Research Programme (Award No. COT-V4-2020-1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the view of National Research Foundation, Singapore, and Ministry of National Development, Singapore.

References

  • [1] M. Raimbault, C. Lavandier, and M. Bérengier, “Ambient sound assessment of urban environments: Field studies in two French cities,” Applied Acoustics, vol. 64, no. 12, pp. 1241–1256, 2003.
  • [2] P. Jennings and R. Cain, “A framework for improving urban soundscapes,” Applied Acoustics, vol. 74, no. 2, pp. 293–299, 2013.
  • [3] K. M. De Paiva Vianna, M. R. Alves Cardoso, and R. M. C. Rodrigues, “Noise pollution and annoyance: An urban soundscapes study,” Noise and Health, vol. 17, no. 76, pp. 125–133, 2015.
  • [4] J. Kang, et al., “Towards soundscape indices,” in 23rd International Congress on Acoustics, 2019, pp. 2488–2495.
  • [5] H. M. E. Miedema and H. Vos, “Exposure-response relationships for transportation noise,” The Journal of the Acoustical Society of America, vol. 104, no. 6, pp. 3432–3445, 1998.
  • [6] B. De Coensel and D. Botteldooren, “Models for soundscape perception and their use in planning,” in Proceedings of Inter-Noise 2007, 2007.
  • [7] O. Axelsson, M. E. Nilsson, B. Hellström, and P. Lundén, “A field experiment on the impact of sounds from a jet-and-basin fountain on soundscape quality in an urban park,” Landscape and Urban Planning, vol. 123, pp. 49–60, 2014.
  • [8] Z. Abdalrahman and L. Galbrun, “Audio-visual preferences, perception, and use of water features in open-plan offices,” Journal of the Acoustical Society of America, vol. 147, no. 3, pp. 1661–1672, 2020.
  • [9] J. Y. Hong, et al., “The effects of spatial separations on water sound and traffic noise sources on soundscape assessment,” Building and Environment, vol. 167, no. 106423, 2020.
  • [10] F. Aletta and J. Kang, “Towards an urban vibrancy model: A soundscape approach,” International Journal of Environmental Research and Public Health, vol. 15, no. 8, 2018.
  • [11] International Organization for Standardization, ISO 12913-3:2019 - Acoustics - Soundscape - Part 3: Data analysis.   Geneva, Switzerland: International Organization for Standardization, 2019.
  • [12] T. Van Renterghem, et al., “Interactive soundscape augmentation by natural sounds in a noise polluted urban park,” Landscape and Urban Planning, vol. 194, no. October 2019, p. 103705, 2020.
  • [13] T. Wong, et al., “Deployment of an IoT System for Adaptive In-Situ Soundscape Augmentation,” in Proceedings of Inter-Noise 2022, 2022.
  • [14] T. M. Leung, C. K. Chau, and S. K. Tang, “On the study of effects on different types of natural sounds on the perception of combined sound environment with road traffic noise,” in Proceedings of Inter-Noise 2016, 2016, pp. 1764–1770.
  • [15] J. Y. Hong, et al., “A mixed-reality approach to soundscape assessment of outdoor urban environments augmented with natural sounds,” Building and Environment, vol. 194, no. July 2020, p. 107688, 2021.
  • [16] P. Aumond, A. Can, B. De Coensel, D. Botteldooren, C. Ribeiro, and C. Lavandier, “Modeling soundscape pleasantness using perceptual assessments and acoustic measurements along paths in urban context,” Acta Acustica united with Acustica, vol. 103, no. 3, pp. 430–443, 2017.
  • [17] V. Puyana-Romero, G. Ciaburro, G. Brambilla, C. Garzón, and L. Maffei, “Representation of the soundscape quality in urban areas through colours,” Noise Mapping, vol. 6, no. 1, pp. 8–21, 2019.
  • [18] F. Aletta, et al., “Soundscape assessment: Towards a validated translation of perceptual attributes in different languages,” in Proceedings of Inter-Noise 2020, 2020.
  • [19] M. Lionello, F. Aletta, and J. Kang, “A systematic review of prediction models for the experience of urban soundscapes,” Applied Acoustics, vol. 170, p. 107479, 2020.
  • [20] International Organization for Standardization, ISO 12913-2 Acoustics - Soundscape - Part 2: Data collection and reporting requirements.   Geneva, Switzerland: International Organization for Standardization, 2018.
  • [21] S. Wang, T. Heittola, A. Mesaros, and T. Virtanen, “Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions,” in Proceedings of DCASE 2021 Workshop, 2021, pp. 45–49.
  • [22] I. Martín-Morató, T. Heittola, A. Mesaros, and T. Virtanen, “Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems,” in Proceedings of DCASE 2021 Workshop, 2021, pp. 85–89.
  • [23] A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, “A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection,” in Proceedings of DCASE 2021 Workshop, 2021, pp. 125–129.
  • [24] Y. Kawaguchi, et al., “Description and Discussion on DCASE 2021 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions,” in Proceedings of DCASE 2021 Workshop, 2021, pp. 186–190.
  • [25] J. F. Gemmeke, et al., “Audio Set: An ontology and human-labeled dataset for audio events,” in Proceedings of IEEE ICASSP 2017, 2017, pp. 776–780.
  • [26] S. Hershey, et al., “The benefit of temporally-strong labels in audio event classification,” in Proceedings of IEEE ICASSP 2021, 2021, pp. 366–370.
  • [27] ——, “AudioSet: Temporally-Strong Labels Download (May 2021),” 2021. [Online]. Available: https://research.google.com/audioset/download_strong.html
  • [28] ——, “CNN architectures for large-scale audio classification,” in Proceedings of IEEE ICASSP 2017, 2017, pp. 131–135.
  • [29] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 2014 ACM Multimedia Conference, 2014, pp. 1041–1044.
  • [30] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 2015 ACM Multimedia Conference, 2015, pp. 1015–1018.
  • [31] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An Open Dataset of Human-Labeled Sound Events,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 30, pp. 829–852, 2022.
  • [32] F. Font, G. Roma, and X. Serra, “Freesound Technical Demo,” in Proceedings of the 2013 ACM Multimedia Conference, 2013, pp. 411–412.
  • [33] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, “Scaper: a library for soundscape synthesis and augmentation,” 2017, pp. 344–348.
  • [34] B. De Coensel, K. Sun, and D. Botteldooren, “Urban Soundscapes of the World: Selection and reproduction of urban acoustic environments with soundscape in mind,” in Proceedings of Inter-Noise 2017, 2017.
  • [35] M. Ciufo and D. Thomas, “EigenScape: A Database of Spatial Acoustic Scene Recordings,” Applied Sciences, vol. 7, 2017.
  • [36] M. Cartwright, et al., “SONYC Urban Sound Tagging (SONYC-UST): A Multilabel Dataset from an Urban Acoustic Sensor Network,” in Proceedings of DCASE 2019 Workshop, 2019.
  • [37] ——, “SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context,” in Proceedings of DCASE 2020 Workshop, 2020, pp. 16–20.
  • [38] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in Proceedings of EUSIPCO 2016, 2016, pp. 1128–1132.
  • [39] A. Mesaros, et al., “DCASE 2017 challenge setup: tasks, datasets and baseline system,” in Proceedings of DCASE 2017 Workshop, 2017.
  • [40] A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in Proceedings of DCASE 2018 Workshop, 2018.
  • [41] T. Heittola, A. Mesaros, and T. Virtanen, “Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions,” in Proceedings of DCASE 2020 Workshop, 2020.
  • [42] S. Wang, A. Mesaros, T. Heittola, and T. Virtanen, “A curated dataset of urban scenes for audio-visual scene analysis,” in Proceedings of IEEE ICASSP 2021, 2021, pp. 626–630.
  • [43] K. Ooi, et al., “A Strongly-Labelled Polyphonic Dataset of Urban Sounds with Spatiotemporal Context,” in Proceedings of APSIPA ASC 2021, 2021.
  • [44] A. Weisser, et al., “The Ambisonic Recordings of Typical Environments (ARTE) database,” Acta Acustica united with Acustica, vol. 105, no. 4, pp. 695–713, 2019.
  • [45] A. Politis, et al., “STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6387880
  • [46] M. M. Bradley and P. J. Lang, “The International Affective Digitized Sounds (2nd Edition; IADS-2): Affective ratings of sounds and instruction manual. Technical report B-3.” University of Florida, Gainesville, Fl., Tech. Rep., 2007.
  • [47] R. A. Stevenson and T. W. James, “Affective auditory stimuli: Characterization of the International Affective Digitized Sounds (IADS) by discrete emotional categories,” Behavior Research Methods, vol. 40, no. 1, pp. 315–321, 2008.
  • [48] W. Yang, et al., “Affective auditory stimulus database: An expanded version of the International Affective Digitized Sounds (IADS-E),” Behavior Research Methods, vol. 50, no. 4, pp. 1415–1429, 2018.
  • [49] J. Fan, M. Thorogood, and P. Pasquier, “Emo-soundscapes: A dataset for soundscape emotion recognition,” in Proceedings of ACII 2017, 2017, pp. 196–201.
  • [50] M. M. Bradley and P. J. Lang, “Measuring emotion: The self-assessment manikin and the semantic differential,” Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
  • [51] T. Giannakopoulos, M. Orfanidi, and S. Perantonis, “Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition,” in Proceedings of 25th International Conference on Multimedia Modeling, 2019, pp. 338–348.
  • [52] A. Mitchell, et al., “The International Soundscape Database: An integrated multimedia database of urban soundscape surveys – questionnaires with acoustical and contextual information,” 2021. [Online]. Available: https://doi.org/10.5281/Zenodo.5914762
  • [53] ——, “The Soundscape Indices (SSID) Protocol : A Method for Urban Soundscape Surveys — Questionnaires with Acoustical and Contextual Information,” Applied Sciences, vol. 10, no. 2397, pp. 1–27, 2020.
  • [54] Y. Hao, J. Kang, and H. Wortche, “Assessment of the masking effects of birdsong on the road traffic noise environment,” Journal of the Acoustical Society of America, vol. 140, no. 2, pp. 978–987, 2016.
  • [55] T. M. Leung, C. K. Chau, S. K. Tang, and J. M. Xu, “Developing a multivariate model for predicting the noise annoyance responses due to combined water sound and road traffic noise exposure,” Applied Acoustics, vol. 127, pp. 284–291, 2017.
  • [56] F. Aletta, T. Oberman, A. Mitchell, H. Tong, and J. Kang, “Assessing the changing urban sound environment during the COVID-19 lockdown period using short-term acoustic measurements,” Noise Mapping, vol. 7, no. 1, pp. 123–134, 2020.
  • [57] M. E. Lutman, “What is the risk of noise-induced hearing loss at 80, 85, 90 dB(A) and above?” Occupational Medicine, vol. 50, no. 4, pp. 274–275, 2000.
  • [58] B. Planqué and W.-P. Vellinga, “Xeno-canto: a 21st century way to appreciate Neotropical bird song,” Neotropical Birding, vol. 3, no. January, pp. 17–23, 2008.
  • [59] J. Y. Jeon, P. J. Lee, J. You, and J. Kang, “Acoustical characteristics of water sounds for soundscape enhancement in urban open spaces,” Journal of the Acoustical Society of America, vol. 131, no. 3, pp. 2101–2109, 2012.
  • [60] L. Galbrun and T. T. Ali, “Acoustical and perceptual assessment of water sounds and their use over road traffic noise,” Journal of the Acoustical Society of America, vol. 133, no. 1, pp. 227–237, 2013.
  • [61] B. De Coensel, S. Vanwetswinkel, and D. Botteldooren, “Effects of natural sounds on the perception of road traffic noise,” JASA Express Letters, vol. 129, no. 4, pp. 148–153, 2011.
  • [62] D. M. Ferraro, et al., “The phantom chorus: birdsong boosts human well-being in protected areas,” Proceedings of the Royal Society B, vol. 287, no. 1941, 2020.
  • [63] M. Hedblom, I. Knez, Ode Sang, and B. Gunnarsson, “Evaluation of natural sounds in urban greenery: Potential impact for urban nature preservation,” Royal Society Open Science, vol. 4, no. 2, 2017.
  • [64] World Health Organization Regional Office for Europe, Environmental Noise Guidelines for the European Region.   Copenhagen: The Regional Office for Europe of the World Health Organization, 2018.
  • [65] J. You and J. Y. Jeon, “Sound-masking technique for combined noise exposure in open public spaces,” in Proceedings of ICBEN 2008, 2008.
  • [66] N. Pieretti and A. Farina, “Application of a recently introduced index for acoustic complexity to an avian soundscape with traffic noise,” Journal of the Acoustical Society of America, vol. 134, no. 1, pp. 891–900, 2013.
  • [67] X. Lu, J. Tang, P. Zhu, F. Guo, J. Cai, and H. Zhang, “Spatial variations in pedestrian soundscape evaluation of traffic noise,” Environmental Impact Assessment Review, vol. 83, 2020.
  • [68] C. Xu and J. Kang, “Soundscape evaluation: Binaural or monaural?” Journal of the Acoustical Society of America, vol. 145, no. 5, pp. 3208–3217, 2019.
  • [69] M. S. Engel, A. Fiebig, C. Pfaffenbach, and J. Fels, “A Review of the Use of Psychoacoustic Indicators on Soundscape Studies,” Current Pollution Reports, vol. 7, no. 3, pp. 359–378, 2021.
  • [70] J. Y. Hong, et al., “Effects of adding natural sounds to urban noises on the perceived loudness of noise and soundscape quality,” Science of the Total Environment, vol. 711, 2020.
  • [71] German Institute for Standardization, DIN 45692: Measurement technique for the simulation of the auditory sensation of sharpness.   Beuth Verlag GmbH, 2009.
  • [72] International Organization for Standardization, ISO 532-1: Acoustics - Methods for calculating loudness - Part 1: Zwicker method, Geneva, 2014.
  • [73] H. Fastl and E. Zwicker, Psychoacoustics - Facts and Models, T. S. Huang, M. R. Schroeder, and T. Kohonen, Eds.   Springer, 2001.
  • [74] Ecma International, ECMA-418-2:2020 - Psychoacoustic metrics for ITT equipment - Part 2 (models based on human perception), 1st ed., Geneva, Switzerland, 2020.
  • [75] ——, ECMA-74 - Acoustics - Measurement of airborne noise emitted by information technology and telecommunications equipment, 19th ed., Geneva, Switzerland, 2021.
  • [76] International Organization for Standardization, ISO 1996-1:2016 Acoustics — Description , measurement and assessment of environmental noise — Part 1: Basic quantities and assessment procedures.   Geneva: International Organization for Standardization, 2016.
  • [77] R. San Millán-Castillo, E. Latorre-Iglesias, D. Jiménez-Caminero, J. M. Álvarez-Jimeno, M. Glesser, and S. Wanty, “MOSQITO: An open-source and free toolbox for sound quality metrics in the industry and education,” in Proceedings of Inter-Noise 2021, 2021.
  • [78] C. Flowers, F. M. Le Tourneau, N. Merchant, B. Heidorn, R. Ferriere, and J. Harwood, “Looking for the -scape in the sound: Discriminating soundscapes categories in the Sonoran Desert using indices and clustering,” Ecological Indicators, vol. 127, 2021.
  • [79] T. Kohonen, Self-organizing maps.   Springer-Verlag Berlin Heidelberg, 2001.
  • [80] K. Ooi, Y. Xie, B. Lam, and W. S. Gan, “Automation of binaural headphone audio calibration on an artificial head,” MethodsX, vol. 8, no. February, pp. 1–12, 2021.
  • [81] J. Abeßer, “USM-SED - A Dataset for Polyphonic Sound Event Detection in Urban Sound Monitoring Scenarios,” 2021. [Online]. Available: http://arxiv.org/abs/2105.02592
  • [82] J. J. Walker, L. M. Cleveland, J. L. Davis, and J. S. Seales, “Audiometry screening and interpretation,” American Family Physician, vol. 87, no. 1, pp. 41–47, 2013.
  • [83] G. M. Echevarria Sanchez, T. Van Renterghem, K. Sun, B. De Coensel, and D. Botteldooren, “Using Virtual Reality for assessing the role of noise in the audio-visual design of an urban public space,” Landscape and Urban Planning, vol. 167, pp. 98–107, 2017.
  • [84] M. Wang, Y. Ai, Y. Han, Z. Fan, P. Shi, and H. Wang, “Extended high-frequency audiometry in healthy adults with different age groups,” Journal of Otolaryngology - Head and Neck Surgery, vol. 50, no. 1, pp. 1–6, 2021.
  • [85] M. Andrew, T. Oberman, F. Aletta, M. Kachlicka, M. Lionello, M. Erfanian, and J. Kang, “Investigating urban soundscapes of the COVID-19 lockdown: A predictive soundscape modeling approach,” The Journal of the Acoustical Society of America, vol. 150, no. 6, pp. 4474–4488, 2021.
  • [86] International Organization for Standardization, ISO 12913-1:2014 - Acoustics - Soundscape - Part 1: Definition and conceptual framework.   Geneva, Switzerland: International Organization for Standardization, 2014.
  • [87] N. D. Weinstein, “Individual differences in reactions to noise: A longitudinal study in a college dormitory,” Journal of Applied Psychology, vol. 63, no. 4, pp. 458–466, 1978.
  • [88] S. Cohen, T. Kamarck, and R. Mermelstein, “A Global Measure of Perceived Stress,” Journal of Health and Social Behavior, vol. 24, no. 4, pp. 385–396, 1983.
  • [89] WHO Regional Office for Europe, Wellbeing measures in primary health care, Cophenhagen, Denmark, 1998.
  • [90] D. Watson, L. A. Clark, and A. Tellegan, “Development and Validation of Brief Measures of Positive and Negative Affect: The PANAS Scales,” Journal of Personality and Social Psychology, vol. 54, no. 6, pp. 1063–1070, 1988.
  • [91] F. Aletta, et al., “The relationship between noise sensitivity and soundscape appraisal of care professionals in their work environment: a case study in Nursing Homes in Flanders, Belgium,” in Proceedings of Euro-Noise 2018, 2018.
  • [92] E. Ratcliffe, “Sound and Soundscape in Restorative Natural Environments: A Narrative Literature Review.” Frontiers in Psychology, vol. 12, p. 570563, 2021.
  • [93] M. Masullo, et al., “A questionnaire investigating the emotional salience of sounds,” Applied Acoustics, vol. 182, p. 108281, 2021.
  • [94] S. Cohen and G. Williamson, “Perceived stress in a probability sample of the United States,” The Social Psychology of Health, vol. 13, pp. 31–67, 1988.
  • [95] A. Mitchell, F. Aletta, and J. Kang, “How to analyse and represent quantitative soundscape data,” JASA Express Letters, vol. 2, p. 037201, 2022.
  • [96] R. Guski, D. Schreckenberg, and R. Schuemer, “A systematic review on environmental noise and annoyance,” International Journal of Environmental Research and Public Health, vol. 14, no. 12, pp. 1–39, 2017.
  • [97] A. Hong, B. Kim, and M. Widener, “Noise and the city: Leveraging crowdsourced big data to examine the spatio-temporal relationship between urban development and noise annoyance,” Environment and Planning B: Urban Analytics and City Science, vol. 47, no. 7, pp. 1201–1218, 2020.
  • [98] T. Van Renterghem and D. Botteldooren, “Effect of a row of trees behind noise barriers in wind,” Acta Acustica united with Acustica, vol. 88, no. 6, pp. 869–878, 2002.
  • [99] W. Yang and J. Kang, “Acoustic comfort evaluation in urban open public spaces,” Applied Acoustics, vol. 66, no. 2, pp. 211–229, 2005.
  • [100] X. Fang, et al., “Soundscape Perceptions and Preferences for Different Groups of Users in Urban Recreational Forest Parks,” Forests, vol. 12, no. 4, p. 468, 2021.
  • [101] L. J. Cronbach, “Coefficient alpha and the internal structure of tests,” Psychometrika, vol. 16, no. 3, pp. 297–334, 1951.
  • [102] R. P. McDonald, Test Theory: A Unified Treatment.   Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc., 1999.
  • [103] K. S. Taber, “The Use of Cronbach’s Alpha When Developing and Reporting Research Instruments in Science Education,” Research in Science Education, vol. 48, no. 6, pp. 1273–1296, 2018.
  • [104] D. Worthington, “Weinstein Noise Sensitivity Scale (WNSS),” in The Sourcebook of Listening Research: Methodology and Measures, 1st ed., D. Worthington and G. Bodie, Eds.   John Wiley and Sons Ltd, 2018, pp. 475–481.
  • [105] K. Ooi, K. N. Watcharasupat, B. Lam, Z.-T. Ong, and W.-S. Gan, “Probably Pleasant? A Neural-Probabilistic Approach to Automatic Masker Selection for Urban Soundscape Augmentation,” in Proceedings of IEEE ICASSP 2022, 2022, p. 5.
  • [106] K. N. Watcharasupat, K. Ooi, B. Lam, T. Wong, Z.-T. Ong, and W.-S. Gan, “Autonomous In-Situ Soundscape Augmentation via Joint Selection of Masker and Gain,” pp. 1–5, 2022. [Online]. Available: http://arxiv.org/abs/2204.13883
  • [107] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society. Series B: Statistical Methodology, vol. 67, no. 5, p. 768, 2005.
  • [108] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proceedings of ICLR 2015, 2015, pp. 1–15.
  • [109] K. Ma, C. Mak, and H. Wong, “Effects of environmental sound quality on soundscape preference in a public urban space,” Applied Acoustics, vol. 171, 2021.
  • [110] G. R. Kidd and C. S. Watson, “The perceptual dimensionality of environmental sounds,” Noise Control Engineering Journal, vol. 51, no. 4, pp. 216–231, 2003.
[Uncaptioned image] Kenneth Ooi (S’21) received the B.Sc. (Hons.) degree in mathematical sciences from Nanyang Technological University (NTU), Singapore, in 2019, and received the Lee Kuan Yew Gold Medal as the top graduand of his cohort. He is currently pursuing the Ph.D. degree in electrical engineering in NTU under the supervision of Prof. Woon-Seng Gan. His research interests include deep learning for acoustic scene and event classification, as well as corresponding applications to the field of soundscape research.
[Uncaptioned image] Zhen-Ting Ong received the B.Eng. degree in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore, in 2016. He is currently a Research Engineer at the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. Since 2016, he has published more than 30 works on soundscape engineering in international conferences and journals.
[Uncaptioned image] Karn N. Watcharasupat (S’19-GS’22) received the B.Eng. (Hons.) degree in electrical and electronic engineering, from Nanyang Technological University (NTU), Singapore, in 2022. She was a recipient of the Nanyang Scholarship (CN Yang Scholars Programme) and the Lee Kuan Yew Gold Medal for the Class of 2022. She is currently pursuing a Ph.D. in music technology at the Music Informatics Group, Center for Music Technology (GTCMT), Georgia Institute of Technology, Atlanta, GA, USA, where she was also a visiting student (2020; 2021-2022 remote). At NTU, she was with the Media Technology Laboratory (2018-2020), and later the Digital Signal Processing Laboratory (2020-2022). Her research interests are in signal processing, machine learning, and artificial intelligence for music and audio applications. She is a co-inventor of two patent applications and has published more than 20 papers in international conferences and journals on music information retrieval, soundscapes, spatial audio, speech enhancement, and blind source separation. She currently serves as the Treasurer for Women in Music Technology at Georgia Tech, and was a tech volunteer for the 22nd International Society for Music Information Retrieval Conference (ISMIR). She also serves as a reviewer for the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP; 2022); the EURASIP Journal on Audio, Speech, and Music Processing (2022-); Digital Signal Processing (2022-); and Applied Acoustics (2022-).
[Uncaptioned image] Bhan Lam (S’18-M’19) received the B.Eng. (Hons.) and Ph.D. degrees both from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2013 and 2019, respectively. He was awarded the NTU Research Scholarship and EEE Graduate Award to undertake his PhD under the supervision of Prof. Woon-Seng Gan. In 2015, he was a visiting postgrad in the signal processing and control group at the Institute of Sound and Vibration Research, University of Southampton, UK. He was an invited representative at the 2020 Global Young Scientist Summit and an invited tutorial speaker at APSIPA ASC 2020. He is currently a Research Assistant Professor at the School of Electrical and Electronic Engineering, NTU, Singapore. He has authored more than 60 refereed journal articles and conference papers in the areas of acoustics, soundscape, and signal processing for active control. His work on anti-noise windows has been patented (WO2022055432) and recognised as a top-100 paper in Nature Scientific Reports in 2020. He is currently a guest editor with MDPI Sustainability and served as a special session co-chair in the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). He was appointed by the Singapore Standards Council as the National Mirror Committee Chair of ISO TC43/SC1/WG54 on “Perceptual assessment of soundscape quality”. His current research interests include active noise control, soundscape, and signal processing for active control.
[Uncaptioned image] Joo Young Hong received the B.Sc. degree in Architectural Engineering and Ph.D. degree in Architectural Acoustics from Hanyang University, Seoul, Korea in 2010 and 2015, respectively. He was invited as one of the Korean participants to the Global Young Scientist Summit @one-north 2015, Singapore. He also won the Best Paper prize at the INTER-NOISE conference in San Francisco, USA, 2015, sponsored by the International Institute of Noise Control Engineering. He was also invited as a plenary speaker at the Urban Noise Symposium, 2016, Shanghai to talk about the soundscape design approach. In 2017, he was a recipient of the prestigious Lee Kuan Yew Post-Doctoral Fellowship at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. In 2020, he was an Assistant Professor in Architecture and Sustainable Design at the Singapore University of Technology and Design. Since 2021, he has been an Assistant Professor at the Department of Architectural Engineering, Chungnam National University, Korea. His research interests fall under the umbrella of soundscape research, which is a new paradigm that emphasizes a holistic perspective of perceived acoustic environments in a given context. He has investigated relationships between physical acoustic phenomena and human auditory perception in indoor and outdoor environments through multidisciplinary approaches from acoustics, psychology, architecture, and urban planning. He has been involved as an assistant acoustic consultant in several practical auditorium design projects. He has published 28 peer-reviewed articles in international journals and 2 invited book chapters. He has also served as a technical committee member of Korean Industrial Standards (KS) on building and environmental noises.
[Uncaptioned image] Woon-Seng Gan (S’90–M’93-SM’00) received his B.Eng. (1st Class Hons.) and Ph.D. degrees, both in electrical and electronic engineering from the University of Strathclyde, United Kingdom, in 1989 and 1993 respectively. He is currently a Professor of Audio Engineering and the Director of the Smart Nation Lab in the School of Electrical and Electronic Engineering in Nanyang Technological University (NTU). He also served as the Head of the Information Engineering Division in the School of Electrical and Electronic Engineering in NTU (2011-2014), and the Director of the Centre for Infocomm Technology (2016-2019). His research concerns the connections between the physical world, signal processing and sound control, which has resulted in the practical demonstration and licensing of spatial audio algorithms, directional sound beam, and active noise control for headphones and open windows. Prof. Gan has published more than 400 international refereed journals and conferences, and has translated his research into 6 granted patents. He has co-authored three books on Subband Adaptive Filtering: Theory and Implementation (John Wiley, 2009); Embedded Signal Processing with the Micro Signal Architecture, (Wiley-IEEE, 2007); and Digital Signal Processors: Architectures, Implementations, and Applications (Prentice Hall, 2005). In 2017, he won the APSIPA Sadaoki Furui Prize Paper Award. He is a Fellow of the Audio Engineering Society (AES), a Fellow of the Institute of Engineering and Technology (IET), and a Senior Member of the IEEE. He served as an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP; 2012-15) and was presented with an Outstanding TASLP Editorial Board Service Award in 2016. He also served as the Associate Editor for the IEEE Signal Processing Letters (2015-19). He is currently serving as a Senior Area Editor of the IEEE Signal Processing Letters (2019-); Associate Technical Editor of the Journal of Audio Engineering Society (JAES; 2013-); Editorial member of the Asia Pacific Signal and Information Processing Association (APSIPA; 2011-) Transaction on Signal and Information Processing; Associate Editor of the EURASIP Journal on Audio, Speech and Music Processing (2007-). He served as the Technical Program Chair of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), held in Singapore.

Appendix A Source Files for ARAUS Dataset

This appendix contains details of the exact source files used to generate the augmented soundscapes in the ARAUS dataset. Table A.1 shows the breakdown (by fold) of the soundscape recordings from the Urban Soundscapes of the World (USotW) database used as base soundscapes for the five-fold cross-validation set of the ARAUS dataset. The first half of R0091 was used as the “pre-experiment”, “attention”, and “post-experiment” stimulus, so R0091 is not considered to be part of any fold in the ARAUS dataset. Table A.2 shows the breakdown (by fold) of the source recordings from Freesound and Xeno-Canto used as maskers for both the five-fold cross-validation set and independent test set. Panoramic photos and GPS coordinates of the locations recorded for the independent test set of the ARAUS dataset are shown in Fig. A.1.

TABLE A.1: List of Urban Soundscapes of the World (USotW) recordings used for the cross-validation set of the ARAUS dataset. Recordings are denoted using their index numbers in the USotW database. Single asterisks (*) and daggers (†) respectively denote recordings whose first and second halves were omitted from the ARAUS dataset. As of publication time, R0021, R0077, R0086, R0093, R0100, and R0102 are not listed in the USotW database.
Fold USotW identifiers
Fold 1 R0005 R0011 R0017 R0018 R0027 R0046 R0048 R0055 R0056 R0071 R0079 R0080
R0081 R0084 R0089 R0092 R0095 R0111 R0115 R0116 R0122 R0124 R0126 R0129
Fold 2 R0001 R0006 R0016 R0020 R0025 R0026 R0033 R0037 R0040 R0044 R0052 R0053
R0067 R0076 R0078 R0082 R0088 R0103 R0106 R0107 R0110 R0117 R0123 R0133
Fold 3 R0009 R0012 R0024 R0028 R0029 R0032 R0034 R0041 R0042 R0045 R0051 R0057
R0059 R0063 R0065 R0070 R0085 R0094 R0104 R0109 R0114 R0118 R0120 R0132
Fold 4 R0002 R0003 R0004 R0013 R0015 R0019 R0022 R0031 R0039 R0043 R0047 R0050
R0054 R0061 R0072 R0074 R0087 R0097 R0098 R0099 R0112 R0113 R0127 R0131
Fold 5 R0007 R0008 R0010 R0023 R0030 R0035 R0036 R0038 R0049 R0058 R0060 R0062
R0064 R0069 R0073 R0075 R0090 R0096 R0105 R0108 R0119 R0121 R0128 R0130
TABLE A.2: List of tracks from Freesound (denoted as “FS”) and Xeno-canto (denoted as “XC”) used as maskers in the ARAUS dataset. Numbers after “FS” and “XC” denote the index numbers of the tracks in the Freesound and Xeno-canto databases, respectively. Within each row, the bird and water classes each span two columns.
Masker track identifier
Fold Bird Construction Traffic Water Wind
Test XC568124 XC640568 FS586168 FS587219 FS587000 FS587759 FS587205
1 XC109203 XC482053 FS218748 FS84646 FS202915 FS463265 FS181255
XC134886 XC503057 FS246171 FS235531 FS345649 FS516934 FS244942
XC184374 XC518767 FS400991 FS243720 FS376801 FS541717 FS403051
XC185560 XC518843 FS421050 FS330288 FS410927 FS547136 FS444921
XC311306 XC556166 FS455683 FS337095 FS412308 FS547892 FS454373
XC370500 XC571000 FS553476 FS426886 FS415151 FS548476 FS483076
XC419391 XC612865 FS555037 FS454864 FS433589 FS550930 FS546527
XC470089 XC613796 FS555039 FS504138 FS450755 FS553051 FS548173
2 FS257445 XC477488 FS74505 FS67704 FS260056 FS441152 FS144083
FS317450 XC481933 FS134896 FS77016 FS336848 FS457565 FS185070
XC85417 XC509271 FS193351 FS160015 FS346641 FS459983 FS242064
XC133059 XC537855 FS194866 FS191350 FS365915 FS533932 FS422579
XC301810 XC553169 FS522584 FS322231 FS400402 FS534124 FS423800
XC332810 XC562095 FS553475 FS370009 FS401277 FS544183 FS444852
XC370485 XC600722 FS555025 FS433869 FS411509 FS550756 FS475448
XC420908 XC608496 FS555040 FS448092 FS423622 FS552457 FS540192
3 FS478637 XC566219 FS171400 FS93329 FS56771 FS388706 FS62056
XC122469 XC575300 FS289478 FS118046 FS169181 FS414762 FS73714
XC137556 XC591803 FS335963 FS160002 FS185062 FS415027 FS346106
XC140239 XC598469 FS361777 FS173154 FS243629 FS436813 FS397947
XC350943 XC601752 FS376405 FS336649 FS249485 FS537986 FS405561
XC485841 XC602571 FS383442 FS394689 FS256009 FS543579 FS423914
XC552316 XC602677 FS545183 FS439218 FS329680 FS546236 FS438857
XC554222 XC603552 FS555038 FS546474 FS384276 FS551989 FS491296
4 XC50850 XC480035 FS62394 FS65807 FS260980 FS478439 FS104875
XC112734 XC570338 FS149777 FS84645 FS264598 FS511098 FS131032
XC368749 XC579905 FS169098 FS149814 FS396056 FS542260 FS180025
XC375911 XC580986 FS170873 FS252216 FS438925 FS544838 FS182837
XC445899 XC600707 FS173106 FS274830 FS448766 FS546107 FS211820
XC449522 XC601772 FS192176 FS434302 FS454283 FS546804 FS345681
XC457222 XC604251 FS483556 FS463636 FS459850 FS548381 FS457318
XC467315 XC611766 FS555035 FS512142 FS460441 FS553159 FS546526
5 XC203258 XC489739 FS112637 FS23273 FS62395 FS489073 FS84111
XC242969 XC505573 FS118042 FS110309 FS167034 FS536223 FS104952
XC255139 XC545465 FS328141 FS180156 FS169250 FS547012 FS441866
XC348370 XC555184 FS362068 FS259629 FS261445 FS547232 FS469280
XC376468 XC578478 FS384837 FS261344 FS352902 FS547720 FS533930
XC450847 XC600292 FS488828 FS262830 FS365919 FS549929 FS546759
XC475279 XC604437 FS555026 FS503456 FS376709 FS552691 FS548235
XC482274 XC614063 FS555034 FS551451 FS469009 FS553135 FS551318
Figure A.1: Panoramic photographs of locations where recordings for test set of ARAUS dataset were made, with coordinates specified as (latitude, longitude) pairs. From top to bottom: (a) A road facing a construction site (1.342123, 103.683790), (b) A gazebo in a park (1.342780, 103.684824), (c) A walkway facing a lake (1.346492, 103.687056), (d) A walkway facing a crowded canteen (1.342259, 103.682418), (e) A path facing a lake (1.344954, 103.684667), (f) A path facing a lake with an aircraft flying overhead (1.344954, 103.684667).
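The augmented soundscapes in the ARAUS dataset were generated by digitally adding the maskers listed in Table A.2 to the base soundscapes listed in Table A.1 at fixed soundscape-to-masker ratios (SMRs). As a rough illustration of this process, the sketch below mixes a masker into a base soundscape; it assumes the SMR is the difference between the RMS levels of the soundscape and the scaled masker in decibels, and uses placeholder file names, so it is only an approximation of the calibrated procedure used to generate the dataset.

```python
import numpy as np
import soundfile as sf


def _rms(x):
    return np.sqrt(np.mean(np.square(x)))


def mix_at_smr(soundscape_path, masker_path, smr_db):
    """Illustrative mixing of a masker into a base soundscape at a given
    soundscape-to-masker ratio (SMR), taken here as the difference between
    the RMS levels of the soundscape and the scaled masker in dB."""
    s, fs = sf.read(soundscape_path)
    m, fs_m = sf.read(masker_path)
    assert fs == fs_m, "resample the masker first if the sample rates differ"

    # Assumes both files share the same channel layout; truncate the masker
    # to the length of the base soundscape.
    m = m[: len(s)]
    gain = _rms(s) / (_rms(m) * 10 ** (smr_db / 20))  # place masker smr_db below soundscape
    return s + gain * m


# Example call (placeholder file names, not actual ARAUS files):
# augmented = mix_at_smr("R0005.wav", "bird_XC109203.wav", smr_db=6)
```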

Appendix B Participant Information Questionnaire

This appendix lists all the items in the participant information questionnaire (PIQ), which contains the demographic and auxiliary information that we collected from the participants who contributed their responses to the ARAUS dataset.

The questionnaire was administered to all participants via a digital form. All questions were multiple-choice questions except for Question 2(a), which asked for the age of the participant and accepted positive integers. The possible options presented to the participants, together with the numerical values used to code their responses in the dataset, are shown in square brackets. Yes/no questions are indicated by [Y/N] at the end of the questions; “No” is coded as 0 and “Yes” is coded as 1.
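As an illustration of the coding scheme described above, the sketch below converts one participant’s raw answers into numerical codes; the dictionary keys and the helper function are hypothetical and do not correspond to the actual column names of the ARAUS dataset.

```python
import random

# Hypothetical raw responses from one participant (keys are illustrative only).
raw_response = {
    "speaks_other_language": "Yes",       # Question 1(a), [Y/N]
    "gender": "Other/prefer not to say",  # Question 2(b)
    "ethnic_group": "Chinese",            # Question 2(c)
    "annoyance_past_12_months": 7,        # Question 3, 11-point Likert [0-10]
}

YES_NO = {"No": 0, "Yes": 1}
ETHNIC_GROUP = {"Other": 0, "Chinese": 1, "Malay": 2, "Indian": 3}


def code_gender(answer):
    """Code gender as in Question 2(b): Male -> 0, Female -> 1.
    'Other/prefer not to say' is randomly coded as 0 or 1 with equal
    probability, as described in the note to Question 2(b)."""
    if answer == "Male":
        return 0
    if answer == "Female":
        return 1
    return random.randint(0, 1)


coded_response = {
    "speaks_other_language": YES_NO[raw_response["speaks_other_language"]],
    "gender": code_gender(raw_response["gender"]),
    "ethnic_group": ETHNIC_GROUP[raw_response["ethnic_group"]],
    "annoyance_past_12_months": raw_response["annoyance_past_12_months"],
}
print(coded_response)
```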

1. Spoken languages

   (a) Do you speak fluently in any languages/dialects other than English? [Y/N]
       If your response is “No”, please go to Question 2.

   (b) Is English your first or native language? [Y/N]

   (c) Among the languages/dialects you speak, would you consider yourself to be most fluent in English? [Y/N]

2. Demographic information

   (a) What is your age?

   (b) What is your gender?
       [0] Male [1] Female [0 or 1] Other/prefer not to say
       Note: Because only a very small number of participants chose this option, creating a risk of identification, responses of “Other/prefer not to say” were randomly coded as 0 or 1, each with a probability of 50%.

   (c) What is your ethnic group?
       [0] Other [1] Chinese [2] Malay [3] Indian

   (d) What is the highest level of education you have completed?
       [0] Other [1] No qualification [2] Primary (PSLE), elementary school or equivalent [3] Secondary (GCE ‘N’ & ‘O’ level), middle school or equivalent [4] Institute of Technical Education or equivalent [5] Junior College (‘A’ level), high school or equivalent [6] Polytechnic and Arts Institution (Diploma level) or equivalent [7] University (Bachelor’s Degree) or equivalent [8] University (Master’s Degree) or equivalent [9] University (PhD)

   (e) What is your occupational status?
       [0] Other [1] Student [2] Employed [3] Retired [4] Unemployed
       If your response is “Student”, please go to Question 2(g).

   (f) What is the highest level of education you are currently undergoing?
       [0] Other [2] Primary (PSLE), elementary school or equivalent [3] Secondary (GCE ‘N’ & ‘O’ level), middle school or equivalent [4] Institute of Technical Education or equivalent [5] Junior College (‘A’ level), high school or equivalent [6] Polytechnic and Arts Institution (Diploma level) or equivalent [7] University (Bachelor’s Degree) or equivalent [8] University (Master’s Degree) or equivalent [9] University (PhD)

   (g) What dwelling type is your current main residence in Singapore?
       [0] Other [1] Housing Development Board (HDB) flat or other public apartment [2] Hall of Residence or other student dormitory [3] Landed property [4] Condominium or other private apartment

   (h) Are you a Singapore citizen? [Y/N]

   (i) Have you resided in Singapore for more than 10 years? [Y/N]

3. How much has indoor/outdoor noise bothered, disturbed, or annoyed you over the past 12 months?

   Not at all [0] · 11-point Likert · [10] Extremely

4. How would you describe your satisfaction with the overall quality of the acoustic environment in Singapore?

   Extremely dissatisfied [0] · 11-point Likert · [10] Extremely satisfied

5. Shortened Weinstein Noise Sensitivity Scale (WNSS-10): Below are a number of statements addressing individual reactions to noise. After reading each statement, please select the option that best represents your level of agreement with the statement.
   Note: Items marked with single asterisks (*) were reverse coded; in other words, “Strongly disagree” was coded as “5” and “Strongly agree” was coded as “1” for these items. The asterisks were not shown in the digital form presented to the participants. A scoring sketch illustrating the reverse coding is given at the end of this appendix.

   [1] Strongly disagree [2] Disagree [3] Neither agree nor disagree [4] Agree [5] Strongly agree

   (a) I complain once I have run out of patience with a noise.
   (b) I am annoyed even by low noise levels.
   (c) I would not want to live on a noisy street, even if the house was nice.
   (d) I get mad at people who make noise that keeps me from falling asleep or getting work done.
   (e) I cannot fall asleep easily when there is noise.
   (f) I am sensitive to noise.
   (g) I am easily awakened by noise.
   (h) I get used to most noises without much difficulty.*
   (i) I find it hard to relax in a place that’s noisy.
   (j) I’m good at concentrating no matter what is going on around me.*

6. Shortened Perceived Stress Scale (PSS-10): The questions in this scale ask you about your feelings and thoughts during the last month. In each case, you will be asked to indicate how often you felt or thought a certain way. Although some of the questions are similar, there are differences between them and you should treat each one as a separate question. The best approach is to answer each question fairly quickly. That is, don’t try to count up the number of times you felt a particular way, but rather indicate the alternative that seems like a reasonable estimate.

   [0] Never [1] Almost never [2] Sometimes [3] Fairly often [4] Very often

   In the last month, how often have you…
   (a) been upset because of something that happened unexpectedly?
   (b) felt that you were unable to control the important things in your life?
   (c) felt nervous and “stressed”?
   (d) felt confident about your ability to handle your personal problems?
   (e) felt that things were going your way?
   (f) found that you could not cope with all the things that you had to do?
   (g) been able to control irritations in your life?
   (h) felt that you were on top of things?
   (i) been angered because of things that were outside of your control?
   (j) felt difficulties were piling up so high that you could not overcome them?

7. WHO-5 Well-Being Index: For each of the statements below, which is the closest to how you have been feeling over the last two weeks?

   [0] At no time [1] Some of the time [2] Less than half of the time [3] More than half of the time [4] Most of the time [5] All of the time

   (a) I have felt cheerful and in good spirits.
   (b) I have felt calm and relaxed.
   (c) I have felt active and vigorous.
   (d) I woke up feeling fresh and rested.
   (e) My daily life has been filled with things that interest me.

8. Positive and Negative Affect Schedule (PANAS): In the last two weeks, to what extent have you felt this way?
   Note: Items corresponding to Positive Affect (marked with the oplus symbol ⊕) and Negative Affect (marked with the ominus symbol ⊖) were tallied to give separate Positive Affect and Negative Affect scores. The ⊕ and ⊖ symbols were not shown in the digital form presented to the participants.

   [1] Very slightly or not at all [2] A little [3] Moderately [4] Quite a bit [5] Extremely

   (a) Interested⊕ (b) Distressed⊖ (c) Excited⊕ (d) Upset⊖ (e) Strong⊕ (f) Guilty⊖ (g) Scared⊖ (h) Hostile⊖ (i) Enthusiastic⊕ (j) Proud⊕ (k) Irritable⊖ (l) Alert⊕ (m) Ashamed⊖ (n) Inspired⊕ (o) Nervous⊖ (p) Determined⊕ (q) Attentive⊕ (r) Jittery⊖ (s) Active⊕ (t) Afraid⊖
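As noted for Questions 5 and 8 above, the reverse-coded WNSS-10 items and the PANAS items are combined into scale scores (a single WNSS-10 score, and separate Positive and Negative Affect scores). The sketch below shows one way such scoring could be implemented; the function names are illustrative, and the item indices follow the orderings and markings given above.

```python
# Illustrative scoring of the WNSS-10 and PANAS questionnaires described above.
# Item orderings follow Questions 5 and 8; function names are hypothetical.


def score_wnss10(responses):
    """Sum of WNSS-10 items (a)-(j), each on a 1-5 scale.
    Items (h) and (j) are reverse coded, i.e. a response r becomes 6 - r."""
    reverse_coded = {7, 9}  # zero-based indices of items (h) and (j)
    return sum(6 - r if i in reverse_coded else r for i, r in enumerate(responses))


def score_panas(responses):
    """Return (positive_affect, negative_affect) scores from PANAS items (a)-(t),
    each on a 1-5 scale, following the markings in Question 8."""
    positive = {0, 2, 4, 8, 9, 11, 13, 15, 16, 18}  # interested, excited, strong, ...
    negative = {1, 3, 5, 6, 7, 10, 12, 14, 17, 19}  # distressed, upset, guilty, ...
    pa = sum(r for i, r in enumerate(responses) if i in positive)
    na = sum(r for i, r in enumerate(responses) if i in negative)
    return pa, na


# Example: a participant answering "3" to every item.
print(score_wnss10([3] * 10))  # 30
print(score_panas([3] * 20))   # (30, 30)
```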

Appendix C Distributions and Statistical Test Results for PIQ and ARQ

This appendix collates the detailed distributions and statistical test results for the PIQ and ARQ described in Section 4. Fig. C.1 and Fig. C.2 respectively show the distributions of the PIQ items coded as continuous variables (as violin plots) and as categorical variables (as stacked bar plots), by fold in the cross-validation set. Fig. C.3 presents the distributions of the single-value metrics described in Section 3.9, as violin plots by fold in the cross-validation set and test set. Tables C.1 and C.2 respectively consolidate the Kruskal-Wallis and $\chi^2$ test results for the PIQ items coded as continuous and categorical variables, and Table C.3 presents the Kruskal-Wallis test results for the single-value metrics described in Section 3.9.
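The fold-wise comparisons summarized in Tables C.1–C.3 use standard Kruskal-Wallis and $\chi^2$ tests. The sketch below illustrates the corresponding SciPy calls; the file name, column names, and the fold-versus-rest contingency grouping are illustrative assumptions rather than the exact procedure described in Section 4.

```python
import pandas as pd
from scipy.stats import kruskal, chi2_contingency

# Placeholder: a table of per-participant PIQ responses with a "fold" column.
piq = pd.read_csv("participant_info.csv")  # hypothetical file name

# Kruskal-Wallis test for a continuous variable (e.g., age) across folds,
# as in Table C.1.
groups = [g["age"].values for _, g in piq.groupby("fold")]
H, p = kruskal(*groups)
print(f"Age: H = {H:.4f}, p = {p:.4f}")

# Chi-squared test for a categorical variable (e.g., gender) in one fold
# against the remaining folds, in the spirit of Table C.2.
fold_of_interest = 1
contingency = pd.crosstab(piq["fold"] == fold_of_interest, piq["gender"])
chi2, p, dof, _ = chi2_contingency(contingency)
print(f"Gender (fold {fold_of_interest} vs rest): chi2 = {chi2:.4f}, p = {p:.4f}")
```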

Figure C.1: Violin plots of distributions of PIQ responses coded as continuous variables, by fold: (a) Age, (b) Extent annoyed by noise over past 12 months, (c) Satisfaction with overall acoustic environment in Singapore, (d) Score on WNSS-10, (e) Score on PSS-10, (f) Score on WHO-5, (g) Score on PANAS for Positive Affect, (h) Score on PANAS for Negative Affect. Red horizontal dotted lines denote the maximum and minimum values theoretically attainable by the variables. Kruskal-Wallis tests found no significant differences ($p > 0.05$) between distributions of all variables by fold.
TABLE C.1: Summary of results of Kruskal-Wallis tests for PIQ responses coded as continuous variables, by fold. The test statistic for the Kruskal-Wallis test is denoted by $H$ and the corresponding $p$-values are given as $p$. No significant differences ($p > 0.05$) were observed between distributions of all variables by fold, regardless of whether the test set was included as an additional fold.
Cross-validation set Cross-validation set + test set
Variable $H$ $p$ $H$ $p$
(a) Age     1.6519     0.7994      4.4477      0.4869
(b) Extent annoyed by noise over past 12 months     3.2587     0.5155      3.3215      0.6506
(c) Satisfaction of overall acoustic environment in Singapore     4.7949     0.3090      5.0618      0.4084
(d) Score on WNSS-10     2.3861     0.6651      2.4252      0.7877
(e) Score on PSS-10     1.1454     0.8870      2.5283      0.7722
(f) Score on WHO-5     2.7427     0.6018      4.0850      0.5372
(g) Score on PANAS for Positive Affect     2.0961     0.7181      3.2124      0.6673
(h) Score on PANAS for Negative Affect     1.2815     0.8645      1.3094      0.9340
Figure C.2: Stacked bar plots of distributions of PIQ responses coded as categorical variables, by fold: (a) If participant was multilingual, (b) If participant was a native English speaker (N.A. if monolingual), (c) If participant considered themselves most fluent in English (N.A. if monolingual), (d) Gender, (e) Ethnicity, (f) Highest education level obtained, (g) Occupational status, (h) Current education level (N.A. if not student), (i) Dwelling type, (j) If participant was a Singapore citizen, (k) Length of residence in Singapore. Chi-squared tests found no significant differences ($p > 0.05$) between distributions of all variables by fold, except for ethnicity in the test set. Abbreviations in legend entries: Poly. = Polytechnic, Bach. Deg. = Bachelor’s degree, Pub. Apt. = Public apartment, Priv. Pty. = Private property, Priv. Apt. = Private apartment.
TABLE C.2: Summary of results of $\chi^2$-tests for PIQ responses coded as categorical variables, by fold. The test statistic for the $\chi^2$-test is denoted by $\chi^2$ and the corresponding $p$-values are given as $p$. Significant differences ($p < 0.05$) were observed only for the test set in terms of the ethnicity of the participants.
Fold
Test 1 2 3 4 5
Variable $\chi^2$ $p$ $\chi^2$ $p$ $\chi^2$ $p$ $\chi^2$ $p$ $\chi^2$ $p$ $\chi^2$ $p$
(a) If participant was multilingual 0.6114 0.4343 0.2443 0.6211 0.0153 0.9017 0.1870 0.6654 0.0153 0.9017 0.0344 0.8530
(b) If participant was a native English speaker 0.8026 0.6694 0.3707 0.8308 1.9425 0.3786 1.3122 0.5189 0.6725 0.7145 1.6799 0.4317
(c) If participant considered themselves most fluent in English 1.7748 0.4117 1.3885 0.4995 0.6943 0.7067 0.4855 0.7845 0.5547 0.7578 0.2225 0.8947
(d) Gender 1.3607 0.2434 0.2630 0.6081 2.5980 0.1070 0.5918 0.4417 0.3435 0.5578 0.5918 0.4417
(e) Ethnicity 11.7723 0.0082 1.0508 0.7890 3.5286 0.3171 0.6204 0.8917 0.9676 0.8091 2.6249 0.4531
(f) Highest education level obtained 3.3766 0.7603 3.8880 0.6918 1.5046 0.9592 3.7658 0.7083 1.7228 0.9433 2.9474 0.8154
(g) Occupational status 0.2967 0.9900 3.0213 0.5543 3.3845 0.4957 5.3639 0.2520 2.8398 0.5850 0.8643 0.9296
(h) Current education level 0.8517 0.9736 2.1174 0.8327 0.6463 0.9858 5.0025 0.4156 2.5056 0.7756 4.2588 0.5128
(i) Dwelling type 1.8017 0.6146 3.1992 0.3619 0.7510 0.8612 2.9759 0.3954 1.5848 0.6628 1.0186 0.7968
(j) If participant was a Singapore citizen 0.4287 0.5126 0.0829 0.7734 1.3259 0.2495 1.3259 0.2495 2.3152 0.1281 0.2435 0.6217
(k) Length of residence in Singapore 0.7960 0.3723 0.0300 0.8626 0.0674 0.7951 1.7997 0.1798 1.0787 0.2990 0.1517 0.6969
TABLE C.3: Summary of results of Kruskal-Wallis tests for consistency checks on ARQ responses, by fold. The test statistic for the Kruskal-Wallis test is denoted by $H$ and the corresponding $p$-values are given as $p$. No significant differences ($p > 0.05$) were observed between distributions of all variables by fold, regardless of whether the test set was included as an additional fold. MAD stands for mean absolute deviation and MSE stands for mean squared error. Except for (a), the MAD and MSE were computed by taking the mean across all stimuli presented to a single participant.

Cross-validation set Cross-validation set + test set
Variable $H$ $p$ $H$ $p$
(a) MAD between ARQ responses on first (“pre-experiment”) and last (“post-experiment”) stimuli 1.7164 0.7877 2.9306 0.7107
(b) MAD between reversed “pleasant” ($6-r_{\text{pl}}$) and “annoying” ($r_{\text{an}}$) 5.0281 0.2844 9.7446 0.0828
(c) MAD between reversed “eventful” ($6-r_{\text{ev}}$) and “uneventful” ($r_{\text{ue}}$) 5.1821 0.2691 5.6611 0.3406
(d) MAD between reversed “calm” ($6-r_{\text{ca}}$) and “chaotic” ($r_{\text{ch}}$) 1.8261 0.7677 5.5218 0.3556
(e) MAD between reversed “vibrant” ($6-r_{\text{vi}}$) and “monotonous” ($r_{\text{mo}}$) 5.3364 0.2545 5.3441 0.3753
(f) MSE between “pleasant” ($r_{\text{pl}}$) and rescaled “ISO Pleasantness” ($3+2P$) 1.3749 0.8485 2.0420 0.8433
(g) MSE between “eventful” ($r_{\text{ev}}$) and rescaled “ISO Eventfulness” ($3+2E$) 4.8070 0.3077 6.9141 0.2271

Figure C.3: Violin plots of distributions of consistency metrics by fold: (a) Mean absolute difference (MAD) between ARQ responses on first (“practice”) and last (“consistency check”) stimuli, (b) MAD between reversed “pleasant” ($6-r_{\text{pl}}$) and “annoying” ($r_{\text{an}}$) responses across all stimuli, (c) MAD between reversed “eventful” ($6-r_{\text{ev}}$) and “uneventful” ($r_{\text{ue}}$) responses across all stimuli, (d) MAD between reversed “calm” ($6-r_{\text{ca}}$) and “chaotic” ($r_{\text{ch}}$) responses across all stimuli, (e) MAD between reversed “vibrant” ($6-r_{\text{vi}}$) and “monotonous” ($r_{\text{mo}}$) responses across all stimuli, (f) Mean squared error (MSE) between “pleasant” ($r_{\text{pl}}$) and rescaled “ISO Pleasantness” ($3+2P$) responses across stimuli, (g) MSE between “eventful” ($r_{\text{ev}}$) and rescaled “ISO Eventfulness” ($3+2E$) responses across stimuli.
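For concreteness, the sketch below shows how the per-participant consistency metrics in Table C.3 and Fig. C.3 could be computed; the arrays of ratings are hypothetical, with ARQ responses on a 1-5 scale and ISO Pleasantness $P$ on $[-1, 1]$ so that $3 + 2P$ lies on the same 1-5 scale.

```python
import numpy as np

# Illustrative per-participant consistency metrics from Table C.3 / Fig. C.3.
# r_pl and r_an are one participant's "pleasant" and "annoying" ratings (1-5)
# across all stimuli; P is the derived ISO Pleasantness on [-1, 1].
r_pl = np.array([4, 5, 3, 2, 4])                # hypothetical ratings
r_an = np.array([2, 1, 3, 4, 2])
P = np.array([0.55, 0.90, 0.10, -0.45, 0.60])   # hypothetical ISO Pleasantness

# (b) MAD between reversed "pleasant" (6 - r_pl) and "annoying" (r_an),
# averaged over all stimuli presented to the participant.
mad_pl_an = np.mean(np.abs((6 - r_pl) - r_an))

# (f) MSE between "pleasant" (r_pl) and rescaled "ISO Pleasantness" (3 + 2P).
mse_pl_iso = np.mean((r_pl - (3 + 2 * P)) ** 2)

print(mad_pl_an, mse_pl_iso)
```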

Appendix D Details of Principal Component Analysis for Fold Allocation

This appendix presents heat maps of the weights given to each input feature when the principal component analysis (PCA) in the fold allocation process described in Section 3.3 was conducted for the set of base urban soundscapes, as well as the sets of maskers in each class. Only the first $k$ principal components for each set are shown in the heat maps, where $k$ is the number of principal components that together explained at least 90% of the observed variance.

Since the PCA was conducted independently for the set of base urban soundscapes and for the set of maskers in each class, the assumption was that each set is best summarized by its own, independent set of principal components. The heat maps are consistent with this, with the weights differing substantially between sets. In addition, weights with relatively high absolute values are clustered in blocks corresponding to each class of psychoacoustic indicators (for example, principal component #2 for the set of soundscape tracks has comparatively high absolute weights for the sharpness-related indicators), indicating that each principal component largely summarizes a group of related features. This is as expected, since the purpose of applying PCA in Section 3.3 was purely dimensionality reduction, i.e., to reduce the number of (redundant) parameters for the subsequent clustering via self-organizing maps.
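A minimal sketch of this component-selection step, using scikit-learn, is shown below; the feature matrix, its dimensions, and the standardization step are placeholder assumptions, with only the 90%-variance criterion taken from the description above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative component selection for one set of tracks (e.g., the base urban
# soundscapes): keep the first k principal components that together explain at
# least 90% of the observed variance.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 20))  # placeholder acoustic/psychoacoustic features

X = StandardScaler().fit_transform(features)
pca = PCA().fit(X)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative_variance, 0.90) + 1)

reduced = pca.transform(X)[:, :k]  # input to the subsequent clustering step
weights = pca.components_[:k]      # (k, n_features) weights shown in the heat maps
print(k, reduced.shape, weights.shape)
```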

Figure D.1: Heat maps of (top) actual values, (bottom) absolute values of the first 9 principal components of the base urban soundscape recordings.
Figure D.2: Heat maps of (top) actual values, (bottom) absolute values of the first 13 principal components of the maskers in the bird class.
Figure D.3: Heat maps of (top) actual values, (bottom) absolute values of the first 9 principal components of the maskers in the construction class.
Figure D.4: Heat maps of (top) actual values, (bottom) absolute values of the first 8 principal components of the maskers in the traffic class.
Figure D.5: Heat maps of (top) actual values, (bottom) absolute values of the first 8 principal components of the maskers in the water class.
Figure D.6: Heat maps of (top) actual values, (bottom) absolute values of the first 6 principal components of the maskers in the wind class.