
ARAUS: A Large-Scale Dataset and
Baseline Models of Affective Responses to
Augmented Urban Soundscapes

Kenneth Ooi,  Zhen-Ting Ong, Karn N. Watcharasupat,  Bhan Lam, 
Joo Young Hong, and Woon-Seng Gan
K. Ooi, Z.-T. Ong, B. Lam, and W.-S. Gan are with the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. Emails: {wooi002, ztong, bhanlam, ewsgan}@ntu.edu.sg. K. N. Watcharasupat was with the School of Electrical and Electronic Engineering, NTU, Singapore. She is currently with the Center for Music Technology, Georgia Institute of Technology, Atlanta, GA, USA. Email: kwatcharasupat@gatech.edu. J. Y. Hong is with the Department of Architectural Engineering, Chungnam National University, Daejeon, Republic of Korea. Email: jyhong@cnu.ac.kr. The research protocols used in this research were approved by the NTU Institutional Review Board (Ref. IRB-2020-08-035). The ARAUS dataset (including rejected data) is publicly available in the NTU research data repository DR-NTU (Data) at https://doi.org/10.21979/N9/9OTEVX. Replication code for analysis and baseline models in this paper is available at https://github.com/ntudsp/araus-dataset-baseline-models. ©2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Manuscript received Jul 3, 2022; revised Dec 22, 2022; accepted Feb 9, 2023.
Abstract

Choosing optimal maskers for existing soundscapes to effect a desired perceptual change via soundscape augmentation is non-trivial due to extensive varieties of maskers and a dearth of benchmark datasets with which to compare and develop soundscape augmentation models. To address this problem, we make publicly available the ARAUS (Affective Responses to Augmented Urban Soundscapes) dataset, which comprises a five-fold cross-validation set and independent test set totaling 25,440 unique subjective perceptual responses to augmented soundscapes presented as audio-visual stimuli. Each augmented soundscape is made by digitally adding “maskers” (bird, water, wind, traffic, construction, or silence) to urban soundscape recordings at fixed soundscape-to-masker ratios. Responses were then collected by asking participants to rate how pleasant, annoying, eventful, uneventful, vibrant, monotonous, chaotic, calm, and appropriate each augmented soundscape was, in accordance with ISO/TS 12913-2:2018. Participants also provided relevant demographic information and completed standard psychological questionnaires. We perform exploratory and statistical analysis of the responses obtained to verify internal consistency and agreement with known results in the literature. Finally, we demonstrate the benchmarking capability of the dataset by training and comparing four baseline models for urban soundscape pleasantness: a low-parameter regression model, a high-parameter convolutional neural network, and two attention-based networks in the literature.

Index Terms:
Soundscape, dataset, regression, deep neural network, soundscape augmentation, auditory masking

1 Introduction

Soundscape augmentation involves the addition of sounds, typically referred to as “maskers”, to existing acoustic environments in an effort to change the perception that real or hypothetical listeners may have towards them. This perception-based approach to noise mitigation was developed upon findings that indicators based purely on the sound pressure level are inefficient or ineffective at influencing the perception of soundscapes [1, 2, 3, 4]. For instance, based on a synthesis of studies in the literature, [5] found that at the same day-night level above 52 dB, the percentage of people rating aircraft noise as “highly annoying” significantly exceeded that for road traffic noise, and the percentage for road traffic noise in turn significantly exceeded that for rail noise. This is further backed by a numerical study conducted in [6].

Changes in perception have ordinarily been measured using subjective ratings of descriptors such as “soundscape quality” [7], “preference” [8], “perceived loudness” [9], “vibrancy” [10], and other adjectives described in the ISO/TS 12913-3:2019 circumplex model of soundscape perception [11]. These ratings, when aggregated over multiple human participants or soundscapes, typically form a set of indicators that soundscape practitioners can use to guide interventions or analyses of a given soundscape.

However, a perennial concern lies in choosing “optimal” or “appropriate” maskers to optimize the value of a given perceptual indicator, given the large varieties of possible maskers. This is important in real-life applications of soundscape augmentation systems, such as an automatic masker system that plays back maskers in real time in time-varying urban acoustic environments [12, 13].

Most prior studies have primarily addressed this concern via expert-guided or post hoc analysis [14, 15], but the number of maskers and masker types under consideration at a time tends to be relatively small, on the order of tens. This limits the choice of possible maskers for different scenarios and the generalizability of conclusions of soundscape augmentation studies to more diverse real-world scenarios.

Hence, a dataset of subjective perceptual responses to a broad variety of augmented urban soundscapes, comprising a large set of urban soundscapes and a correspondingly diverse set of maskers, would pave the way to systematic investigations of suitable maskers and masker types for urban soundscapes in general, as well as the development of better models for soundscape augmentation. Such a dataset could also alternatively be used for soundscape studies desiring a representative sample of soundscapes from a perceptual space or with given characteristics, such as those performed by [16] for pleasantness, [17] for soundscape quality, and [18] for validating translations of perceptual attributes.

However, to the best of our knowledge, there is currently no common, consolidated benchmark dataset for fair cross-comparison of prediction models in the soundscape literature. According to a systematic review of prediction models by [19], the largest datasets used by prediction models for soundscape perception in the literature also appear to individually number from the thousands to about ten thousand samples. As such, this work aims to craft a large-scale, public dataset of affective responses to soundscapes which can serve as a benchmark for designing models to guide masker choice in soundscape augmentation.

Our proposed Affective Responses to Augmented Urban Soundscapes (ARAUS) dataset strives to achieve this by using urban soundscapes covering a wide range of acoustic environments and maskers covering several sound types previously investigated in the soundscape literature. Additionally, the data collection protocol is compliant with the ISO/TS 12913-2:2018 [20] standard and is designed to be replicable with minimal specialized equipment, ensuring that this dataset can be extended by as many research groups as possible.

Due to the scope and nature of the proposed dataset, the collected data can also secondarily serve to empirically verify findings from existing soundscape literature. Additionally, the size of the dataset allows for relatively larger machine-learning models to be investigated along with traditional low-parameter models.

This article is organized as follows. Section 2 gives an overview of related soundscape datasets in the literature. Section 3 describes the stimuli generation and data collection method used. Section 4 describes exploratory and confirmatory data analyses of the collected responses. Section 5 presents benchmark models for prediction of perceptual responses to augmented soundscapes. Section 6 discusses the limitations and future directions for this work. Finally, Section 7 summarizes the findings of this work and makes some concluding remarks.

2 Related Datasets

Large-scale audio datasets have been made publicly available as unlabeled and/or labeled data for several “objective” acoustic tasks such as acoustic scene classification [21, 22], sound event localization and detection [23], and anomaly detection [24]. However, affective audio datasets tend to be smaller and rarely reach the scale of those developed for objective tasks. In particular, to the best of our knowledge, there is no publicly available affective soundscape dataset at the scale of the proposed ARAUS dataset.

In this section, a selection of large-scale and affective audio datasets is reviewed. A brief overview of key details for the datasets discussed is shown in Table I.

2.1 Large-scale Audio Datasets

The largest curated audio dataset is arguably AudioSet, which at the time of its initial release contained 1.7M 10-second segments of audio from YouTube videos organized in a systematic ontology for audio tagging [25]. Strong labels have subsequently been provided for a subset of the audio [26], and at the time of writing, AudioSet has grown to about 2.1M segments, with 120K being strongly labeled [27]. However, the publicly available version of the data constituting AudioSet is in the form of 128-dimensional VGGish features [28] due to potential copyright issues related to the use of raw audio from YouTube videos, which limits its use as a public dataset.

Alternatives to AudioSet include UrbanSound8K [29], ESC-50 [30], and FSD50K [31], which contain labelled Creative Commons-licensed tracks from Freesound [32] and serve as publicly-available benchmark datasets for weak audio classification. However, the nature of the stimuli in these datasets tends to be monophonic or same-class polyphonic. While this is useful in reducing the complexity and noise in inputs to train robust sound event classification models, the full complexities of real-life acoustic environments necessary for soundscape research are rarely represented in these datasets. Individual monophonic stimuli may be used to compose synthetic soundscapes like in the URBAN-SED dataset [33], which used the tracks in UrbanSound8K, but the focus of UrbanSound8K on just 10 possible event classes may be insufficient to emulate the variety of sound sources possible in real-life urban environments.

Therefore, multiple efforts have been made to record real-life acoustic environments, which are generally polyphonic in nature, and provide them as publicly available datasets for use in sound and soundscape research. These include Urban Soundscapes of the World (USotW) [34]; EigenScape [35]; SONYC Urban Sound Tagging (SONYC-UST) [36] and SONYC-UST-V2 [37]; TUT Acoustic Scenes [38, 39] and TAU Urban Acoustic Scenes (TAU-UAS) [40, 41, 42]; Singapore Polyphonic Urban Audio (SINGA:PURA) [43]; Ambisonics Recordings of Typical Environments (ARTE) [44]; and Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) [45].

However, the aforementioned datasets contain no corresponding “subjective” labels concerning the affective perception of the recorded environments. This limits their use as datasets for soundscape augmentation, because knowing how individual soundscapes are perceived by humans is crucial to analyzing and modeling their perception. In addition, with the exception of USotW, the recordings were not compliant with ISO/TS 12913-2:2018 [20], either because they preceded the publication of the standard or because they were made for a different purpose. Hence, they may not be immediately suitable for use under the ISO 12913 paradigm, which the ARAUS dataset was designed under.

2.2 Affective Sound Datasets

Nonetheless, audio datasets with labels specific to perceptual indicators of the stimuli also exist. For example, the International Affective Digitized Sounds (IADS) dataset [46] comprises 167 individual stimuli, for which labels for the discrete emotional categories they elicit were provided by [47], and an expanded version (IADS-E) was provided by [48]. The largely monophonic stimuli in IADS, however, suffer from the same drawbacks as the datasets based on Freesound. As a workaround, the Emo-Soundscapes dataset [49] used individual clips from Freesound to synthetically generate 1213 soundscapes, each 6 s long, and obtained corresponding valence–arousal labels based on the Self-Assessment Manikin (SAM) [50] from participants on the CrowdFlower platform.

TABLE I: Selection of datasets related to the ARAUS dataset. Datasets with multiple versions only have their latest or most complete version listed.
Dataset Year Samples Length Locations/Sources Labels
AudioSet [25] 2017 2.1M 10 s From YouTube Weak sound events
AudioSet Strong [27] 2021 120K 10 s From AudioSet Strong sound events
UrbanSound8K [29] 2014 8.7K ≤4 s From Freesound Weak sound events
ESC-50 [30] 2015 2.0K 5 s From Freesound Weak sound events
URBAN-SED [33] 2017 10.0K 10 s From UrbanSound8K Strong sound events
FSD50K [31] 2020 51.2K ≤30 s From Freesound Weak sound events
SONYC-UST-V2 [37] 2020 18.5K 10 s Public locations in New York City Weak sound events
SINGA:PURA [43] 2021 6.5K 10 s Public locations in Singapore Strong sound events
STARSS22 [45] 2022 173 ≤300 s Indoor environments in Tampere & Tokyo Strong sound events + direction of arrival
USotW [34] 2017 127 60 s Urban public spaces in 9 cities worldwide Acoustic scenes
EigenScape [35] 2017 64 600 s Public locations in North England Acoustic scenes
ARTE [44] 2019 13 120 s Indoor environments in Sydney Acoustic scenes
TAU-UAS [41] 2019 23.0K 10 s Urban public spaces in 12 European cities Acoustic scenes
IADS-2 [46] 2007 167 6 s From digital recordings SAM
IADS-E [48] 2018 935 6 s From IADS-2, Internet, or composer SAM + basic emotion ratings
Emo-Soundscapes [49] 2017 1.2K 6 s From Freesound SAM
ATHUS [51] 2019 978 ≤79 s Various locations in Athens Subjective soundscape quality
ISD [52] 2021 1.2K 30 s Urban public spaces in 13 European cities Affective + other labels per SSID Protocol [53]
ARAUS [Proposed] 2022 25.4K 30 s USotW, public locations in Singapore (base); Freesound, Xeno-canto (maskers) Affective responses (+ contextual information, see Section 3)

In contrast, datasets with perceptual labels for real-life audio recordings include the Athens Urban Soundscape (ATHUS) dataset [51], which contains 978 crowd-sourced recordings with corresponding labels for subjectively-rated soundscape quality on a five-point Likert scale, as well as the International Soundscape Database (ISD) [52], which as of v0.2.1 consists of 1258 30-second-long recordings in 13 European cities and corresponding perceptual responses collected using the Soundscape Indices (SSID) Protocol [53]. The perceptual responses collected using the SSID Protocol were largely inspired by the Method A questionnaire in ISO/TS 12913-2:2018 [20].

However, the relatively smaller sample sizes of these datasets, as compared to large-scale datasets like AudioSet, may preclude their use in developing high-parameter models, such as deep neural networks, which have shown state-of-the-art performance in various “objective” acoustic tasks. Hence, the ARAUS dataset was designed at a relatively large scale to be amenable to high-parameter models.

3 Data Collection Methodology

Since the primary focus of this study is to create a database of affective responses to augmented soundscapes that is extensive yet extensible, the data collection methodology must necessarily be modular and repeatable to ensure valid analysis of results and enable possible future extensions to the dataset, similar to the philosophy of USotW [34] and ISD [52].

Hence, the ARAUS dataset was designed as a five-fold cross-validation dataset with an independent test set, such that additional folds or data for each fold can be added following the same data collection methodology described in this section. Such a design also allows for each fold to be treated as an independent dataset for training of ensemble models or meta-learning.

To design the cross-validation set, we prepared separate recordings of real-life “base” urban soundscapes and of maskers that could potentially be used to augment those “base” urban soundscapes. The same procedure was used to split the urban soundscape and masker recordings independently into five folds to ensure no data leakage, and combine them within their folds into augmented soundscapes that form the audio-visual stimuli in the ARAUS dataset. The audio-visual stimuli were presented to participants in laboratory conditions and their affective responses to the stimuli were collected. However, responses belonging to participants who responded in an “overly inconsistent” manner were dropped from the dataset (see Section 3.9) and new participants were recruited to replace the dropped responses such that each fold in the five-fold cross-validation set had an equal number of data samples. A summary of the data collection methodology is shown in Fig. 1, and full details of each step are provided in the following subsections.

The test set was designed in the same manner as the cross-validation set, but used no urban soundscapes, maskers, or participants already in the cross-validation set.

Figure 1: Framework of the study methodology.

3.1 Base Urban Soundscapes

For the five-fold cross-validation set, all base urban soundscapes were taken from the Urban Soundscapes of the World (USotW) database [34]. The USotW database contains 127 publicly available recordings of urban soundscapes covering an extensive variety of urban environments from various cities around the world. The urban environments range from parks to busy streets, and the cities include those in Asia, Europe, and North America, which allows the ARAUS dataset to be broad-ranging in its coverage of real-life urban soundscapes. The urban soundscapes in the USotW dataset were also chosen by the USotW team via clustering of locations reported by local experts to be “full of life and exciting”, “chaotic and restless”, “calm and tranquil”, and “lifeless and boring”, which are adjectives spanning the perceptual space generated by the “Pleasantness” and “Eventfulness” axes of the ISO/TS 12913-3:2019 circumplex model [11]. This heightens the suitability of the USotW database for use in the investigation of the affective qualities of soundscapes that the ARAUS dataset aims to enable.

Each 60-second binaural recording in the USotW database was split into two halves of 30 s for the creation of the audio-visual stimuli in the ARAUS dataset. Consequently, the audio-visual stimuli in the ARAUS dataset are all 30 s in length, which is in line with the stimulus length used in several soundscape studies, such as those in [54, 55, 56]. Upon splitting the recordings into two halves, we discarded any half with (1) audible electrical noise (such as that caused by a faulty microphone or loose connection), in order to reflect only accurately-captured real-life soundscapes; (2) measured in-situ $L_{\text{A,eq}}$ values below 52 dB, in order to ensure that reproduction levels were significantly above the noise floor of the laboratory location with the highest noise floor (about 37 dB; see Fig. 3) where the subjective responses were obtained; and/or (3) measured in-situ $L_{\text{A,eq}}$ values above 77 dB, in order to ensure safe listening levels [57] for the participants.

For each half of the binaural recordings that was not discarded, a 0°-azimuth, 0°-elevation field of view (FoV) was cropped out of the corresponding 360° video from the USotW database to form the audio-visual stimulus corresponding to that urban soundscape.

For the test set, six urban soundscapes were recorded at locations within Nanyang Technological University (NTU), Singapore. The recordings were made in a similar manner to those in the USotW database, using equipment in accordance with the SSID Protocol [53]. Post-processing of the test set recordings for use as part of the stimuli for the ARAUS dataset was done in the same manner as that for the five-fold cross-validation set.

In total, this formed a base of 234 urban soundscape recordings for the five-fold cross-validation set of the ARAUS dataset, and an additional base of six soundscapes (not overlapping with the other 234) for the independent test set. Further details on the exact recordings used can be found in Appendix A.

3.2 Maskers

The masker recordings for both the five-fold cross-validation set and the independent test set were derived from source tracks found in the public databases Freesound [32] and Xeno-canto [58]. Both databases host tracks with Creative Commons licenses, with Xeno-canto being a repository of bird calls and Freesound being a more general repository of sound samples and recordings.

The source tracks from both databases that we determined to be relevant to the ARAUS dataset fell into one of the following classes: bird, construction, traffic, water, and wind. Water [59, 60, 61], bird [54, 61, 62], and wind [63] sounds have previously been investigated in soundscape studies as natural-sound maskers. On the other hand, sounds from traffic and construction are ubiquitous noise sources in urban environments [64], and are commonly investigated in the soundscape literature [65, 66, 67]. Therefore, the selection of maskers for the ARAUS dataset covers a variety of urban sounds investigated in the soundscape literature.

The source tracks for maskers corresponding to the “bird” class were first obtained by randomly picking a selection of high-quality tracks of birds on Xeno-canto. Each source track corresponded to bird(s) from a single species, as labeled on Xeno-canto. Additional tracks for the “bird” class and all other classes were obtained via the corresponding search term on Freesound, and picking a selection of “high-quality” tracks containing 30-second sections of sound that corresponded only to that particular masker class, as determined by manual listening. However, the exact number of sources of a single masker class present in a given track was variable, so for instance, a given track of the “bird” class could contain vocalizations from one, two, or more birds, but all tracks of the “bird” class contain only bird vocalizations. Further details on the source tracks used to create the maskers are given in Appendix A.

Each source track was then processed individually to create 30-second single-channel masker recordings. Single-channel recordings of maskers were used because [68] previously found single-channel recordings to be sufficient to replicate the perceived affective quality of soundscapes, which the ARAUS dataset aims to collect responses for. For source tracks that were originally multi-channel, only the first channel was used for consistency. Source tracks originally longer than 30 s were trimmed to 30 s, while those originally shorter than 30 s were either padded with silence or looped. Finally, noise reduction via spectral gating and high-pass filtering was performed for source tracks in the “bird” class to reduce ambient and/or microphone noise in the track. All pre-processing was done manually using Audacity (v2.3.2).
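As a rough illustration of this pre-processing step (excluding the manual noise reduction, which was done in Audacity), the following sketch reduces a source track to a 30-second single-channel masker by taking the first channel and then trimming, looping, or padding to length. The function name and the use of the soundfile library are illustrative assumptions rather than the tooling actually used.

```python
import numpy as np
import soundfile as sf

TARGET_S = 30  # target masker length in seconds

def make_masker(in_path: str, out_path: str, loop: bool = True) -> None:
    """Reduce a source track to a 30-second, single-channel masker recording."""
    audio, fs = sf.read(in_path, always_2d=True)  # shape: (samples, channels)
    mono = audio[:, 0]                            # keep only the first channel
    target_len = TARGET_S * fs
    if len(mono) >= target_len:                   # trim long tracks ...
        mono = mono[:target_len]
    elif loop:                                    # ... loop short tracks ...
        reps = int(np.ceil(target_len / len(mono)))
        mono = np.tile(mono, reps)[:target_len]
    else:                                         # ... or pad them with silence
        mono = np.pad(mono, (0, target_len - len(mono)))
    sf.write(out_path, mono, fs)
```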

In total, this formed a set of 280 masker candidates (56 per fold) that were used to generate the stimuli for the five-fold cross-validation set, and a set of seven maskers for the independent test set. The breakdown of the number of masker recordings by class was 80 bird, 40 construction, 40 traffic, 80 water, and 40 wind for the cross-validation set; and 2 bird, 1 construction, 1 traffic, 1 water, and 2 wind for the independent test set.

3.3 Fold Allocation

After preparing the urban soundscape recordings and maskers as described in Sections 3.1 and 3.2, the tracks were assigned to the five folds of the cross-validation set such that the distributions of psychoacoustic properties of the urban soundscapes and maskers were similar across the five folds. Since psychoacoustic indicators of a given soundscape have non-trivial and non-spurious correlations with corresponding perceptual indicators [69], taking psychoacoustic indicators into account was necessary to minimize distributional shifts across the folds.

The assignment procedure consisted of the following steps, carried out for the urban soundscape recordings and independently for each class of masker tracks.

3.3.1 Track Calibration

Each recording track was calibrated to a pre-defined A-weighted equivalent sound pressure level ($L_{\text{A,eq}}$). For the base urban soundscapes, the in-situ $L_{\text{A,eq}}$ measured at the time of recording was used, while a constant value of 65 dB was used for the maskers, similar to [70].

3.3.2 Acoustic and Psychoacoustic Indicator Computation

For all recordings, summary statistics for acoustic and psychoacoustic indicators, as recommended by ISO/TS 12913-3:2019 [11], were calculated independently for each channel using ArtemiS SUITE (HEAD Acoustics). The indicators comprised sharpness [71], loudness [72], fluctuation strength [73], roughness [73], tonality [74, 75], $L_{\text{A,eq}}$ [76], and the C-weighted equivalent sound pressure level ($L_{\text{C,eq}}$) [76]. Finally, band powers summed over third-octave bands with center frequencies from 5 Hz to 20 kHz were also calculated. Table II shows the summary statistics calculated for each indicator.

With the exception of the tonality in [74], the MATLAB Audio Toolbox (https://www.mathworks.com/products/audio.html) provides standards-compliant implementations of all other psychoacoustic indicators used. Open-source implementations of the psychoacoustic indicators used are either currently available, or have been indicated as planned, as part of the MOSQITO Toolbox [77].

TABLE II: Acoustic and psychoacoustic indicators used as channel-wise summary statistics. The set of summary statistics indicated as “common” comprises the mean, the maximum, the 5% exceedance level, the exceedance level at each decile (10% to 90%), and the 95% exceedance level. Minimum values for sharpness, loudness, fluctuation strength, roughness, and tonality were zero for all stimuli and hence omitted from analysis.
Indicator Unit Summary statistics
Sharpness [71] acum common
Loudness [72] sone common + root mean cube
Fluctuation strength [73] vacil common
Roughness [73] asper common
Tonality [74, 75] tuHMS common
$L_{\text{A,eq}}$ [76] dB common + minimum
$L_{\text{C,eq}}$ [76] dB common + minimum
Spectral powers dB third-octave band-wise sum
(center freq. 5 Hz to 20 kHz)
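To make the “common” summary statistics concrete, the short sketch below computes them from a time series of frame-wise indicator values (e.g., loudness over time). It is an illustrative reimplementation in Python, not the ArtemiS SUITE computation actually used for the dataset.

```python
import numpy as np

def common_statistics(x: np.ndarray) -> dict:
    """Mean, maximum, and exceedance levels (5%, each decile, 95%) of a time
    series of indicator values, i.e. the "common" statistics in Table II."""
    stats = {"mean": float(np.mean(x)), "max": float(np.max(x))}
    for n in (5, *range(10, 100, 10), 95):
        # The N% exceedance level is the value exceeded N% of the time,
        # i.e. the (100 - N)-th percentile of the distribution of values.
        stats[f"L{n}"] = float(np.percentile(x, 100 - n))
    return stats

def root_mean_cube(x: np.ndarray) -> float:
    """Additional statistic used for loudness in Table II."""
    return float(np.cbrt(np.mean(x ** 3)))
```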

3.3.3 Dimensionality Reduction

The summary statistics were then used as input features for a principal component analysis (PCA), which was used to project each recording into a principal component space with enough dimensions to achieve 90% explained variance. Further details on the PCA can be found in Appendix D.

The primary reason for using PCA in the assignment of folds was to remove correlations between multiple variables, which is desirable for noise and dimensionality reduction prior to clustering, as illustrated by [78], which removed correlated variables before clustering soundscape recordings across eight acoustic indices.
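A minimal sketch of this dimensionality-reduction step is shown below, using scikit-learn and assuming a `features` matrix whose rows are recordings and whose columns are the channel-wise summary statistics. Standardizing the features before the PCA is an assumption made here; the exact configuration used for the dataset is detailed in Appendix D.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def project_to_pc_space(features: np.ndarray, explained: float = 0.90) -> np.ndarray:
    """Project recordings (rows of `features`) onto enough principal
    components to reach the requested fraction of explained variance."""
    scaled = StandardScaler().fit_transform(features)     # z-score each statistic
    pca = PCA(n_components=explained, svd_solver="full")  # float: variance target
    coords = pca.fit_transform(scaled)
    print(f"Kept {pca.n_components_} components "
          f"({pca.explained_variance_ratio_.sum():.1%} variance explained)")
    return coords
```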

3.3.4 Clustering and Fold Assignment

With the coordinates of each recording in the principal component space, the recordings were organized into clusters of five using a self-organizing map (SOM) [79].

For each cluster of five recordings, each recording in the cluster was randomly assigned to a distinct fold of the cross-validation set. To prevent data leakage, the assignment was done based on psychoacoustic indicators computed from the original 60-second binaural recordings from the USotW database, such that the 30-second halves in the ARAUS dataset originating from the same original binaural recording were always assigned to the same fold.
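The within-cluster assignment can be sketched as follows, assuming a hypothetical `clusters` list in which each element holds the identifiers of the five recordings grouped together by the SOM (with the two halves of the same USotW recording treated as a single item, per the data-leakage constraint above).

```python
import numpy as np

rng = np.random.default_rng(2022)  # seed shown here only for reproducibility

def assign_folds(clusters: list, n_folds: int = 5) -> dict:
    """Randomly assign each recording in a cluster of five to a distinct fold."""
    fold_of = {}
    for cluster in clusters:
        assert len(cluster) == n_folds, "each SOM cluster holds one item per fold"
        for recording, fold in zip(cluster, rng.permutation(n_folds)):
            fold_of[recording] = int(fold)  # folds numbered 0 to 4
    return fold_of
```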

3.4 Generation of Stimuli

Each stimulus in the ARAUS dataset is an augmented soundscape to be presented as a 30-second audio-visual stimulus to a human participant, and the procedure used to generate each stimulus is shown in Fig. 2.

Figure 2: Illustration of stimulus generation procedure for a single stimulus.

The audio for the augmented soundscapes was made by combining the 30-second binaural recordings of urban soundscapes in Section 3.1 with the 30-second single-channel recordings of maskers obtained in Section 3.2 at various gain levels, via element-wise addition of their respective gain-adjusted time-domain signals. For the purposes of stimulus generation, we also included silence in the set of possible maskers, since the addition of silence simply replicates the condition where no masker is added.

For the five-fold cross-validation set, the urban soundscapes and maskers were chosen randomly from the same fold for combination to prevent data leakage and provide a sample of all possible combinations. Since the fold allocation procedure described in Section 3.3 ensured that the sets of urban soundscapes and maskers used for each fold of the cross-validation set were disjoint, the augmented soundscapes generated from them were disjoint between each fold of the cross-validation set as well. On the other hand, for the independent test set, the urban soundscapes and maskers were exhaustively combined to cover all 48 possible combinations.

Before combining the urban soundscape and masker recordings for the five-fold cross-validation set, the urban soundscape recordings were calibrated to the in-situ $L_{\text{A,eq}}$ levels measured at the time of recording, and the maskers were calibrated to specific soundscape-to-masker ratios (SMRs) with respect to the urban soundscape, chosen randomly from the set {−6, −3, 0, +3, +6} dB. For example, if the urban soundscape recording had an in-situ $L_{\text{A,eq}}$ of 65 dB and an SMR of +3 dB was randomly chosen, then the masker would be calibrated to an $L_{\text{A,eq}}$ of 62 dB. This gave a total of 65,520 possible augmented soundscapes from which we sampled for the ARAUS dataset. All calibration was done via the automated method described by [80]. On the other hand, for the independent test set, a fixed SMR of 0 dB was used for all combinations. This was to limit the number of stimuli presented to each participant in the test set and the effect of listener fatigue, since the participants assigned to the test set were required to rate all combinations exhaustively, as explained in Section 3.7.

Lastly, the single-channel recordings were added to each channel of the binaural tracks, in a manner similar to that in [81], with a fixed stereo panning coefficient of 0.5. This audio was then overlaid onto the 0°-azimuth, 0°-elevation field of view cropped from the 360° video recordings from the USotW database taken at the same time as the binaural urban soundscape recordings to form the audio-visual stimuli.
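The arithmetic of the calibration and mixing steps can be sketched as follows. This is not the exact calibration pipeline used for the dataset (which follows the automated method of [80]), and the panning step of [81] is deliberately omitted and only noted in a comment.

```python
import numpy as np

def smr_gain(soundscape_laeq: float, masker_laeq: float, smr_db: float) -> float:
    """Linear gain that places the masker's L_A,eq SMR dB below the soundscape's.
    E.g. a soundscape at 65 dB with SMR = +3 dB gives a masker target of 62 dB."""
    target_laeq = soundscape_laeq - smr_db
    return 10.0 ** ((target_laeq - masker_laeq) / 20.0)

def mix_augmented_soundscape(
    soundscape: np.ndarray,  # binaural, shape (samples, 2), at in-situ level
    masker: np.ndarray,      # single-channel, shape (samples,)
    gain: float,             # from smr_gain()
) -> np.ndarray:
    """Element-wise addition of the gain-adjusted masker to both channels.
    (The stereo panning of [81], with coefficient 0.5, is omitted here.)"""
    return soundscape + gain * masker[:, np.newaxis]
```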

3.5 Participant Recruitment

Prior ethical approval was obtained from the Institutional Review Board, NTU (Ref. IRB 2020-08-035) before participant recruitment and response collection. Participants were recruited via online messaging channels, posters, and emails.

In total, 642 unique participants were recruited to provide their responses for the ARAUS dataset, of which 37 (5.76%) had their responses rejected. Hence, responses from only 605 participants were included in the final dataset. We rejected responses from participants who (1) failed a hearing test (19 participants, 2.96%); (2) failed more than three out of seven consistency checks described in Section 3.9 (17 participants, 2.65%); or (3) provided the same responses to any item in the Affective Response Questionnaire (ARQ) described in Section 3.7 for all stimuli they were presented, thereby providing no useful information to the dataset (3 participants, 0.47%). Two of the rejected participants failed both the hearing test and more than three out of seven consistency checks.

Participants aged under 30 were considered to have failed the hearing test if they had a mean threshold of hearing above 20 dB, and participants aged 30 and above were considered to have failed if they had a mean threshold of hearing above 30 dB, as measured via pure-tone audiometry using the uHear application on a mobile phone (Apple iPhone 4S) with earbuds (Apple EarPods). The tested frequencies were 0.5, 1, 2, 4, and 6 kHz. These were within the standard ranges used for screening in pure-tone audiometry [82] and previous soundscape research [83]. The higher threshold of 30 dB was applied to participants aged 30 and above to balance the risk of age bias (since age is highly correlated with hearing ability [84]) against the need to ensure that hearing loss did not interfere with the perception of the augmented soundscapes in the ARAUS dataset.

The participants were each assigned a fold such that each fold of the five-fold cross-validation set had responses from 120 participants. The independent test set had responses from 5 participants. The age and gender distributions of the participants are shown in Table III, and further information on participant demographics can be found in Appendix C. Due to the test sites for the data collection process being located within university campuses (as shown in Fig. 3), a majority of the 605 participants whose responses were included in the ARAUS dataset were students (443 participants, 73.2%) who were in the process of obtaining their bachelor’s degree (380 participants, 62.8%). Participants were also relatively young (mean age 26.7 years, standard deviation 10.0 years) compared to those in the ISD (mean age 33.8 years, standard deviation 14.57 years) [85], but slightly older compared to those in the IADS-2 (college students) [46] and IADS-E (mean age 21.32 years, standard deviation 2.38 years) [48] datasets.

TABLE III: Age and gender distribution of participants in the ARAUS dataset, by fold and in entirety. (To safeguard their identities, the small number of participants identifying as neither male nor female (2 participants, 0.33%) were randomly assigned as “male” or “female” in the public release of the ARAUS dataset, each with a probability of 50%.)
Fold
Statistic Test 1 2 3 4 5 All
Sample size 5 120 120 120 120 120 605
# female 4 62 56 69 68 69 328
# male 1 58 64 51 52 51 277
Mean age 22.4 27.4 26.7 26.3 27.3 26.3 26.7
Std. dev. of age 5.5 10.2 9.8 10.4 11.0 8.7 10.0
Minimum age 18 19 18 18 18 18 18
Median age 21 24 24 23 23 24 24
Maximum age 32 63 71 68 65 60 71

3.6 Listening Conditions

Participants were presented with all audio-visual stimuli via closed-back headphones (Beyerdynamic Custom One Pro) powered by an external sound card (SoundBlaster E5). The video was presented via a 23-inch monitor (Philips 236E SoftBlue) and measured 21.5 cm by 12 cm on the screen. Participants sat about 1 m from the monitor, facing it.

Due to the large number of participants involved in the experiment, participants listened to the stimuli in one of four quiet rooms, three located in NTU and one in the Singapore University of Technology and Design (SUTD). Photos of each of the four quiet rooms are shown in Fig. 3.

Figure 3: Test sites at (top left) Academic Media Studio, SUTD, (top right) Media Technology Laboratory, NTU, (bottom left) Demo Room, NTU, (bottom right) Interactive Soundscape Room, NTU. Their noise floors, measured as $L_{\text{A,eq,3-min}}$ values with a B&K Sound Level Meter Type 2240, were 20.6 dB, 26.0 dB, 36.9 dB, and 30.2 dB, respectively.

3.7 Affective Response Questionnaire

For each audio-visual stimulus, participants were instructed to perform their evaluations given the following prompt:

Imagine that you are standing at the location shown in the video, listening to the sound environment playing through the headphones.

After completely experiencing the stimulus at least once, participants were presented with a set of 9 questions from the Method A questionnaire in ISO/TS 12913-2:2018 [20], which we refer to as the “Affective Response Questionnaire” (ARQ). The first 8 questions were related to the perceived affective quality of the sound they heard over the headphones:

To what extent do you agree or disagree that the present surrounding sound environment is {pleasant, eventful, chaotic, vibrant, uneventful, calm, annoying, monotonous}?

The last question was related to the appropriateness of the location depicted in the video they saw on the monitor with respect to the sound they heard over the headphones:

To what extent is the present surrounding sound environment appropriate to the present place?

Participants responded to all questions on a five-point Likert scale via a computerized graphical user interface (GUI) depicted in Fig. 4, and we coded their responses to values in $\{1, 2, 3, 4, 5\}$ to match the scale.

Figure 4: GUI used to administer the ARQ for the ARAUS dataset.

For the cross-validation set, participants responded to the ARQ for 42 unique, randomly-selected stimuli from the fold that they were assigned to. For the independent test set, all participants responded to the ARQ for the same 48 exhaustively-generated stimuli.

In addition to the “main” 42 (48) stimuli shown to each participant in the cross-validation (test) set, we presented each participant with three auxiliary stimuli. Participants were also required to provide responses to the ARQ for these stimuli, but the responses are not part of the ARAUS dataset, because these stimuli were the same for every participant and did not serve the same purpose as the main stimuli. These stimuli were namely:

  (1) A “pre-experiment” stimulus, which was shown as the first stimulus before presenting the main stimuli for the cross-validation/test set.

  (2) An “attention” stimulus, which was identical to the “pre-experiment” stimulus and shown in between two randomly selected “main” stimuli for the cross-validation/test set. For this stimulus, the GUI had special instructions for the participant to choose specific options for the ARQ before they could proceed.

  (3) A “post-experiment” stimulus, which was identical to the “pre-experiment” stimulus and shown as the last stimulus after presenting the main stimuli for the cross-validation/test set. ARQ responses to this stimulus were used for internal consistency checks, as described in Section 3.9.

3.8 Participant Information Questionnaire

Since the ISO 12913-1:2014 definition of “soundscape” mandates that acoustic environments must be perceived in context [86], information specific to the listeners who participated in the study was also necessary to understand the context behind listener perceptions of the augmented soundscapes presented as stimuli. Hence, we administered another questionnaire, which we dub the “Participant Information Questionnaire” (PIQ), to each participant after they had completed the ARQs for the stimuli shown to them. The PIQ consisted of items related to basic demographic information and standard psychological questionnaires.

Items related to basic demographic information consisted of age, gender, spoken languages, citizenship status, education status, occupational status, dwelling type, ethnicity, and length of residence of the participant.

The psychological questionnaires administered were (1) a shortened, 10-item version of the Weinstein Noise Sensitivity Scale (WNSS-10) [87]; (2) a shortened, 10-item version of the Perceived Stress Scale (PSS-10) [88]; (3) the WHO-5 Well-Being Index [89]; and (4) the Positive and Negative Affect Schedule (PANAS) [90].

These questionnaires are relevant because noise sensitivity [91] and stress level [92] play a significant role in the affective perception of soundscapes, the WHO-5 Well-Being Index is used in the Soundscape Indices (SSID) Protocol [53], and PANAS was previously used by [93] as a measure of participants’ mood in a study of the emotional salience of sounds in soundscapes. The PSS-10 has been validated by the authors of the original 14-item questionnaire [94], but the particular shortened version of the Weinstein Noise Sensitivity Scale (WNSS-10) used for the ARAUS dataset has not previously been validated in the literature. Nonetheless, the internal reliability of the WNSS-10 is affirmed by the results of the analysis in Section 4.2.

Full details of the PIQ with individual questions, options, and response coding for each section of the questionnaire can be found in Appendix B.

3.9 Consistency Checks

In order to ensure a baseline level of data quality in the responses to the ARQ, seven consistency checks were performed on each participant’s responses. The consistency checks were designed as single-value metrics, and a participant was considered to have failed a consistency check if the corresponding value of the metric was at least 1. If a participant failed more than three out of seven consistency checks, their responses were dropped from the dataset.

To describe the single-value metrics, we first define $r_{\text{pl}}, r_{\text{ev}}, r_{\text{ch}}, r_{\text{vi}}, r_{\text{ue}}, r_{\text{ca}}, r_{\text{an}}, r_{\text{mo}}, r_{\text{ap}} \in \{1, 2, 3, 4, 5\}$ as the responses to the ARQ in Section 3.7 regarding the extent to which the sound environment in a given stimulus was respectively pleasant, eventful, chaotic, vibrant, uneventful, calm, annoying, monotonous, and appropriate.

The seven consistency checks consist of three types of checks. The first check measures the mean absolute difference (MAD) between ARQ responses on the “pre-experiment” and “post-experiment” stimuli. Since the “pre-experiment” and “post-experiment” stimuli were identical, a perfectly consistent participant would have provided the same responses to both presentations.

Since the affective descriptors “pleasant” and “annoying” lie at opposite ends of the same axis of the ISO/TS 12913-3:2019 circumplex model of soundscape perception [11], a perfectly consistent participant would have provided the same response for “pleasant” and “annoying” if the Likert coding of either were to be reversed. The same considerations apply to the other three pairs of opposite attributes on the circumplex model. Therefore, the next four checks consider the MADs between $r_p$ and $(6 - r_q)$, where $(p, q)$ is a pair of opposite descriptors.

Lastly, the mean squared error (MSE) between $r_{\text{pl}}$ and $(3 + 2P)$, and that between $r_{\text{ev}}$ and $(3 + 2E)$, across all stimuli presented was computed, where

$$P = k^{-1}\left(\sqrt{2}\,r_{\text{pl}} - \sqrt{2}\,r_{\text{an}} + r_{\text{ca}} - r_{\text{ch}} + r_{\text{vi}} - r_{\text{mo}}\right), \quad (1)$$
$$E = k^{-1}\left(\sqrt{2}\,r_{\text{ev}} - \sqrt{2}\,r_{\text{ue}} - r_{\text{ca}} + r_{\text{ch}} + r_{\text{vi}} - r_{\text{mo}}\right), \quad (2)$$

respectively are the normalized values of “ISO Pleasantness” and “ISO Eventfulness” as suggested in [95], and $k = 8 + \sqrt{32}$ is a normalization constant such that $P, E \in [-1, 1]$. Since the affective descriptor “pleasant” theoretically parallels the principal axis of “Pleasantness” in the ISO/TS 12913-3:2019 circumplex model of soundscape perception, a perfectly consistent participant would have provided responses matching in magnitude in both directions. The rescaling of $P$ as $(3 + 2P)$ was necessary to match the range of values of $r_{\text{pl}}$ for a valid comparison. Similar considerations apply for the affective descriptor “eventful” and the principal axis “Eventfulness”.
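To make these checks concrete, the sketch below computes $P$, $E$, and the three types of consistency metrics from a participant’s coded ARQ responses. It is an illustrative reimplementation consistent with Equations (1) and (2), using hypothetical dictionary keys for the nine ARQ attributes, rather than the exact replication code.

```python
import numpy as np

K = 8 + np.sqrt(32)  # normalization constant so that P, E lie in [-1, 1]

# Hypothetical keys: "pl", "ev", "ch", "vi", "ue", "ca", "an", "mo", "ap",
# each holding a coded ARQ response in {1, 2, 3, 4, 5}.

def iso_pleasantness(r: dict) -> float:
    """Equation (1): normalized ISO Pleasantness of a single ARQ response."""
    return (np.sqrt(2) * (r["pl"] - r["an"]) + (r["ca"] - r["ch"]) + (r["vi"] - r["mo"])) / K

def iso_eventfulness(r: dict) -> float:
    """Equation (2): normalized ISO Eventfulness of a single ARQ response."""
    return (np.sqrt(2) * (r["ev"] - r["ue"]) + (r["ch"] - r["ca"]) + (r["vi"] - r["mo"])) / K

def pre_post_mad(pre: dict, post: dict) -> float:
    """Check 1: MAD between responses to the identical pre-/post-experiment stimuli."""
    return float(np.mean([abs(pre[k] - post[k]) for k in pre]))

def opposite_pair_mad(responses: list, p: str, q: str) -> float:
    """Checks 2-5: MAD between r_p and (6 - r_q) for an opposite pair (p, q),
    e.g. ("pl", "an"), over all stimuli presented to a participant."""
    return float(np.mean([abs(r[p] - (6 - r[q])) for r in responses]))

def axis_mse(responses: list, attr: str, axis_fn) -> float:
    """Checks 6-7: MSE between r_attr and the rescaled axis value (3 + 2 * axis),
    with axis_fn being iso_pleasantness (attr "pl") or iso_eventfulness (attr "ev")."""
    return float(np.mean([(r[attr] - (3 + 2 * axis_fn(r))) ** 2 for r in responses]))
```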

4 Data Analysis

After the data collection described in Section 3 had been completed, we performed analyses to verify data quality and empirical consistency with known literature.

Firstly, the ARQ responses were analyzed to compare the effect of different maskers on the normalized ISO Pleasantness $P$ of the augmented soundscapes, and validate that the stimuli in the ARAUS dataset spanned the perceptual space generated by the ISO Pleasantness and ISO Eventfulness axes. Statistical and internal reliability tests were then performed on the PIQ responses and consistency check metrics to ensure that the distribution of data and responses in each fold of the cross-validation set did not significantly differ. This allowed us to assess the degree to which the methodological efforts to minimize domain shift between folds were successful.

4.1 Affective Response Questionnaire

To investigate the effect each masker had on the ISO Pleasantness of the urban soundscapes it was added to, we first used the ARQ responses to compute the ISO Pleasantness values for all the stimuli in the ARAUS dataset using Equation (1). The difference in ISO Pleasantness between each base urban soundscape (i.e., augmented with silence) and the same urban soundscape augmented with each masker was then computed. These differences were averaged across all soundscapes to which the same masker, regardless of the SMR, was added, to obtain the mean change in ISO Pleasantness effected by each masker across different soundscapes. This allowed us to determine which maskers were optimal for augmentation, at least on average in a naive sense across the urban soundscapes in the ARAUS dataset.
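A sketch of this aggregation in pandas could look as follows, assuming a hypothetical responses table with columns `soundscape`, `masker`, and `P` (the ISO Pleasantness computed from each response), where the no-masker condition is labeled `"silence"`.

```python
import pandas as pd

def mean_pleasantness_change(responses: pd.DataFrame) -> pd.Series:
    """Mean change in ISO Pleasantness per masker, relative to the silent
    (unaugmented) condition of the same base soundscape."""
    # Mean P per (soundscape, masker) combination, pooled over SMRs/participants.
    per_combo = responses.groupby(["soundscape", "masker"])["P"].mean()
    # Baseline: the same soundscape augmented with the "silence" masker.
    baseline = per_combo.xs("silence", level="masker")
    delta = per_combo - baseline.reindex(
        per_combo.index.get_level_values("soundscape")).to_numpy()
    # Average the per-soundscape differences over soundscapes for each masker.
    return delta.groupby(level="masker").mean().drop(labels="silence")
```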

Figure 5 shows the mean change in ISO Pleasantness value as a function of the masker used to augment the soundscape, aggregated over soundscapes and SMRs used. Out of the 287 maskers in the ARAUS dataset, mean positive changes in ISO Pleasantness were only observed for maskers belonging to the bird and water classes. However, not all of the bird and water maskers showed mean positive changes, corroborating the findings by [92] that not all natural sounds are necessarily perceived as pleasant. Nevertheless, augmentation with 64 (78.0%) of the bird maskers and 16 (19.5%) of the water maskers, each out of 82, resulted in effective mean positive changes in ISO Pleasantness, which supports findings in [14, 54, 70] where specific bird and water sounds were found to have improved the perceived pleasantness of urban soundscapes.

Figure 5: Mean change in ISO Pleasantness value as a function of each of the 287 (280 cross-validation, 7 test set) maskers used to augment soundscapes in the ARAUS dataset, aggregated over soundscapes and SMRs used.

In contrast, mean negative changes in ISO Pleasantness were observed for all wind, traffic, and construction maskers in the ARAUS dataset. While the decrease in ISO Pleasantness due to the addition of traffic [96] and construction [97] maskers is expected, the decrease in ISO Pleasantness for all wind sounds used as maskers was contrary to the results of narrative interviews reported by [92], in which people tended to perceive wind rustling through trees as pleasant. This could be due to the range of SMRs used for the ARAUS dataset, which ranged only from $-6$ to $+6$ dB. The mean in-situ $L_{A,\text{eq}}$ of the tracks in the USotW database used for the ARAUS dataset was about 65 dB, but natural wind sounds in real-life environments tend to have a mean $L_{A,\text{eq}}$ of about 55 dB or less [98], which means that an SMR of $+10$ dB or higher may have been more appropriate to achieve an increase in perceived pleasantness for the wind maskers. At the SMRs used for the ARAUS dataset, the wind maskers may have been added at overly high levels, causing them to be perceived similarly to the traffic maskers due to their similar spectral characteristics and potentially contributing to the decrease in ISO Pleasantness. Additionally, the laboratory-based nature of the data collection process meant that the wind maskers were heard without the natural movements of air that would be present in an in-situ study, which could have created an artificial or unpleasant situation for the participants.

To investigate how well the ARAUS dataset stimuli and the base urban soundscapes from the USotW database spanned the perceptual space generated by the “Pleasantness–Eventfulness” axes, we first computed the normalized values of $P$ and $E$ according to Equations (1) and (2) for each individual ARQ response in the ARAUS dataset. Then, we plotted a heat map and scatter plot of the responses on the $P$–$E$ axes, as shown in Fig. 6. We can see that both the ARAUS dataset stimuli and the USotW soundscapes covered the positive, neutral, and negative regions of both the ISO Pleasantness and the ISO Eventfulness axes, thereby validating their use in analysis related to the ISO 12913 standard.

Figure 6: Heat map (grayscale boxes) and scatter plot (yellow points) of ARQ responses in the “ISO Pleasantness” and “ISO Eventfulness” axes to (top) ARAUS dataset stimuli, (bottom) USotW soundscapes in ARAUS dataset. The brightness of grayscale boxes denotes the proportion of responses belonging to that region.

4.2 Participant Information Questionnaire

Since items in the PIQ such as the participant’s age [99] and education status [100] can affect the affective perception of soundscapes, the participants in each fold of the cross-validation set of the ARAUS dataset should ideally be drawn from distributions with similar demographics. This was not enforced during the participant recruitment and fold allocation process, so a post hoc analysis of the PIQ responses, via standard statistical tests for equality of distributions, was performed to assess the demographic distributions.

For items in the PIQ coded as categorical variables, $\chi^2$-tests were conducted between the responses obtained in each fold of the cross-validation set (treated as observed frequencies) and those obtained in the entire cross-validation set (treated as expected frequencies). No significant differences were observed, at the 5% significance level, in the distribution of all categorical variables between folds, with the lowest $p$-value being 0.1070 for the participants’ gender.

For items in the PIQ coded as continuous variables, Kruskal-Wallis tests were conducted by treating each fold as an independent group. The Kruskal-Wallis tests showed no significant differences, at the 5% significance level, in the distribution of all continuous variables between folds, with the lowest $p$-value being 0.3090 for the extent to which participants were annoyed by noise over the past 12 months.
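
The sketch below illustrates these two families of tests with SciPy on hypothetical per-fold responses; the counts and ratings are placeholders, not the actual ARAUS PIQ data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Categorical PIQ item (e.g., a demographic category with three levels):
# observed counts in one fold vs. expected counts scaled down from the
# entire cross-validation set (chi-squared goodness-of-fit test).
fold_counts = np.array([40, 55, 25])          # one fold of 120 participants
overall_counts = np.array([210, 260, 130])    # all five folds combined
expected = overall_counts / overall_counts.sum() * fold_counts.sum()
chi2, p_cat = stats.chisquare(f_obs=fold_counts, f_exp=expected)

# Continuous PIQ item (e.g., an annoyance rating): Kruskal-Wallis H-test
# treating each of the five folds as an independent group.
folds = [rng.normal(loc=3.0, scale=1.0, size=120) for _ in range(5)]
h_stat, p_cont = stats.kruskal(*folds)

print(f"chi-squared p = {p_cat:.4f}, Kruskal-Wallis p = {p_cont:.4f}")
```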

Together, these results indicate that the distribution of participants in a given fold did not significantly differ from the participants in any other fold, which reinforces the validity of using the five-fold cross-validation set of the ARAUS dataset to generate models for populations sharing similar characteristics to that of the entirety of the cross-validation set. Full results of the statistical tests and explicit distributions of responses by fold and PIQ item are detailed in Appendix C.

For the psychological questionnaires used in the PIQ (WNSS-10, PSS-10, WHO-5, and PANAS), Cronbach’s $\alpha$ [101] and McDonald’s $\omega$ [102] were computed as standard internal reliability coefficients to independently verify whether their items in aggregate were indeed measuring the same construct. This served as a validation study for the WNSS-10 and a replication study for the other questionnaires.

The internal reliability study results are shown in Table IV. All values of the internal reliability coefficients are above 0.8, which can be considered “high” given the number of items in each questionnaire [103]. Therefore, all items in each questionnaire indeed measured the same construct. In particular, the Cronbach’s $\alpha$ value of 0.835 for the WNSS-10 parallels similar validation studies for different iterations of the original 21-item version of the questionnaire [104], thereby confirming the reliability of the previously unvalidated version used for the ARAUS dataset.

TABLE IV: Reliability metrics for psychological questionnaires in the PIQ. Larger values are desirable, as denoted by the up arrow (↑).
Questionnaire Cronbach’s α (↑) McDonald’s ω (↑)
WNSS-10 0.835 0.837
PSS-10 0.875 0.874
WHO-5 0.854 0.857
PANAS (Positive) 0.886 0.891
PANAS (Negative) 0.891 0.891
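
For reference, Cronbach’s α can be computed directly from a participants-by-items response matrix using its standard formula; the sketch below uses synthetic data and is not tied to the actual ARAUS questionnaire responses.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) score matrix."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Synthetic example: 600 participants answering a 10-item questionnaire whose
# items share a common underlying construct plus independent noise.
rng = np.random.default_rng(0)
construct = rng.normal(size=(600, 1))
responses = construct + 0.8 * rng.normal(size=(600, 10))
print(f"Cronbach's alpha = {cronbach_alpha(responses):.3f}")
```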

4.3 Consistency Checks

In a similar manner to the PIQ, Kruskal-Wallis tests for each of the seven single-value consistency metrics were performed by treating the ARQ responses in each fold as independent groups. Full results are presented in Appendix C.

No significant differences were observed, at the 5% significance level, in the distribution of all continuous variables between folds, with the lowest $p$-value being 0.2545 for the MAD between “vibrant” and reversed “monotonous” ratings. This indicates that the distribution of response consistency in a given fold did not significantly differ from that in any other fold, reinforcing the validity of using the five-fold cross-validation set of the ARAUS dataset to generate generalizable and unbiased models, at least from the perspective of consistency with the ISO 12913-3:2019 circumplex model of soundscape perception.

5 Affective Response Modeling

To illustrate how the ARAUS dataset can be used for systematic and fair benchmarking of models for affective perception, four models were trained to predict the ISO Pleasantness, as defined in Equation (1), of a given augmented soundscape using the ARAUS dataset. The models were a relatively low-parameter regression model, a convolutional neural network (CNN), and two different probabilistic perceptual attribute predictor (PPAP) models [105, 106]. A dummy model that always predicts the mean of the training data labels was additionally used as a naive benchmark.

Using a five-fold cross-validation scheme, each model was trained five times, each time with a different fold of the cross-validation set used as the validation set (5040 samples) and the remaining four folds used as the training set (20,160 samples). For models sensitive to random initialization, results from 10 different seeds per validation fold were recorded, totalling 50 runs. After training, each model was evaluated on the independent test set (48 samples).
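
A minimal sketch of this evaluation protocol is shown below; `train_model` and `evaluate_mse` are hypothetical stand-ins for any of the baseline models, and the fold handling simply follows the five-fold layout described above.

```python
def run_cross_validation(samples_by_fold, test_samples,
                         train_model, evaluate_mse, n_seeds=10):
    """Train on four folds, validate on the held-out fold, then evaluate on
    the independent test set, repeating over seeds for seed-sensitive models."""
    results = []
    for val_fold in range(5):
        train = [s for f in range(5) if f != val_fold for s in samples_by_fold[f]]
        val = samples_by_fold[val_fold]
        for seed in range(n_seeds):
            model = train_model(train, val, seed=seed)
            results.append({
                "val_fold": val_fold,
                "seed": seed,
                "val_mse": evaluate_mse(model, val),
                "test_mse": evaluate_mse(model, test_samples),
            })
    return results
```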

5.1 Models

5.1.1 Regularized Linear Regression

The linear regression models used acoustic and psychoacoustic indicators of an augmented soundscape as input features to predict its ISO Pleasantness. Firstly, the statistics detailed in Section 3.3.2 were computed for each binaural augmented soundscape in the ARAUS dataset, which gave 264 possible input features to be used for regression. The features, without prior dimensionality reduction, were used to train elastic net models [107] with $L_1$ and $L_2$ regularization weights of 0.5 and 0.25, respectively.

Elastic net models are designed for parameter sparsity, since the elastic net loss function causes most weights to be set to zero after training concludes. This indirectly allows the method to automatically choose suitable parameters for regression from an initially larger selection, and makes it more computationally efficient than stepwise regression.
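
The sketch below shows one way to fit such a model with scikit-learn; mapping the L1 and L2 weights of 0.5 and 0.25 onto scikit-learn’s (alpha, l1_ratio) parametrization is our own assumption here, since that library expresses the penalty as alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Synthetic stand-in for the 264 acoustic/psychoacoustic features and the
# normalized ISO Pleasantness labels of the training-set stimuli.
X = rng.normal(size=(1000, 264))
y = rng.uniform(-1.0, 1.0, size=1000)

# With alpha = 1.0 and l1_ratio = 0.5, the L1 term is weighted by 0.5 and the
# squared-L2 term by 0.25, matching the weights reported in the text.
model = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000)
model.fit(X, y)

# Sparsity: most coefficients are driven to exactly zero.
n_nonzero = np.count_nonzero(model.coef_)
print(f"{n_nonzero} of {X.shape[1]} coefficients are non-zero")
```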

5.1.2 Convolutional Neural Network

The CNN models were trained to predict the ISO Pleasantness of an augmented soundscape using its log-mel spectrogram. Firstly, the channel-wise log-mel spectrograms were computed for each binaural augmented soundscape in the ARAUS dataset, with a Hann window length of 4096 samples, 50% overlap between windows, and 64 mel frequency bands from 20 Hz to 20 kHz.
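
A minimal sketch of this feature extraction using librosa is shown below, assuming 30-second binaural stimuli sampled at 44.1 kHz; the exact framing options (e.g., center padding) in the authors’ replication code may differ.

```python
import numpy as np
import librosa

SR = 44100            # assumed sampling rate of the stimuli
N_FFT = 4096          # Hann window length of 4096 samples
HOP = N_FFT // 2      # 50% overlap between windows

def logmel(audio_stereo: np.ndarray) -> np.ndarray:
    """Channel-wise log-mel spectrogram for a (2, n_samples) binaural signal,
    returned as a (frames, 64, 2) tensor."""
    channels = []
    for ch in audio_stereo:
        mel = librosa.feature.melspectrogram(
            y=ch, sr=SR, n_fft=N_FFT, hop_length=HOP, window="hann",
            center=False, n_mels=64, fmin=20.0, fmax=20000.0)
        channels.append(librosa.power_to_db(mel))            # (64, frames)
    return np.stack(channels, axis=-1).transpose(1, 0, 2)    # (frames, 64, 2)

# Example with 30 s of silence as a placeholder for an augmented soundscape.
x = np.zeros((2, 30 * SR))
print(logmel(x).shape)   # approximately (644, 64, 2) with these settings
```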

The log-mel spectrograms (as 644-by-64-by-2 tensors) were then used as input to the CNN models with the model architecture shown in Fig. 7. The architecture is identical to the baseline model architecture used for the acoustic scene classification task (Task 1B) of the DCASE 2020 Challenge [41]. However, due to the slightly larger input dimensions of the ARAUS data, the modified models shown in Fig. 7 contained about 142K parameters, compared to the original 116K in [41], with the only difference being the input dimensions of the final dense layer.

Figure 7: Baseline CNN model architecture, adapted from [41].

The CNN models were trained over 100 epochs with an Adam optimizer [108] with a learning rate of $1\times10^{-4}$ and a batch size of 32. Training was stopped early if the validation set MSE did not improve for 10 consecutive epochs. This process was repeated for 10 runs per validation fold to obtain the mean performance of the models.
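
A sketch of this training configuration in Keras is given below; `build_cnn` is a hypothetical stand-in for the Fig. 7 architecture, which is not reproduced here, and the use of Keras itself is an assumption rather than a statement about the authors’ replication code.

```python
import tensorflow as tf

def train_cnn(build_cnn, train_ds, val_ds):
    """Train an ISO Pleasantness regressor with the hyperparameters described
    in the text: Adam optimizer, learning rate 1e-4, MSE loss, up to 100
    epochs, and early stopping after 10 epochs without validation improvement.
    The batch size of 32 is assumed to be set when building train_ds/val_ds."""
    model = build_cnn(input_shape=(644, 64, 2))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="mse")
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True)
    model.fit(train_ds, validation_data=val_ds,
              epochs=100, callbacks=[early_stop])
    return model
```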

5.1.3 Probabilistic Perceptual Attribute Predictor

The PPAP models were used to model the subjectivity in ARQ responses by outputting predictions for ISO Pleasantness based on probability distributions rather than deterministic values. As described in [105, 106], we trained the PPAP models using the normal distribution $N(\mu, \sigma^2)$, with the loss function being the log-probability of the ground-truth response being observed, given the output distribution parameters $\mu$ and $\sigma$.
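
The sketch below shows a generic probabilistic output head and a Gaussian log-likelihood-based training objective of the kind described here; it is a simplified stand-in, not the PPAP architecture of [105, 106].

```python
import tensorflow as tf

def gaussian_nll(y_true, y_pred):
    """Negative log-likelihood of y_true under N(mu, sigma^2), up to an
    additive constant. The model outputs mu and log(sigma^2) as a
    two-column tensor; y_true has shape (batch, 1)."""
    mu, log_var = y_pred[:, 0], y_pred[:, 1]
    return tf.reduce_mean(
        0.5 * (log_var + tf.square(y_true[:, 0] - mu) * tf.exp(-log_var)))

def probabilistic_head(features: tf.Tensor) -> tf.Tensor:
    """Map a feature vector to the parameters (mu, log sigma^2) of a normal
    distribution over ISO Pleasantness."""
    return tf.keras.layers.Dense(2)(features)
```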

Using the ARAUS dataset responses, we trained two different variations of the PPAP models: one performing augmentation in the time domain (i.e., taking in log-mel spectrograms of the augmented soundscapes as initial input) [105], and one performing augmentation in the feature domain (i.e., taking in log-mel spectrograms of base urban soundscapes and maskers separately as initial inputs, and subsequently performing augmentation on features extracted from them) [106], both with multi-head attention in the feature mapping blocks. The parameters for the computation of the log-mel spectrograms and the training procedure for the PPAP models were similar to those of the CNN models.

5.2 Results and Discussion

TABLE V: Mean squared error values (± standard deviation over 10 runs where applicable) for models described in Section 5. Smaller values are desirable, as denoted by the down arrow (↓).
Mean squared error (↓)
Model Params Train Validation Test
Dummy (mean of labels) 0.1551 0.1553 0.1192
Elastic net 4 0.1351 0.1357 0.0930
CNN based on DCASE 2020 Task 1B baseline [41] 142K 0.1152 ± 0.0020 0.1212 ± 0.0010 0.0865 ± 0.0063
Multi-head attention PPAP, additive time-domain augmentation [105] 123K 0.1098 ± 0.0047 0.1216 ± 0.0016 0.0889 ± 0.0064
Multi-head attention PPAP, additive feature-domain augmentation [106] 515K 0.1086 ± 0.0036 0.1238 ± 0.0022 0.0838 ± 0.0103

The label means for the normalized ISO Pleasantness $P$ were 0.0208, 0.0317, 0.0307, 0.0281, and 0.0225 when folds 1, 2, 3, 4, and 5, respectively, were left out as the validation fold for the derivation of the dummy models. Hence, these were also the exact predictions given by all dummy models regardless of the soundscape presented. Considering that $P \in [-1, 1]$, the difference in label means by fold is also insubstantial, since they are within 0.5% of the full range. In fact, the distributions of the training set labels themselves show no significant difference by fold, as shown in Fig. 8.

Figure 8: Violin plots of distributions of normalized ISO Pleasantness (ground-truth labels) in training set, by fold left out as validation fold.

After training the elastic net models, the weights for all but three input features were set to zero regardless of the fold used as the validation set. Taking the mean of weights for the regression models across the 5-fold cross-validation set, this gave the general regression model

$$\widehat{P} = \begin{bmatrix} +0.3035 \\ -7.8731\cdot 10^{-3} \\ -0.4512\cdot 10^{-3} \\ -1.7326\cdot 10^{-3} \end{bmatrix}^{\mathsf{T}} \begin{bmatrix} 1 \\ N_{\text{max}} \\ M_{10\,\text{kHz}} \\ M_{12.5\,\text{kHz}} \end{bmatrix}, \qquad (3)$$

where $N_{\text{max}}$ is the maximum loudness (in sone), $M_f$ is the power (in dB) at the one-third octave band with center frequency $f$, the prediction for the normalized ISO Pleasantness value $P$ is denoted as $\widehat{P}$, and the regression coefficients are not normalized.

This implies that, of the 264 acoustic and psychoacoustic parameters used as input features to train the regression models, the features that most significantly impacted the ISO Pleasantness were $N_{\text{max}}$, $M_{10\,\text{kHz}}$, and $M_{12.5\,\text{kHz}}$, at least based on the responses in the ARAUS dataset. Incidentally, for the stimuli in the ARAUS dataset, we observed that the ranges of these features were $N_{\text{max}} \in [8.36, 94.1]$, $M_{10\,\text{kHz}} \in [29.62, 85.15]$, and $M_{12.5\,\text{kHz}} \in [32.51, 78.67]$, which meant that the predictions had a range of $\widehat{P} \in [-0.6121, 0.1680]$. The negative coefficients of the features in Equation (3) also suggest that increasing the peak loudness and high-frequency components of an augmented soundscape correlates with a decrease in its perceived pleasantness, corroborating the respective findings by [109] and [110] regarding general acoustic environments.
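
For illustration, Equation (3) can be evaluated directly; the sketch below applies the reported mean coefficients to the two extreme feature combinations observed in the dataset, reproducing the stated prediction range.

```python
import numpy as np

# Mean elastic net coefficients from Equation (3):
# intercept, N_max (sone), M_10kHz (dB), M_12.5kHz (dB).
w = np.array([0.3035, -7.8731e-3, -0.4512e-3, -1.7326e-3])

def predict_pleasantness(n_max: float, m_10k: float, m_12_5k: float) -> float:
    """Predicted normalized ISO Pleasantness from Equation (3)."""
    return float(w @ np.array([1.0, n_max, m_10k, m_12_5k]))

# Extremes of the observed feature ranges in the ARAUS stimuli.
print(predict_pleasantness(94.1, 85.15, 78.67))   # approx. -0.61
print(predict_pleasantness(8.36, 29.62, 32.51))   # approx. +0.17
```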

Lastly, Table V compares the performance of the four baseline models and the dummy model with respect to the MSEs achieved across their training, validation, and test sets in the 5-fold cross-validation scheme used to train them. All baseline models performed better than the dummy model on the training, validation, and test sets, which suggests that the features learned by the baseline models to make ISO Pleasantness predictions were indeed meaningful.

In addition, the CNN and PPAP models performed significantly better than the elastic net models, with the lowest mean training, validation, and test set MSEs of 0.1086, 0.1212, and 0.0838 observed for the PPAP performing feature-domain augmentation [106], the CNN [41], and the PPAP performing feature-domain augmentation [106], respectively. This is expected due to the vast difference in the numbers of parameters used by the CNN and PPAP models (142K to 515K) for prediction, in comparison to the regression models (4 parameters) and the dummy model. Nevertheless, none of the models overfitted to the dataset, as can be seen from the relatively similar values of the training and validation set MSEs across all models, which demonstrates the versatility of the ARAUS dataset for use in training generalizable high-parameter models.

Notably, the test set MSE values are also smaller than those of the validation sets, which could be due to the small size of the test set (48 samples) compared to the validation sets (5040 samples each) making the test set “easier”, from the perspective of the prediction models, than their respective validation sets.

6 Limitations and Future Work

While the ARAUS dataset utilized recordings of in-situ urban soundscapes as part of its stimuli, the laboratory-based data collection process means that models trained using the ARAUS dataset may require confirmation of their ecological validity with follow-up in-situ experiments or soundwalks. The range of SMRs used for the generation of the augmented soundscapes in the ARAUS dataset ($-6$ to $+6$ dB) is also a potential limitation, since real-life or virtual sound sources used for augmentation could in general have relative differences outside of this range. Future data collection related to the ARAUS dataset could therefore consider widening the range of SMRs used to generate stimuli, which can be done by changing input parameter settings in the provided replication code. Not all combinations of urban soundscapes, maskers, and SMRs were exhaustively generated for the current iteration of the ARAUS dataset, so responses could also be collected exhaustively in the future to make the dataset fully comprehensive. Responses could alternatively be collected continuously over time, instead of just once after the presentation of each stimulus, in order to study the consistency of participants’ affective responses over time. The visual content of the stimuli could also be varied, such as by using video recordings captured at completely different locations for the same audio recording, in order to investigate changes in affective responses due purely to changes in the visual content.

Moreover, since the participants in the ARAUS dataset were mostly young university students, the results and analysis obtained may not necessarily translate equivalently to a more general population of people exposed to urban soundscapes. The inclusion of only five participants in the test set, as compared to 120 in each fold of the cross-validation set, is also a primary limitation of the dataset. At the scale of the test set, person-to-person differences in perception could dominate any other factor contributing to perception, thereby causing the test set to serve more as a small focus group of participants rather than a general sample representative of the same population as the cross-validation set. Hence, future extensions to the ARAUS dataset could concentrate on enlarging the test set and obtaining responses from participants from an older demographic, to allow for improved benchmarking of models trained using the dataset.

The models described in this article were also trained using the individual augmented soundscapes in the ARAUS dataset in isolation from each other, so they did not account for the effects of temporal successions of different maskers, and hence different augmented soundscapes. Naïvely applying them to choose a time series of optimal maskers for an extended duration of time could lead to a dissonant succession of maskers that may inadvertently result in an unpleasant augmented soundscape overall, despite the individual maskers being predicted as optimal for individual time windows. Further work on these models could thus look into generalizing them to extended durations of time. The models could also be fine-tuned via transfer learning methods on the affective soundscape datasets in Table I (which possess labels of a different nature), and be compared with other benchmarks to assess the amenability of the ARAUS dataset to transfer learning.

Lastly, perceptual indicators other than those defined in ISO/TS 12913-2:2018, such as perceived loudness or tranquility, could also be used as part of the ARQ to expand its scope and allow the ARAUS dataset to possess greater generalizability.

7 Conclusion

In conclusion, we presented the Affective Responses to Augmented Urban Soundscapes (ARAUS) dataset, which functions as a benchmark dataset for comparisons of prediction models for the affective perception of urban and augmented soundscapes. We first presented a systematic methodology for the collection of data, which can be replicated or extended upon. Subsequently, we analyzed the responses obtained for the ARAUS dataset, and provided benchmark models for predicting the perceptual attribute of ISO Pleasantness. To the best of our knowledge, the ARAUS dataset is currently the largest soundscape dataset with perceptual labels, but is not without its inherent limitations as described in Section 6.

Nonetheless, we hope that the ARAUS dataset becomes a beneficial and enduring resource for the soundscape community, by assisting soundscape researchers in developing more accurate, robust models for soundscape perception.

Acknowledgments

We would like to thank Prof. Chen Jer-Ming, Mr. Tan Yi Xian, and Ms. Cindy Lin for assisting with the administrative arrangements and setup of the test site at the Singapore University of Technology and Design.

This research is supported by the Singapore Ministry of National Development and the National Research Foundation, Prime Minister’s Office under the Cities of Tomorrow Research Programme (Award No. COT-V4-2020-1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the view of National Research Foundation, Singapore, and Ministry of National Development, Singapore.

References

  • [1] M. Raimbault, C. Lavandier, and M. Bérengier, “Ambient sound assessment of urban environments: Field studies in two French cities,” Applied Acoustics, vol. 64, no. 12, pp. 1241–1256, 2003.
  • [2] P. Jennings and R. Cain, “A framework for improving urban soundscapes,” Applied Acoustics, vol. 74, no. 2, pp. 293–299, 2013.
  • [3] K. M. De Paiva Vianna, M. R. Alves Cardoso, and R. M. C. Rodrigues, “Noise pollution and annoyance: An urban soundscapes study,” Noise and Health, vol. 17, no. 76, pp. 125–133, 2015.
  • [4] J. Kang, et al., “Towards soundscape indices,” in 23rd International Congress on Acoustics, 2019, pp. 2488–2495.
  • [5] H. M. E. Miedema and H. Vos, “Exposure-response relationships for transportation noise,” The Journal of the Acoustical Society of America, vol. 104, no. 6, pp. 3432–3445, 1998.
  • [6] B. De Coensel and D. Botteldooren, “Models for soundscape perception and their use in planning,” in Proceedings of Inter-Noise 2007, 2007.
  • [7] O. Axelsson, M. E. Nilsson, B. Hellström, and P. Lundén, “A field experiment on the impact of sounds from a jet-and-basin fountain on soundscape quality in an urban park,” Landscape and Urban Planning, vol. 123, pp. 49–60, 2014.
  • [8] Z. Abdalrahman and L. Galbrun, “Audio-visual preferences, perception, and use of water features in open-plan offices,” Journal of the Acoustical Society of America, vol. 147, no. 3, pp. 1661–1672, 2020.
  • [9] J. Y. Hong, et al., “The effects of spatial separations on water sound and traffic noise sources on soundscape assessment,” Building and Environment, vol. 167, no. 106423, 2020.
  • [10] F. Aletta and J. Kang, “Towards an urban vibrancy model: A soundscape approach,” International Journal of Environmental Research and Public Health, vol. 15, no. 8, 2018.
  • [11] International Organization for Standardization, ISO 12913-3:2019 - Acoustics - Soundscape - Part 3: Data analysis.   Geneva, Switzerland: International Organization for Standardization, 2019.
  • [12] T. Van Renterghem, et al., “Interactive soundscape augmentation by natural sounds in a noise polluted urban park,” Landscape and Urban Planning, vol. 194, no. October 2019, p. 103705, 2020.
  • [13] T. Wong, et al., “Deployment of an IoT System for Adaptive In-Situ Soundscape Augmentation,” in Proceedings of Inter-Noise 2022, 2022.
  • [14] T. M. Leung, C. K. Chau, and S. K. Tang, “On the study of effects on different types of natural sounds on the perception of combined sound environment with road traffic noise,” in Proceedings of Inter-Noise 2016, 2016, pp. 1764–1770.
  • [15] J. Y. Hong, et al., “A mixed-reality approach to soundscape assessment of outdoor urban environments augmented with natural sounds,” Building and Environment, vol. 194, no. July 2020, p. 107688, 2021.
  • [16] P. Aumond, A. Can, B. De Coensel, D. Botteldooren, C. Ribeiro, and C. Lavandier, “Modeling soundscape pleasantness using perceptual assessments and acoustic measurements along paths in urban context,” Acta Acustica united with Acustica, vol. 103, no. 3, pp. 430–443, 2017.
  • [17] V. Puyana-Romero, G. Ciaburro, G. Brambilla, C. Garzón, and L. Maffei, “Representation of the soundscape quality in urban areas through colours,” Noise Mapping, vol. 6, no. 1, pp. 8–21, 2019.
  • [18] F. Aletta, et al., “Soundscape assessment: Towards a validated translation of perceptual attributes in different languages,” in Proceedings of Inter-Noise 2020, 2020.
  • [19] M. Lionello, F. Aletta, and J. Kang, “A systematic review of prediction models for the experience of urban soundscapes,” Applied Acoustics, vol. 170, p. 107479, 2020.
  • [20] International Organization for Standardization, ISO 12913-2 Acoustics - Soundscape - Part 2: Data collection and reporting requirements.   Geneva, Switzerland: International Organization for Standardization, 2018.
  • [21] S. Wang, T. Heittola, A. Mesaros, and T. Virtanen, “Audio-visual scene classification: analysis of DCASE 2021 Challenge submissions,” in Proceedings of DCASE 2021 Workshop, 2021, pp. 45–49.
  • [22] I. Martín-Morató, T. Heittola, A. Mesaros, and T. Virtanen, “Low-complexity acoustic scene classification for multi-device audio: Analysis of DCASE 2021 challenge systems,” in Proceedings of DCASE 2021 Workshop, 2021, pp. 85–89.
  • [23] A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, and T. Virtanen, “A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection,” in Proceedings of DCASE 2021 Workshop, 2021, pp. 125–129.
  • [24] Y. Kawaguchi, et al., “Description and Discussion on DCASE 2021 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions,” in Proceedings of DCASE 2021 Workshop, 2021, pp. 186–190.
  • [25] J. F. Gemmeke, et al., “Audio Set: An ontology and human-labeled dataset for audio events,” in Proceedings of IEEE ICASSP 2017, 2017, pp. 776–780.
  • [26] S. Hershey, et al., “The benefit of temporally-strong labels in audio event classification,” in Proceedings of IEEE ICASSP 2021, 2021, pp. 366–370.
  • [27] ——, “AudioSet: Temporally-Strong Labels Download (May 2021),” 2021. [Online]. Available: https://research.google.com/audioset/download_strong.html
  • [28] ——, “CNN architectures for large-scale audio classification,” in Proceedings of IEEE ICASSP 2017, 2017, pp. 131–135.
  • [29] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 2014 ACM Multimedia Conference, 2014, pp. 1041–1044.
  • [30] K. J. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the 2015 ACM Multimedia Conference, 2015, pp. 1015–1018.
  • [31] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An Open Dataset of Human-Labeled Sound Events,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 30, pp. 829–852, 2022.
  • [32] F. Font, G. Roma, and X. Serra, “Freesound Technical Demo,” in Proceedings of the 2013 ACM Multimedia Conference, 2013, pp. 411–412.
  • [33] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, “Scaper: a library for soundscape synthesis and augmentation,” 2017, pp. 344–348.
  • [34] B. De Coensel, K. Sun, and D. Botteldooren, “Urban Soundscapes of the World: Selection and reproduction of urban acoustic environments with soundscape in mind,” in Proceedings of Inter-Noise 2017, 2017.
  • [35] M. Ciufo and D. Thomas, “EigenScape: A Database of Spatial Acoustic Scene Recordings,” Applied Sciences, vol. 7, 2017.
  • [36] M. Cartwright, et al., “SONYC Urban Sound Tagging (SONYC-UST): A Multilabel Dataset from an Urban Acoustic Sensor Network,” in Proceedings of DCASE 2019 Workshop, 2019.
  • [37] ——, “SONYC-UST-V2: An Urban Sound Tagging Dataset with Spatiotemporal Context,” in Proceedings of DCASE 2020 Workshop, 2020, pp. 16–20.
  • [38] A. Mesaros, T. Heittola, and T. Virtanen, “TUT database for acoustic scene classification and sound event detection,” in Proceedings of EUSIPCO 2016, 2016, pp. 1128–1132.
  • [39] A. Mesaros, et al., “DCASE 2017 challenge setup: tasks, datasets and baseline system,” in Proceedings of DCASE 2017 Workshop, 2017.
  • [40] A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” in Proceedings of DCASE 2018 Workshop, 2018.
  • [41] T. Heittola, A. Mesaros, and T. Virtanen, “Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions,” in Proceedings of DCASE 2020 Workshop, 2020.
  • [42] S. Wang, A. Mesaros, T. Heittola, and T. Virtanen, “A curated dataset of urban scenes for audio-visual scene analysis,” in Proceedings of IEEE ICASSP 2021, 2021, pp. 626–630.
  • [43] K. Ooi, et al., “A Strongly-Labelled Polyphonic Dataset of Urban Sounds with Spatiotemporal Context,” in Proceedings of APSIPA ASC 2021, 2021.
  • [44] A. Weisser, et al., “The Ambisonic Recordings of Typical Environments (ARTE) database,” Acta Acustica united with Acustica, vol. 105, no. 4, pp. 695–713, 2019.
  • [45] A. Politis, et al., “STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6387880
  • [46] M. M. Bradley and P. J. Lang, “The International Affective Digitized Sounds (2nd Edition; IADS-2): Affective ratings of sounds and instruction manual. Technical report B-3.” University of Florida, Gainesville, Fl., Tech. Rep., 2007.
  • [47] R. A. Stevenson and T. W. James, “Affective auditory stimuli: Characterization of the International Affective Digitized Sounds (IADS) by discrete emotional categories,” Behavior Research Methods, vol. 40, no. 1, pp. 315–321, 2008.
  • [48] W. Yang, et al., “Affective auditory stimulus database: An expanded version of the International Affective Digitized Sounds (IADS-E),” Behavior Research Methods, vol. 50, no. 4, pp. 1415–1429, 2018.
  • [49] J. Fan, M. Thorogood, and P. Pasquier, “Emo-soundscapes: A dataset for soundscape emotion recognition,” in Proceedings of ACII 2017, 2017, pp. 196–201.
  • [50] M. M. Bradley and P. J. Lang, “Measuring emotion: The self-assessment manikin and the semantic differential,” Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
  • [51] T. Giannakopoulos, M. Orfanidi, and S. Perantonis, “Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition,” in Proceedings of 25th International Conference on Multimedia Modeling, 2019, pp. 338–348.
  • [52] A. Mitchell, et al., “The International Soundscape Database: An integrated multimedia database of urban soundscape surveys – questionnaires with acoustical and contextual information,” 2021. [Online]. Available: https://doi.org/10.5281/Zenodo.5914762
  • [53] ——, “The Soundscape Indices (SSID) Protocol : A Method for Urban Soundscape Surveys — Questionnaires with Acoustical and Contextual Information,” Applied Sciences, vol. 10, no. 2397, pp. 1–27, 2020.
  • [54] Y. Hao, J. Kang, and H. Wortche, “Assessment of the masking effects of birdsong on the road traffic noise environment,” Journal of the Acoustical Society of America, vol. 140, no. 2, pp. 978–987, 2016.
  • [55] T. M. Leung, C. K. Chau, S. K. Tang, and J. M. Xu, “Developing a multivariate model for predicting the noise annoyance responses due to combined water sound and road traffic noise exposure,” Applied Acoustics, vol. 127, pp. 284–291, 2017.
  • [56] F. Aletta, T. Oberman, A. Mitchell, H. Tong, and J. Kang, “Assessing the changing urban sound environment during the COVID-19 lockdown period using short-term acoustic measurements,” Noise Mapping, vol. 7, no. 1, pp. 123–134, 2020.
  • [57] M. E. Lutman, “What is the risk of noise-induced hearing loss at 80, 85, 90 dB(A) and above?” Occupational Medicine, vol. 50, no. 4, pp. 274–275, 2000.
  • [58] B. Planqué and W.-P. Vellinga, “Xeno-canto: a 21st century way to appreciate Neotropical bird song,” Neotropical Birding, vol. 3, no. January, pp. 17–23, 2008.
  • [59] J. Y. Jeon, P. J. Lee, J. You, and J. Kang, “Acoustical characteristics of water sounds for soundscape enhancement in urban open spaces,” Journal of the Acoustical Society of America, vol. 131, no. 3, pp. 2101–2109, 2012.
  • [60] L. Galbrun and T. T. Ali, “Acoustical and perceptual assessment of water sounds and their use over road traffic noise,” Journal of the Acoustical Society of America, vol. 133, no. 1, pp. 227–237, 2013.
  • [61] B. De Coensel, S. Vanwetswinkel, and D. Botteldooren, “Effects of natural sounds on the perception of road traffic noise,” JASA Express Letters, vol. 129, no. 4, pp. 148–153, 2011.
  • [62] D. M. Ferraro, et al., “The phantom chorus: birdsong boosts human well-being in protected areas,” Proceedings of the Royal Society B, vol. 287, no. 1941, 2020.
  • [63] M. Hedblom, I. Knez, Ode Sang, and B. Gunnarsson, “Evaluation of natural sounds in urban greenery: Potential impact for urban nature preservation,” Royal Society Open Science, vol. 4, no. 2, 2017.
  • [64] World Health Organization Regional Office for Europe, Environmental Noise Guidelines for the European Region.   Copenhagen: The Regional Office for Europe of the World Health Organization, 2018.
  • [65] J. You and J. Y. Jeon, “Sound-masking technique for combined noise exposure in open public spaces,” in Proceedings of ICBEN 2008, 2008.
  • [66] N. Pieretti and A. Farina, “Application of a recently introduced index for acoustic complexity to an avian soundscape with traffic noise,” Journal of the Acoustical Society of America, vol. 134, no. 1, pp. 891–900, 2013.
  • [67] X. Lu, J. Tang, P. Zhu, F. Guo, J. Cai, and H. Zhang, “Spatial variations in pedestrian soundscape evaluation of traffic noise,” Environmental Impact Assessment Review, vol. 83, 2020.
  • [68] C. Xu and J. Kang, “Soundscape evaluation: Binaural or monaural?” Journal of the Acoustical Society of America, vol. 145, no. 5, pp. 3208–3217, 2019.
  • [69] M. S. Engel, A. Fiebig, C. Pfaffenbach, and J. Fels, “A Review of the Use of Psychoacoustic Indicators on Soundscape Studies,” Current Pollution Reports, vol. 7, no. 3, pp. 359–378, 2021.
  • [70] J. Y. Hong, et al., “Effects of adding natural sounds to urban noises on the perceived loudness of noise and soundscape quality,” Science of the Total Environment, vol. 711, 2020.
  • [71] German Institute for Standardization, DIN 45692: Measurement technique for the simulation of the auditory sensation of sharpness.   Beuth Verlag GmbH, 2009.
  • [72] International Organization for Standardization, ISO 532-1: Acoustics - Methods for calculating loudness - Part 1: Zwicker method, Geneva, 2014.
  • [73] H. Fastl and E. Zwicker, Psychoacoustics - Facts and Models, T. S. Huang, M. R. Schroeder, and T. Kohonen, Eds.   Springer, 2001.
  • [74] Ecma International, ECMA-418-2:2020 - Psychoacoustic metrics for ITT equipment - Part 2 (models based on human perception), 1st ed., Geneva, Switzerland, 2020.
  • [75] ——, ECMA-74 - Acoustics - Measurement of airborne noise emitted by information technology and telecommunications equipment, 19th ed., Geneva, Switzerland, 2021.
  • [76] International Organization for Standardization, ISO 1996-1:2016 Acoustics — Description , measurement and assessment of environmental noise — Part 1: Basic quantities and assessment procedures.   Geneva: International Organization for Standardization, 2016.
  • [77] R. San Millán-Castillo, E. Latorre-Iglesias, D. Jiménez-Caminero, J. M. Álvarez-Jimeno, M. Glesser, and S. Wanty, “MOSQITO: An open-source and free toolbox for sound quality metrics in the industry and education,” in Proceedings of Inter-Noise 2021, 2021.
  • [78] C. Flowers, F. M. Le Tourneau, N. Merchant, B. Heidorn, R. Ferriere, and J. Harwood, “Looking for the -scape in the sound: Discriminating soundscapes categories in the Sonoran Desert using indices and clustering,” Ecological Indicators, vol. 127, 2021.
  • [79] T. Kohonen, Self-organizing maps.   Springer-Verlag Berlin Heidelberg, 2001.
  • [80] K. Ooi, Y. Xie, B. Lam, and W. S. Gan, “Automation of binaural headphone audio calibration on an artificial head,” MethodsX, vol. 8, no. February, pp. 1–12, 2021.
  • [81] J. Abeßer, “USM-SED - A Dataset for Polyphonic Sound Event Detection in Urban Sound Monitoring Scenarios,” 2021. [Online]. Available: http://arxiv.org/abs/2105.02592
  • [82] J. J. Walker, L. M. Cleveland, J. L. Davis, and J. S. Seales, “Audiometry screening and interpretation,” American Family Physician, vol. 87, no. 1, pp. 41–47, 2013.
  • [83] G. M. Echevarria Sanchez, T. Van Renterghem, K. Sun, B. De Coensel, and D. Botteldooren, “Using Virtual Reality for assessing the role of noise in the audio-visual design of an urban public space,” Landscape and Urban Planning, vol. 167, pp. 98–107, 2017.
  • [84] M. Wang, Y. Ai, Y. Han, Z. Fan, P. Shi, and H. Wang, “Extended high-frequency audiometry in healthy adults with different age groups,” Journal of Otolaryngology - Head and Neck Surgery, vol. 50, no. 1, pp. 1–6, 2021.
  • [85] M. Andrew, T. Oberman, F. Aletta, M. Kachlicka, M. Lionello, M. Erfanian, and J. Kang, “Investigating urban soundscapes of the COVID-19 lockdown: A predictive soundscape modeling approach,” The Journal of the Acoustical Society of America, vol. 150, no. 6, pp. 4474–4488, 2021.
  • [86] International Organization for Standardization, ISO 12913-1:2014 - Acoustics - Soundscape - Part 1: Definition and conceptual framework.   Geneva, Switzerland: International Organization for Standardization, 2014.
  • [87] N. D. Weinstein, “Individual differences in reactions to noise: A longitudinal study in a college dormitory,” Journal of Applied Psychology, vol. 63, no. 4, pp. 458–466, 1978.
  • [88] S. Cohen, T. Kamarck, and R. Mermelstein, “A Global Measure of Perceived Stress,” Journal of Health and Social Behavior, vol. 24, no. 4, pp. 385–396, 1983.
  • [89] WHO Regional Office for Europe, Wellbeing measures in primary health care, Cophenhagen, Denmark, 1998.
  • [90] D. Watson, L. A. Clark, and A. Tellegan, “Development and Validation of Brief Measures of Positive and Negative Affect: The PANAS Scales,” Journal of Personality and Social Psychology, vol. 54, no. 6, pp. 1063–1070, 1988.
  • [91] F. Aletta, et al., “The relationship between noise sensitivity and soundscape appraisal of care professionals in their work environment: a case study in Nursing Homes in Flanders, Belgium,” in Proceedings of Euro-Noise 2018, 2018.
  • [92] E. Ratcliffe, “Sound and Soundscape in Restorative Natural Environments: A Narrative Literature Review.” Frontiers in Psychology, vol. 12, p. 570563, 2021.
  • [93] M. Masullo, et al., “A questionnaire investigating the emotional salience of sounds,” Applied Acoustics, vol. 182, p. 108281, 2021.
  • [94] S. Cohen and G. Williamson, “Perceived stress in a probability sample of the United States,” The Social Psychology of Health, vol. 13, pp. 31–67, 1988.
  • [95] A. Mitchell, F. Aletta, and J. Kang, “How to analyse and represent quantitative soundscape data,” JASA Express Letters, vol. 2, p. 037201, 2022.
  • [96] R. Guski, D. Schreckenberg, and R. Schuemer, “A systematic review on environmental noise and annoyance,” International Journal of Environmental Research and Public Health, vol. 14, no. 12, pp. 1–39, 2017.
  • [97] A. Hong, B. Kim, and M. Widener, “Noise and the city: Leveraging crowdsourced big data to examine the spatio-temporal relationship between urban development and noise annoyance,” Environment and Planning B: Urban Analytics and City Science, vol. 47, no. 7, pp. 1201–1218, 2020.
  • [98] T. Van Renterghem and D. Botteldooren, “Effect of a row of trees behind noise barriers in wind,” Acta Acustica united with Acustica, vol. 88, no. 6, pp. 869–878, 2002.
  • [99] W. Yang and J. Kang, “Acoustic comfort evaluation in urban open public spaces,” Applied Acoustics, vol. 66, no. 2, pp. 211–229, 2005.
  • [100] X. Fang, et al., “Soundscape Perceptions and Preferences for Different Groups of Users in Urban Recreational Forest Parks,” Forests, vol. 12, no. 4, p. 468, 2021.
  • [101] L. J. Cronbach, “Coefficient alpha and the internal structure of tests,” Psychometrika, vol. 16, no. 3, pp. 297–334, 1951.
  • [102] R. P. McDonald, Test Theory: A Unified Treatment.   Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc., 1999.
  • [103] K. S. Taber, “The Use of Cronbach’s Alpha When Developing and Reporting Research Instruments in Science Education,” Research in Science Education, vol. 48, no. 6, pp. 1273–1296, 2018.
  • [104] D. Worthington, “Weinstein Noise Sensitivity Scale (WNSS),” in The Sourcebook of Listening Research: Methodology and Measures, 1st ed., D. Worthington and G. Bodie, Eds.   John Wiley and Sons Ltd, 2018, pp. 475–481.
  • [105] K. Ooi, K. N. Watcharasupat, B. Lam, Z.-T. Ong, and W.-S. Gan, “Probably Pleasant? A Neural-Probabilistic Approach to Automatic Masker Selection for Urban Soundscape Augmentation,” in Proceedings of IEEE ICASSP 2022, 2022, p. 5.
  • [106] K. N. Watcharasupat, K. Ooi, B. Lam, T. Wong, Z.-T. Ong, and W.-S. Gan, “Autonomous In-Situ Soundscape Augmentation via Joint Selection of Masker and Gain,” pp. 1–5, 2022. [Online]. Available: http://arxiv.org/abs/2204.13883
  • [107] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society. Series B: Statistical Methodology, vol. 67, no. 5, p. 768, 2005.
  • [108] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proceedings of ICLR 2015, 2015, pp. 1–15.
  • [109] K. Ma, C. Mak, and H. Wong, “Effects of environmental sound quality on soundscape preference in a public urban space,” Applied Acoustics, vol. 171, 2021.
  • [110] G. R. Kidd and C. S. Watson, “The perceptual dimensionality of environmental sounds,” Noise Control Engineering Journal, vol. 51, no. 4, pp. 216–231, 2003.
[Uncaptioned image] Kenneth Ooi (S’21) received the B.Sc. (Hons.) degree in mathematical sciences from Nanyang Technological University (NTU), Singapore, in 2019, and received the Lee Kuan Yew Gold Medal as the top graduand of his cohort. He is currently pursuing the Ph.D. degree in electrical engineering in NTU under the supervision of Prof. Woon-Seng Gan. His research interests include deep learning for acoustic scene and event classification, as well as corresponding applications to the field of soundscape research.
[Uncaptioned image] Zhen-Ting Ong received the B.Eng. degree in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore, in 2016. He is currently a Research Engineer at the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. Since 2016, he has published more than 30 works on soundscape engineering in international conferences and journals.
[Uncaptioned image] Karn N. Watcharasupat (S’19-GS’22) received the B.Eng. (Hons.) degree in electrical and electronic engineering, from Nanyang Technological University (NTU), Singapore, in 2022. She was a recipient of the Nanyang Scholarship (CN Yang Scholars Programme) and the Lee Kuan Yew Gold Medal for the Class of 2022. She is currently pursuing a Ph.D. in music technology at the Music Informatics Group, Center for Music Technology (GTCMT), Georgia Institute of Technology, Atlanta, GA, USA, where she was also a visiting student (2020; 2021-2022 remote). At NTU, she was with the Media Technology Laboratory (2018-2020), and later the Digital Signal Processing Laboratory (2020-2022). Her research interests are in signal processing, machine learning, and artificial intelligence for music and audio applications. She is a co-inventor of two patent applications and has published more than 20 papers in international conferences and journals on music information retrieval, soundscapes, spatial audio, speech enhancement, and blind source separation. She currently serves as the Treasurer for Women in Music Technology at Georgia Tech, and was a tech volunteer for the 22nd International Society for Music Information Retrieval Conference (ISMIR). She also serves as a reviewer for the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP; 2022); the EURASIP Journal on Audio, Speech, and Music Processing (2022-); Digital Signal Processing (2022-); and Applied Acoustics (2022-).
[Uncaptioned image] Bhan Lam (S’18-M’19) received the B.Eng. (Hons.) and Ph.D. degrees both from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2013 and 2019, respectively. He was awarded the NTU Research Scholarship and EEE Graduate Award to undertake his PhD under the supervision of Prof. Woon-Seng Gan. In 2015, he was a visiting postgrad in the signal processing and control group at the Institute of Sound and Vibration Research, University of Southampton, UK. He was an invited representative at the 2020 Global Young Scientist Summit and an invited tutorial speaker at APSIPA ASC 2020. He is currently a Research Assistant Professor at the School of Electrical and Electronic Engineering, NTU, Singapore. He has authored more than 60 refereed journal articles and conference papers in the areas of acoustics, soundscape, and signal processing for active control. His work on anti-noise windows has been patented (WO2022055432) and recognised as a top-100 paper in Nature Scientific Reports in 2020. He is currently a guest editor with MDPI Sustainability and served as a special session co-chair in the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). He was appointed by the Singapore Standards Council as the National Mirror Committee Chair of ISO TC43/SC1/WG54 on “Perceptual assessment of soundscape quality”. His current research interests include active noise control, soundscape, and signal processing for active control.
[Uncaptioned image] Joo Young Hong received the B.Sc. degree in Architectural Engineering and Ph.D. degree in Architectural Acoustics from Hanyang University, Seoul, Korea in 2010 and 2015, respectively. He was invited as one of the Korean participants to the Global Young Scientist Summit @one-north 2015, Singapore. He also won the Best Paper prize at the INTER-NOISE conference in San Francisco, USA, 2015, sponsored by the International Institute of Noise Control Engineering. He was also invited as a plenary speaker at the Urban Noise Symposium, 2016, Shanghai to talk about the soundscape design approach. In 2017, he was a recipient of the prestigious Lee Kuan Yew Post-Doctoral Fellowship at the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. In 2020, he was an Assistant Professor in Architecture and Sustainable Design at the Singapore University of Technology and Design. Since 2021, he has been an Assistant Professor at the Department of Architectural Engineering, Chungnam National University, Korea. His research interests fall under the umbrella of soundscape research, which is a new paradigm that emphasizes a holistic perspective of perceived acoustic environments in a given context. He has investigated relationships between physical acoustic phenomena and human auditory perception in indoor and outdoor environments through multidisciplinary approaches from acoustics, psychology, architecture, and urban planning. He has been involved as an assistant acoustic consultant in several practical auditorium design projects. He has published 28 peer-reviewed articles in international journals and 2 invited book chapters. He has also served as a technical committee member of Korean Industrial Standards (KS) on building and environmental noises.
[Uncaptioned image] Woon-Seng Gan (S’90–M’93-SM’00) received his B.Eng. (1st Class Hons.) and Ph.D. degrees, both in electrical and electronic engineering from the University of Strathclyde, United Kingdom, in 1989 and 1993 respectively. He is currently a Professor of Audio Engineering and the Director of the Smart Nation Lab in the School of Electrical and Electronic Engineering in Nanyang Technological University (NTU). He also served as the Head of the Information Engineering Division in the School of Electrical and Electronic Engineering in NTU (2011-2014), and the Director of the Centre for Infocomm Technology (2016-2019). His research concerns the connections between the physical world, signal processing and sound control, which has resulted in the practical demonstration and licensing of spatial audio algorithms, directional sound beam, and active noise control for headphones and open windows. Prof. Gan has published more than 400 international refereed journals and conferences, and has translated his research into 6 granted patents. He has co-authored three books on Subband Adaptive Filtering: Theory and Implementation (John Wiley, 2009); Embedded Signal Processing with the Micro Signal Architecture, (Wiley-IEEE, 2007); and Digital Signal Processors: Architectures, Implementations, and Applications (Prentice Hall, 2005). In 2017, he won the APSIPA Sadaoki Furui Prize Paper Award. He is a Fellow of the Audio Engineering Society (AES), a Fellow of the Institute of Engineering and Technology (IET), and a Senior Member of the IEEE. He served as an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP; 2012-15) and was presented with an Outstanding TASLP Editorial Board Service Award in 2016. He also served as the Associate Editor for the IEEE Signal Processing Letters (2015-19). He is currently serving as a Senior Area Editor of the IEEE Signal Processing Letters (2019-); Associate Technical Editor of the Journal of Audio Engineering Society (JAES; 2013-); Editorial member of the Asia Pacific Signal and Information Processing Association (APSIPA; 2011-) Transaction on Signal and Information Processing; Associate Editor of the EURASIP Journal on Audio, Speech and Music Processing (2007-). He served as the Technical Program Chair of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), held in Singapore.

Appendix A Source Files for ARAUS Dataset

This appendix contains details of the exact source files used to generate the augmented soundscapes in the ARAUS dataset. Table A.1 shows the breakdown (by fold) of the soundscape recordings from the Urban Soundscapes of the World (USotW) database used as base soundscapes for the five-fold cross-validation set of the ARAUS dataset. The first half of R0091 was used as the “pre-experiment”, “attention”, and “post-experiment” stimulus, so R0091 is not considered to be part of any fold in the ARAUS dataset. Table A.2 shows the breakdown (by fold) of the source recordings from Freesound and Xeno-Canto used as maskers for both the five-fold cross-validation set and independent test set. Panoramic photos and GPS coordinates of the locations recorded for the independent test set of the ARAUS dataset are shown in Fig. A.1.

TABLE A.1: List of Urban Soundscapes of the World (USotW) recordings used for the cross-validation set of the ARAUS dataset. Recordings are denoted using their index numbers in the USotW database. Single asterisks (*) and daggers (†) respectively denote recordings whose first and second halves were omitted from the ARAUS dataset. As of publication time, R0021, R0077, R0086, R0093, R0100, and R0102 are not listed in the USotW database.
Fold USotW identifiers
Fold 1 R0005 R0011 R0017 R0018 R0027 R0046 R0048 R0055 R0056 R0071 R0079 R0080
R0081 R0084 R0089 R0092 R0095 R0111 R0115 R0116 R0122 R0124 R0126 R0129
Fold 2 R0001 R0006 R0016 R0020 R0025 R0026 R0033 R0037 R0040 R0044 R0052 R0053
R0067 R0076 R0078 R0082 R0088 R0103 R0106 R0107 R0110 R0117 R0123 R0133
Fold 3 R0009 R0012 R0024 R0028 R0029 R0032 R0034 R0041 R0042 R0045 R0051 R0057
R0059 R0063 R0065 R0070 R0085 R0094 R0104 R0109 R0114 R0118 R0120 R0132
Fold 4 R0002 R0003 R0004 R0013 R0015 R0019 R0022 R0031 R0039 R0043 R0047 R0050
R0054 R0061 R0072 R0074 R0087 R0097 R0098 R0099 R0112 R0113 R0127 R0131
Fold 5 R0007 R0008 R0010 R0023 R0030 R0035 R0036 R0038 R0049 R0058 R0060 R0062
R0064 R0069 R0073 R0075 R0090 R0096 R0105 R0108 R0119 R0121 R0128 R0130
TABLE A.2: List of tracks from Freesound (denoted as “FS”) and Xeno-canto (denoted as “XC”) used as maskers in the ARAUS dataset. Numbers after “FS” and “XC” denote the index numbers of the tracks in the Freesound and Xeno-canto databases, respectively. Within each row, the bird and water classes each span two columns.
Masker track identifier
Fold Bird Construction Traffic Water Wind
Test XC568124 XC640568 FS586168 FS587219 FS587000 FS587759 FS587205
1 XC109203 XC482053 FS218748 FS84646 FS202915 FS463265 FS181255
XC134886 XC503057 FS246171 FS235531 FS345649 FS516934 FS244942
XC184374 XC518767 FS400991 FS243720 FS376801 FS541717 FS403051
XC185560 XC518843 FS421050 FS330288 FS410927 FS547136 FS444921
XC311306 XC556166 FS455683 FS337095 FS412308 FS547892 FS454373
XC370500 XC571000 FS553476 FS426886 FS415151 FS548476 FS483076
XC419391 XC612865 FS555037 FS454864 FS433589 FS550930 FS546527
XC470089 XC613796 FS555039 FS504138 FS450755 FS553051 FS548173
2 FS257445 XC477488 FS74505 FS67704 FS260056 FS441152 FS144083
FS317450 XC481933 FS134896 FS77016 FS336848 FS457565 FS185070
XC85417 XC509271 FS193351 FS160015 FS346641 FS459983 FS242064
XC133059 XC537855 FS194866 FS191350 FS365915 FS533932 FS422579
XC301810 XC553169 FS522584 FS322231 FS400402 FS534124 FS423800
XC332810 XC562095 FS553475 FS370009 FS401277 FS544183 FS444852
XC370485 XC600722 FS555025 FS433869 FS411509 FS550756 FS475448
XC420908 XC608496 FS555040 FS448092 FS423622 FS552457 FS540192
3 FS478637 XC566219 FS171400 FS93329 FS56771 FS388706 FS62056
XC122469 XC575300 FS289478 FS118046 FS169181 FS414762 FS73714
XC137556 XC591803 FS335963 FS160002 FS185062 FS415027 FS346106
XC140239 XC598469 FS361777 FS173154 FS243629 FS436813 FS397947
XC350943 XC601752 FS376405 FS336649 FS249485 FS537986 FS405561
XC485841 XC602571 FS383442 FS394689 FS256009 FS543579 FS423914
XC552316 XC602677 FS545183 FS439218 FS329680 FS546236 FS438857
XC554222 XC603552 FS555038 FS546474 FS384276 FS551989 FS491296
4 XC50850 XC480035 FS62394 FS65807 FS260980 FS478439 FS104875
XC112734 XC570338 FS149777 FS84645 FS264598 FS511098 FS131032
XC368749 XC579905 FS169098 FS149814 FS396056 FS542260 FS180025
XC375911 XC580986 FS170873 FS252216 FS438925 FS544838 FS182837
XC445899 XC600707 FS173106 FS274830 FS448766 FS546107 FS211820
XC449522 XC601772 FS192176 FS434302 FS454283 FS546804 FS345681
XC457222 XC604251 FS483556 FS463636 FS459850 FS548381 FS457318
XC467315 XC611766 FS555035 FS512142 FS460441 FS553159 FS546526
5 XC203258 XC489739 FS112637 FS23273 FS62395 FS489073 FS84111
XC242969 XC505573 FS118042 FS110309 FS167034 FS536223 FS104952
XC255139 XC545465 FS328141 FS180156 FS169250 FS547012 FS441866
XC348370 XC555184 FS362068 FS259629 FS261445 FS547232 FS469280
XC376468 XC578478 FS384837 FS261344 FS352902 FS547720 FS533930
XC450847 XC600292 FS488828 FS262830 FS365919 FS549929 FS546759
XC475279 XC604437 FS555026 FS503456 FS376709 FS552691 FS548235
XC482274 XC614063 FS555034 FS551451 FS469009 FS553135 FS551318
Figure A.1: Panoramic photographs of locations where recordings for test set of ARAUS dataset were made, with coordinates specified as (latitude, longitude) pairs. From top to bottom: (a) A road facing a construction site (1.342123, 103.683790), (b) A gazebo in a park (1.342780, 103.684824), (c) A walkway facing a lake (1.346492, 103.687056), (d) A walkway facing a crowded canteen (1.342259, 103.682418), (e) A path facing a lake (1.344954, 103.684667), (f) A path facing a lake with an aircraft flying overhead (1.344954, 103.684667).
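The augmented soundscapes in the ARAUS dataset were generated by digitally adding the maskers listed in Table A.2 to the base soundscapes listed in Table A.1 at fixed soundscape-to-masker ratios (SMRs). As a rough illustration of this process, the sketch below mixes a masker into a base soundscape; it assumes the SMR is the difference between the RMS levels of the soundscape and the scaled masker in decibels, and uses placeholder file names, so it is only an approximation of the calibrated procedure used to generate the dataset.

```python
import numpy as np
import soundfile as sf


def _rms(x):
    return np.sqrt(np.mean(np.square(x)))


def mix_at_smr(soundscape_path, masker_path, smr_db):
    """Illustrative mixing of a masker into a base soundscape at a given
    soundscape-to-masker ratio (SMR), taken here as the difference between
    the RMS levels of the soundscape and the scaled masker in dB."""
    s, fs = sf.read(soundscape_path)
    m, fs_m = sf.read(masker_path)
    assert fs == fs_m, "resample the masker first if the sample rates differ"

    # Assumes both files share the same channel layout; truncate the masker
    # to the length of the base soundscape.
    m = m[: len(s)]
    gain = _rms(s) / (_rms(m) * 10 ** (smr_db / 20))  # place masker smr_db below soundscape
    return s + gain * m


# Example call (placeholder file names, not actual ARAUS files):
# augmented = mix_at_smr("R0005.wav", "bird_XC109203.wav", smr_db=6)
```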

Appendix B Participant Information Questionnaire

This appendix lists all the items in the participant information questionnaire (PIQ), which contains the demographic and auxiliary information that we collected from the participants who contributed their responses to the ARAUS dataset.

The questionnaire was administered to all participants via a digital form. All questions were multiple-choice questions except for Question 2(a), which asked for the age of the participant and accepted positive integers. The possible options presented to the participants, together with the numerical values used to code their responses in the dataset, are shown in square brackets. Yes/no questions are indicated by [Y/N] at the end of the questions; “No” is coded as 0 and “Yes” is coded as 1.
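As an illustration of the coding scheme described above, the sketch below converts one participant’s raw answers into numerical codes; the dictionary keys and the helper function are hypothetical and do not correspond to the actual column names of the ARAUS dataset.

```python
import random

# Hypothetical raw responses from one participant (keys are illustrative only).
raw_response = {
    "speaks_other_language": "Yes",       # Question 1(a), [Y/N]
    "gender": "Other/prefer not to say",  # Question 2(b)
    "ethnic_group": "Chinese",            # Question 2(c)
    "annoyance_past_12_months": 7,        # Question 3, 11-point Likert [0-10]
}

YES_NO = {"No": 0, "Yes": 1}
ETHNIC_GROUP = {"Other": 0, "Chinese": 1, "Malay": 2, "Indian": 3}


def code_gender(answer):
    """Code gender as in Question 2(b): Male -> 0, Female -> 1.
    'Other/prefer not to say' is randomly coded as 0 or 1 with equal
    probability, as described in the note to Question 2(b)."""
    if answer == "Male":
        return 0
    if answer == "Female":
        return 1
    return random.randint(0, 1)


coded_response = {
    "speaks_other_language": YES_NO[raw_response["speaks_other_language"]],
    "gender": code_gender(raw_response["gender"]),
    "ethnic_group": ETHNIC_GROUP[raw_response["ethnic_group"]],
    "annoyance_past_12_months": raw_response["annoyance_past_12_months"],
}
print(coded_response)
```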

1. Spoken languages

   (a) Do you speak fluently in any languages/dialects other than English? [Y/N]
       If your response is “No”, please go to Question 2.

   (b) Is English your first or native language? [Y/N]

   (c) Among the languages/dialects you speak, would you consider yourself to be most fluent in English? [Y/N]

2. Demographic information

   (a) What is your age?

   (b) What is your gender?
       [0] Male [1] Female [0 or 1] Other/prefer not to say
       Note: Because only a very small number of participants chose this option, creating a risk of identification, responses of “Other/prefer not to say” were randomly coded as 0 or 1, each with a probability of 50%.

   (c) What is your ethnic group?
       [0] Other [1] Chinese [2] Malay [3] Indian

   (d) What is the highest level of education you have completed?
       [0] Other [1] No qualification [2] Primary (PSLE), elementary school or equivalent [3] Secondary (GCE ‘N’ & ‘O’ level), middle school or equivalent [4] Institute of Technical Education or equivalent [5] Junior College (‘A’ level), high school or equivalent [6] Polytechnic and Arts Institution (Diploma level) or equivalent [7] University (Bachelor’s Degree) or equivalent [8] University (Master’s Degree) or equivalent [9] University (PhD)

   (e) What is your occupational status?
       [0] Other [1] Student [2] Employed [3] Retired [4] Unemployed
       If your response is “Student”, please go to Question 2(g).

   (f) What is the highest level of education you are currently undergoing?
       [0] Other [2] Primary (PSLE), elementary school or equivalent [3] Secondary (GCE ‘N’ & ‘O’ level), middle school or equivalent [4] Institute of Technical Education or equivalent [5] Junior College (‘A’ level), high school or equivalent [6] Polytechnic and Arts Institution (Diploma level) or equivalent [7] University (Bachelor’s Degree) or equivalent [8] University (Master’s Degree) or equivalent [9] University (PhD)

   (g) What dwelling type is your current main residence in Singapore?
       [0] Other [1] Housing Development Board (HDB) flat or other public apartment [2] Hall of Residence or other student dormitory [3] Landed property [4] Condominium or other private apartment

   (h) Are you a Singapore citizen? [Y/N]

   (i) Have you resided in Singapore for more than 10 years? [Y/N]

3. How much has indoor/outdoor noise bothered, disturbed, or annoyed you over the past 12 months?

   Not at all [0] · 11-point Likert · [10] Extremely

4. How would you describe your satisfaction with the overall quality of the acoustic environment in Singapore?

   Extremely dissatisfied [0] · 11-point Likert · [10] Extremely satisfied

5. Shortened Weinstein Noise Sensitivity Scale (WNSS-10): Below are a number of statements addressing individual reactions to noise. After reading each statement, please select the option that best represents your level of agreement with the statement.
   Note: Items marked with single asterisks (*) were reverse coded; in other words, “Strongly disagree” was coded as “5” and “Strongly agree” was coded as “1” for these items. The asterisks were not shown in the digital form presented to the participants. A scoring sketch illustrating the reverse coding is given at the end of this appendix.

   [1] Strongly disagree [2] Disagree [3] Neither agree nor disagree [4] Agree [5] Strongly agree

   (a) I complain once I have run out of patience with a noise.
   (b) I am annoyed even by low noise levels.
   (c) I would not want to live on a noisy street, even if the house was nice.
   (d) I get mad at people who make noise that keeps me from falling asleep or getting work done.
   (e) I cannot fall asleep easily when there is noise.
   (f) I am sensitive to noise.
   (g) I am easily awakened by noise.
   (h) I get used to most noises without much difficulty.*
   (i) I find it hard to relax in a place that’s noisy.
   (j) I’m good at concentrating no matter what is going on around me.*

6. Shortened Perceived Stress Scale (PSS-10): The questions in this scale ask you about your feelings and thoughts during the last month. In each case, you will be asked to indicate how often you felt or thought a certain way. Although some of the questions are similar, there are differences between them and you should treat each one as a separate question. The best approach is to answer each question fairly quickly. That is, don’t try to count up the number of times you felt a particular way, but rather indicate the alternative that seems like a reasonable estimate.

   [0] Never [1] Almost never [2] Sometimes [3] Fairly often [4] Very often

   In the last month, how often have you…
   (a) been upset because of something that happened unexpectedly?
   (b) felt that you were unable to control the important things in your life?
   (c) felt nervous and “stressed”?
   (d) felt confident about your ability to handle your personal problems?
   (e) felt that things were going your way?
   (f) found that you could not cope with all the things that you had to do?
   (g) been able to control irritations in your life?
   (h) felt that you were on top of things?
   (i) been angered because of things that were outside of your control?
   (j) felt difficulties were piling up so high that you could not overcome them?

7. WHO-5 Well-Being Index: For each of the statements below, which is the closest to how you have been feeling over the last two weeks?

   [0] At no time [1] Some of the time [2] Less than half of the time [3] More than half of the time [4] Most of the time [5] All of the time

   (a) I have felt cheerful and in good spirits.
   (b) I have felt calm and relaxed.
   (c) I have felt active and vigorous.
   (d) I woke up feeling fresh and rested.
   (e) My daily life has been filled with things that interest me.

8. Positive and Negative Affect Schedule (PANAS): In the last two weeks, to what extent have you felt this way?
   Note: Items corresponding to Positive Affect (marked with the oplus symbol ⊕) and Negative Affect (marked with the ominus symbol ⊖) were tallied to give separate Positive Affect and Negative Affect scores. The ⊕ and ⊖ symbols were not shown in the digital form presented to the participants.

   [1] Very slightly or not at all [2] A little [3] Moderately [4] Quite a bit [5] Extremely

   (a) Interested⊕ (b) Distressed⊖ (c) Excited⊕ (d) Upset⊖ (e) Strong⊕ (f) Guilty⊖ (g) Scared⊖ (h) Hostile⊖ (i) Enthusiastic⊕ (j) Proud⊕ (k) Irritable⊖ (l) Alert⊕ (m) Ashamed⊖ (n) Inspired⊕ (o) Nervous⊖ (p) Determined⊕ (q) Attentive⊕ (r) Jittery⊖ (s) Active⊕ (t) Afraid⊖
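As noted for Questions 5 and 8 above, the reverse-coded WNSS-10 items and the PANAS items are combined into scale scores (a single WNSS-10 score, and separate Positive and Negative Affect scores). The sketch below shows one way such scoring could be implemented; the function names are illustrative, and the item indices follow the orderings and markings given above.

```python
# Illustrative scoring of the WNSS-10 and PANAS questionnaires described above.
# Item orderings follow Questions 5 and 8; function names are hypothetical.


def score_wnss10(responses):
    """Sum of WNSS-10 items (a)-(j), each on a 1-5 scale.
    Items (h) and (j) are reverse coded, i.e. a response r becomes 6 - r."""
    reverse_coded = {7, 9}  # zero-based indices of items (h) and (j)
    return sum(6 - r if i in reverse_coded else r for i, r in enumerate(responses))


def score_panas(responses):
    """Return (positive_affect, negative_affect) scores from PANAS items (a)-(t),
    each on a 1-5 scale, following the markings in Question 8."""
    positive = {0, 2, 4, 8, 9, 11, 13, 15, 16, 18}  # interested, excited, strong, ...
    negative = {1, 3, 5, 6, 7, 10, 12, 14, 17, 19}  # distressed, upset, guilty, ...
    pa = sum(r for i, r in enumerate(responses) if i in positive)
    na = sum(r for i, r in enumerate(responses) if i in negative)
    return pa, na


# Example: a participant answering "3" to every item.
print(score_wnss10([3] * 10))  # 30
print(score_panas([3] * 20))   # (30, 30)
```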

Appendix C Distributions and Statistical Test Results for PIQ and ARQ

This appendix collates the detailed distributions and statistical test results for the PIQ and ARQ described in Section 4. Fig. C.1 and Fig. C.2 respectively show the distributions of the PIQ items coded as continuous variables (as violin plots) and as categorical variables (as stacked bar plots), by fold in the cross-validation set. Fig. C.3 presents the distributions of the single-value metrics described in Section 3.9, as violin plots by fold in the cross-validation set and test set. Tables C.1 and C.2 respectively consolidate the Kruskal-Wallis and $\chi^2$ test results for the PIQ items coded as continuous and categorical variables, and Table C.3 presents the Kruskal-Wallis test results for the single-value metrics described in Section 3.9.
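The fold-wise comparisons summarized in Tables C.1–C.3 use standard Kruskal-Wallis and $\chi^2$ tests. The sketch below illustrates the corresponding SciPy calls; the file name, column names, and the fold-versus-rest contingency grouping are illustrative assumptions rather than the exact procedure described in Section 4.

```python
import pandas as pd
from scipy.stats import kruskal, chi2_contingency

# Placeholder: a table of per-participant PIQ responses with a "fold" column.
piq = pd.read_csv("participant_info.csv")  # hypothetical file name

# Kruskal-Wallis test for a continuous variable (e.g., age) across folds,
# as in Table C.1.
groups = [g["age"].values for _, g in piq.groupby("fold")]
H, p = kruskal(*groups)
print(f"Age: H = {H:.4f}, p = {p:.4f}")

# Chi-squared test for a categorical variable (e.g., gender) in one fold
# against the remaining folds, in the spirit of Table C.2.
fold_of_interest = 1
contingency = pd.crosstab(piq["fold"] == fold_of_interest, piq["gender"])
chi2, p, dof, _ = chi2_contingency(contingency)
print(f"Gender (fold {fold_of_interest} vs rest): chi2 = {chi2:.4f}, p = {p:.4f}")
```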

Figure C.1: Violin plots of distributions of PIQ responses coded as continuous variables, by fold: (a) Age, (b) Extent annoyed by noise over past 12 months, (c) Satisfaction with overall acoustic environment in Singapore, (d) Score on WNSS-10, (e) Score on PSS-10, (f) Score on WHO-5, (g) Score on PANAS for Positive Affect, (h) Score on PANAS for Negative Affect. Red horizontal dotted lines denote the maximum and minimum values theoretically attainable by the variables. Kruskal-Wallis tests found no significant differences ($p > 0.05$) between distributions of all variables by fold.
TABLE C.1: Summary of results of Kruskal-Wallis tests for PIQ responses coded as continuous variables, by fold. The test statistic for the Kruskal-Wallis test is denoted by $H$ and the corresponding $p$-values are given as $p$. No significant differences ($p > 0.05$) were observed between distributions of all variables by fold, regardless of whether the test set was included as an additional fold.
Cross-validation set Cross-validation set + test set
Variable $H$ $p$ $H$ $p$
(a) Age     1.6519     0.7994      4.4477      0.4869
(b) Extent annoyed by noise over past 12 months     3.2587     0.5155      3.3215      0.6506
(c) Satisfaction of overall acoustic environment in Singapore     4.7949     0.3090      5.0618      0.4084
(d) Score on WNSS-10     2.3861     0.6651      2.4252      0.7877
(e) Score on PSS-10     1.1454     0.8870      2.5283      0.7722
(f) Score on WHO-5     2.7427     0.6018      4.0850      0.5372
(g) Score on PANAS for Positive Affect     2.0961     0.7181      3.2124      0.6673
(h) Score on PANAS for Negative Affect     1.2815     0.8645      1.3094      0.9340
Figure C.2: Stacked bar plots of distributions of PIQ responses coded as categorical variables, by fold: (a) If participant was multilingual, (b) If participant was a native English speaker (N.A. if monolingual), (c) If participant considered themselves most fluent in English (N.A. if monolingual), (d) Gender, (e) Ethnicity, (f) Highest education level obtained, (g) Occupational status, (h) Current education level (N.A. if not student), (i) Dwelling type, (j) If participant was a Singapore citizen, (k) Length of residence in Singapore. Chi-squared tests found no significant differences ($p > 0.05$) between distributions of all variables by fold, except for ethnicity in the test set. Abbreviations in legend entries: Poly. = Polytechnic, Bach. Deg. = Bachelor’s degree, Pub. Apt. = Public apartment, Priv. Pty. = Private property, Priv. Apt. = Private apartment.
TABLE C.2: Summary of results of $\chi^2$-tests for PIQ responses coded as categorical variables, by fold. The test statistic for the $\chi^2$-test is denoted by $\chi^2$ and the corresponding $p$-values are given as $p$. Significant differences ($p < 0.05$) were observed only for the test set in terms of the ethnicity of the participants.
Fold
Test 1 2 3 4 5
Variable $\chi^2$ $p$ $\chi^2$ $p$ $\chi^2$ $p$ $\chi^2$ $p$ $\chi^2$ $p$ $\chi^2$ $p$
(a) If participant was multilingual 0.6114 0.4343 0.2443 0.6211 0.0153 0.9017 0.1870 0.6654 0.0153 0.9017 0.0344 0.8530
(b) If participant was a native English speaker 0.8026 0.6694 0.3707 0.8308 1.9425 0.3786 1.3122 0.5189 0.6725 0.7145 1.6799 0.4317
(c) If participant considered themselves most fluent in English 1.7748 0.4117 1.3885 0.4995 0.6943 0.7067 0.4855 0.7845 0.5547 0.7578 0.2225 0.8947
(d) Gender 1.3607 0.2434 0.2630 0.6081 2.5980 0.1070 0.5918 0.4417 0.3435 0.5578 0.5918 0.4417
(e) Ethnicity 11.7723 0.0082 1.0508 0.7890 3.5286 0.3171 0.6204 0.8917 0.9676 0.8091 2.6249 0.4531
(f) Highest education level obtained 3.3766 0.7603 3.8880 0.6918 1.5046 0.9592 3.7658 0.7083 1.7228 0.9433 2.9474 0.8154
(g) Occupational status 0.2967 0.9900 3.0213 0.5543 3.3845 0.4957 5.3639 0.2520 2.8398 0.5850 0.8643 0.9296
(h) Current education level 0.8517 0.9736 2.1174 0.8327 0.6463 0.9858 5.0025 0.4156 2.5056 0.7756 4.2588 0.5128
(i) Dwelling type 1.8017 0.6146 3.1992 0.3619 0.7510 0.8612 2.9759 0.3954 1.5848 0.6628 1.0186 0.7968
(j) If participant was a Singapore citizen 0.4287 0.5126 0.0829 0.7734 1.3259 0.2495 1.3259 0.2495 2.3152 0.1281 0.2435 0.6217
(k) Length of residence in Singapore 0.7960 0.3723 0.0300 0.8626 0.0674 0.7951 1.7997 0.1798 1.0787 0.2990 0.1517 0.6969
TABLE C.3: Summary of results of Kruskal-Wallis tests for consistency checks on ARQ responses, by fold. The test statistic for the Kruskal-Wallis test is denoted by $H$ and the corresponding $p$-values are given as $p$. No significant differences ($p > 0.05$) were observed between distributions of all variables by fold, regardless of whether the test set was included as an additional fold. MAD stands for mean absolute deviation and MSE stands for mean squared error. Except for (a), the MAD and MSE were computed by taking the mean across all stimuli presented to a single participant.

Cross-validation set Cross-validation set + test set
Variable $H$ $p$ $H$ $p$
(a) MAD between ARQ responses on first (“pre-experiment”) and last (“post-experiment”) stimuli 1.7164 0.7877 2.9306 0.7107
(b) MAD between reversed “pleasant” ($6-r_{\text{pl}}$) and “annoying” ($r_{\text{an}}$) 5.0281 0.2844 9.7446 0.0828
(c) MAD between reversed “eventful” ($6-r_{\text{ev}}$) and “uneventful” ($r_{\text{ue}}$) 5.1821 0.2691 5.6611 0.3406
(d) MAD between reversed “calm” ($6-r_{\text{ca}}$) and “chaotic” ($r_{\text{ch}}$) 1.8261 0.7677 5.5218 0.3556
(e) MAD between reversed “vibrant” ($6-r_{\text{vi}}$) and “monotonous” ($r_{\text{mo}}$) 5.3364 0.2545 5.3441 0.3753
(f) MSE between “pleasant” ($r_{\text{pl}}$) and rescaled “ISO Pleasantness” ($3+2P$) 1.3749 0.8485 2.0420 0.8433
(g) MSE between “eventful” ($r_{\text{ev}}$) and rescaled “ISO Eventfulness” ($3+2E$) 4.8070 0.3077 6.9141 0.2271

Figure C.3: Violin plots of distributions of consistency metrics by fold: (a) Mean absolute difference (MAD) between ARQ responses on first (“practice”) and last (“consistency check”) stimuli, (b) MAD between reversed “pleasant” ($6-r_{\text{pl}}$) and “annoying” ($r_{\text{an}}$) responses across all stimuli, (c) MAD between reversed “eventful” ($6-r_{\text{ev}}$) and “uneventful” ($r_{\text{ue}}$) responses across all stimuli, (d) MAD between reversed “calm” ($6-r_{\text{ca}}$) and “chaotic” ($r_{\text{ch}}$) responses across all stimuli, (e) MAD between reversed “vibrant” ($6-r_{\text{vi}}$) and “monotonous” ($r_{\text{mo}}$) responses across all stimuli, (f) Mean squared error (MSE) between “pleasant” ($r_{\text{pl}}$) and rescaled “ISO Pleasantness” ($3+2P$) responses across stimuli, (g) MSE between “eventful” ($r_{\text{ev}}$) and rescaled “ISO Eventfulness” ($3+2E$) responses across stimuli.
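For concreteness, the sketch below shows how the per-participant consistency metrics in Table C.3 and Fig. C.3 could be computed; the arrays of ratings are hypothetical, with ARQ responses on a 1-5 scale and ISO Pleasantness $P$ on $[-1, 1]$ so that $3 + 2P$ lies on the same 1-5 scale.

```python
import numpy as np

# Illustrative per-participant consistency metrics from Table C.3 / Fig. C.3.
# r_pl and r_an are one participant's "pleasant" and "annoying" ratings (1-5)
# across all stimuli; P is the derived ISO Pleasantness on [-1, 1].
r_pl = np.array([4, 5, 3, 2, 4])                # hypothetical ratings
r_an = np.array([2, 1, 3, 4, 2])
P = np.array([0.55, 0.90, 0.10, -0.45, 0.60])   # hypothetical ISO Pleasantness

# (b) MAD between reversed "pleasant" (6 - r_pl) and "annoying" (r_an),
# averaged over all stimuli presented to the participant.
mad_pl_an = np.mean(np.abs((6 - r_pl) - r_an))

# (f) MSE between "pleasant" (r_pl) and rescaled "ISO Pleasantness" (3 + 2P).
mse_pl_iso = np.mean((r_pl - (3 + 2 * P)) ** 2)

print(mad_pl_an, mse_pl_iso)
```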

Appendix D Details of Principal Component Analysis for Fold Allocation

This appendix presents heat maps of the weights given to each input feature when the principal component analysis (PCA) in the fold allocation process described in Section 3.3 was conducted for the set of base urban soundscapes, as well as the sets of maskers in each class. Only the first $k$ principal components for each set are shown in the heat maps, where $k$ is the number of principal components that together explained at least 90% of the observed variance.

Since the PCA was conducted independently for the set of base urban soundscapes and for the set of maskers in each class, the assumption was that each set is best summarized by its own, independent set of principal components. The heat maps are consistent with this, with the weights differing substantially between sets. In addition, weights with relatively high absolute values are clustered in blocks corresponding to each class of psychoacoustic indicators (for example, principal component #2 for the set of soundscape tracks has comparatively high absolute weights for the sharpness-related indicators), indicating that each principal component largely summarizes a group of related features. This is as expected, since the purpose of applying PCA in Section 3.3 was purely dimensionality reduction, i.e., to reduce the number of (redundant) parameters for the subsequent clustering via self-organizing maps.
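A minimal sketch of this component-selection step, using scikit-learn, is shown below; the feature matrix, its dimensions, and the standardization step are placeholder assumptions, with only the 90%-variance criterion taken from the description above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative component selection for one set of tracks (e.g., the base urban
# soundscapes): keep the first k principal components that together explain at
# least 90% of the observed variance.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 20))  # placeholder acoustic/psychoacoustic features

X = StandardScaler().fit_transform(features)
pca = PCA().fit(X)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative_variance, 0.90) + 1)

reduced = pca.transform(X)[:, :k]  # input to the subsequent clustering step
weights = pca.components_[:k]      # (k, n_features) weights shown in the heat maps
print(k, reduced.shape, weights.shape)
```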

Figure D.1: Heat maps of (top) actual values, (bottom) absolute values of the first 9 principal components of the base urban soundscape recordings.
Figure D.2: Heat maps of (top) actual values, (bottom) absolute values of the first 13 principal components of the maskers in the bird class.
Figure D.3: Heat maps of (top) actual values, (bottom) absolute values of the first 9 principal components of the maskers in the construction class.
Figure D.4: Heat maps of (top) actual values, (bottom) absolute values of the first 8 principal components of the maskers in the traffic class.
Figure D.5: Heat maps of (top) actual values, (bottom) absolute values of the first 8 principal components of the maskers in the water class.
Figure D.6: Heat maps of (top) actual values, (bottom) absolute values of the first 6 principal components of the maskers in the wind class.