1 Introduction
Audio notifications, like ringtones, play a crucial role in how users receive information and are widely employed by modern digital devices, especially mobile phones, to relay time-sensitive updates, including incoming calls, upcoming calendar events, or messages. Presently, most auditory notifications are intentionally designed to be highly noticeable, ensuring users don’t miss them. While this design effectively captures the user’s attention, it can also lead to disruptions in various contexts.
Consider this scenario: a user is engrossed in a book while enjoying their favorite song (Figure 1). In the event of a notification, most digital devices will automatically lower the background music volume and deliver the alert audibly. While this approach guarantees that the user detects the notification, it often proves distracting and disruptive to the user’s music-listening experience. This is particularly true for non-urgent notifications that require a timely but not necessarily immediate response by end users.
Previously, researchers have addressed this challenge through methods like substituting notifications with audio effects [5, 9], integrating music snippets in pre-composed music soundscapes [16, 35], and embedding ringtone sounds in single-timbre music using timbre transfer techniques [53, 54]. While these prior approaches have shown promise in creating more musically integrated notifications, they are subject to two significant limitations. First, they often rely on completely replacing familiar notifications, which can diminish recognizability and user comfort. Second, many methods are customized for predefined musical contexts or require songs to be composed with notifications in mind, limiting their adaptability to the diverse music preferences of today’s users.
In our work, we introduce a new method that seamlessly incorporates audio notifications into users’ musical experiences. Inspired by digital music practices such as disc jockeying and remixing, along with concepts from music information retrieval (MIR), we provide an initial exploration of the parameter design space for auditory manipulations to create music-adaptive audio notifications. These parameters, including beat matching, key matching, and timbre modifications, are designed to facilitate the seamless integration of notifications into musical sequences while allowing for customizable degrees of blending. Our work lays the foundation for future, more exhaustive explorations of the design space of music-adaptive notifications.
To validate these parameters and investigate the user experience of a more music-adaptive approach to delivering audio notifications, we further developed MARingBA, an interactive system enabling real-time manipulation of notifications. MARingBA is designed for content creators and designers of audio notifications. It incorporates a suite of automated mechanisms for extracting music information and serves as a prototype interface for experimenting with and creating music-adaptive notifications using our design space parameters. By combining established techniques from music computing in a novel way, MARingBA enables the rapid exploration of music-adaptive notifications. With its various parameter settings, content creators and designers can quickly define ways of automatically integrating ringtones into multiple songs.
Through two studies, we gather insights into our approach, design space, and system from the perspective of two main stakeholders: (1) content creators responsible for designing audio notifications, and (2) end users who may receive these notifications in the future. Our initial study with six music experts revealed that our design space is highly expressive, enabling them to tailor notification designs to diverse contexts. They were notably able to use MARingBA to blend notifications with multiple songs and accommodate various noticeability and urgency requirements (e.g., designing for casual weather alerts versus work scenarios requiring an immediate response). In a second experiment with end users, we preliminarily evaluated whether our parameters could modulate the noticeability of audio notifications while producing a user experience preferred over standard notification delivery mechanisms.
In summary, we make the following contributions:
• An initial design space exploration of parameters for adapting notifications to a background musical context in a harmonic manner, which also enables the modulation of their noticeability,
• MARingBA, a system that implements our design space parameters for authoring music-adaptive notifications by leveraging a novel combination of techniques from music computing,
• Insights from a study with content creators (n = 6) on the utility of our parameters and system for creating music-adaptive notifications,
• Results from an initial usage study with end users (n = 12) showing that an example set of adaptation parameters yielded notifications that are preferred over a standard volume-fading baseline while exhibiting controllable detection rates.
3 Background
We first provide a summary of background information on several well-established fundamental music concepts relevant to our work. This section can be skipped by knowledgeable readers. For a more comprehensive introduction to music theory and computer music, we refer interested readers to Blatter [10] and Collins [17], respectively. Additionally, we describe the limitations of current audio notification delivery approaches.
3.1 Relevant music concepts
Music concepts relevant to our work include tempo, pitch, note, and key. Throughout the paper, we illustrate these parameters similar to Figure 2. Horizontal bars represent individual notes, played in a specific key (y-axis). The notes follow a certain tempo (i.e., beat), indicated by the grid cells.
Tempo refers to the speed or pace at which a piece of music is performed, typically measured in beats per minute (BPM). This intuitively corresponds to the rate at which people naturally tap their feet when they listen to music.
Pitch refers to a sound’s perceived highness or lowness, which is determined by the frequency of its vibrations and is typically measured in hertz (Hz). Higher frequencies correspond to higher pitches, and lower frequencies correspond to lower pitches. In music composition, sounds of different pitches are referred to as notes.
A key refers to a set of pitches or notes. Songs and musical compositions typically adhere to notes belonging to a single key (e.g., C major). Introducing additional off-key notes typically results in undesirable dissonance (i.e., keys are not in harmony), with notable exceptions in experimental music and jazz, for example, in which dissonance is carefully used as a design element.
Timbre is a broad term used to describe the unique characteristics of a sound that differentiate it from its pitch and is colloquially described as the “quality of a musical note or sound.” It is determined by the combination of overtone frequencies of a sound. An effective way to grasp timbre is by considering, for example, how a guitar and a violin can play the same music at the same pitch and intensity yet sound distinct from each other.
3.2 Audio notifications
On current devices, audio notifications are typically delivered in one of two ways: either they mute whatever the user is listening to or they are directly overlaid. If the user was previously listening to music, the muting approach ensures that the notification is noticed but fully interrupts the user’s listening experience.
Alternatively, if the notification is overlaid, users can still hear the music, albeit quieter if its volume is decreased. The sounds from the two sources, however, may clash and result in an unpleasant experience for the human ear. From a musical perspective, this dissonance can be attributed to several factors, such as misalignments in tempo or key, as illustrated in Figure 2. Most music is composed at a consistent tempo, so introducing a rhythmic sound that doesn’t align with this tempo, e.g., because it is faster or slower, can disturb the listener. Similarly, when the pitch of the notification doesn’t match the music’s key, it will be perceived as out of place. Overall, unless an immediate user response is required, both conventional notification delivery mechanisms—mute and overlay—may lead to a sub-optimal experience, as they do not consider the musical context in which the user is situated.
4 The MARingBA Parameter Space
Our goal is to automatically generate ringtone-music blends that resemble the quality of mixes manually crafted by human mash-up artists. To achieve this, we propose an approach centered around defining an initial set of distinct music feature modification parameters. These parameters are grounded in music theory and inspired by music practices that involve blending multiple musical audio sequences, such as DJing or sampling. Although our design space exploration is not exhaustive, it offers foundational insights into our approach and points toward potential areas that warrant further investigation. In the following, we provide an in-depth description of these parameters, their conceptual implementation, and their role in achieving effective ringtone-music adaptation.
4.1 Beat matching
Beat matching refers to slowing down or speeding up one or both of the clips until their tempos match. This technique is used by DJs and mash-up artists to align the tempo of two different songs and create a synchronized mix that listeners can dance to. In addition, beat matching involves synchronizing the beat onsets so that they align across songs.
Assuming the timestamp of every beat in a piece of music is known, e.g., by using rhythm extraction software [14], we first calculate the average interval between consecutive beats in a song. The average tempo in beats-per-minute (BPM) is then calculated as

BPM = 60 / mean(t_{i+1} − t_i),

where t_i denotes the timestamp of the i-th beat in seconds.
We can then use the tempo estimation to beat-match the notification audio to synchronize with the music by calculating

stretch = BPM_music / BPM_ringtone.
The time stretch amount is then applied to the ringtone to match the tempo of the music. Furthermore, the beat onset of the ringtone is aligned with the beat of the music. Figure 3 illustrates a naive implementation with misaligned beats and a beat-matched version.
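Assuming beat timestamps are available from the rhythm extraction step, the tempo estimate and stretch factor can be sketched in a few lines (the function names are illustrative, not part of MARingBA):

```python
from statistics import mean

def estimate_bpm(beat_times):
    """Average tempo (BPM) from a list of beat onset timestamps in seconds."""
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:])]
    return 60.0 / mean(intervals)  # 60 seconds divided by the mean inter-beat interval

def time_stretch_factor(music_bpm, ringtone_bpm):
    """Factor by which the ringtone must be sped up (>1) or slowed down (<1)."""
    return music_bpm / ringtone_bpm
```

For example, beats half a second apart yield 120 BPM, and matching a 100 BPM ringtone to that song requires a 1.2× speed-up.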
4.2 Key matching
When pitches outside of the music’s key are introduced, dissonant clashes can occur. Key matching, sometimes also referred to as harmonic mixing, ensures that two songs being blended share a compatible key. One simple way to ensure harmony is to shift the pitch (i.e., frequencies) of the key of the ringtone to match the key of the song. In Western music theory, frequencies are split into 12 equidistant notes, which are the main building blocks for all keys. The distance between two consecutive notes is referred to as a semitone and corresponds to a frequency ratio of the twelfth root of two. Semitones are the smallest interval in music and can be used as a metric to define other intervals in terms of the number of semitones between them. Key matching through pitch shifting amounts to moving between notes by shifting their frequencies by the correct number of semitones.
To achieve this, the key-matched frequency is calculated as

f_matched = f_original · 2^(n/12),

where n is the signed number of semitones between the ringtone’s key and the song’s key. As an example, when matching a ringtone in the key of B to a song in the key of C, i.e., a difference of one note, the ringtone should be pitch-shifted upwards by 1 semitone (multiplying the frequency of B by 2^(1/12) results in the frequency of C). As shown in Figure 4, the ringtone does not use pitches that fit into the key of the music. We rectify this by pitch-shifting the ringtone up by one semitone, avoiding dissonant notes and staying in harmony with the music.
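A minimal sketch of this calculation, assuming keys are reduced to their tonic pitch classes (the note table and helper names are ours):

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def semitone_shift(from_key, to_key):
    """Smallest signed shift in semitones (range -5..+6) from one tonic to another."""
    diff = (NOTE_NAMES.index(to_key) - NOTE_NAMES.index(from_key)) % 12
    return diff - 12 if diff > 6 else diff  # prefer the shorter direction

def shift_frequency(freq_hz, semitones):
    """Apply an n-semitone pitch shift: each semitone is a factor of 2**(1/12)."""
    return freq_hz * 2 ** (semitones / 12)
```

Shifting B up by one semitone multiplies its frequency by the twelfth root of two, landing on C (e.g., B4 at ~493.88 Hz becomes C5 at ~523.25 Hz).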
4.3 Scheduling
Conventional notification delivery mechanisms typically alert the user as soon as notifications are received. In a musical context, this may coincide with an undesirable temporal placement where the notification is not aligned with the background rhythm. There are several ways to mitigate this challenge, including delaying playback until the next beat, the next bar, or the start of the next four bars. In music theory, a bar, or measure, is a segment of time that groups multiple beats, usually four in mainstream music. Structurally, music often has repetitions and variations that happen at the start of every four bars, making those positions suitable candidates for blending ringtones naturally into the structure of the music.
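The three scheduling modes can be sketched as a lookup over extracted beat and downbeat timestamps (a simplified stand-in for the actual implementation; names are ours):

```python
import bisect

def next_play_time(now, beats, downbeats, mode="beat"):
    """Next allowed onset for the ringtone: the closest upcoming beat, bar start
    (downbeat), or start of a four-bar group (every fourth downbeat, assuming
    the first downbeat opens a section). Returns None if the song ends first."""
    if mode == "beat":
        slots = beats
    elif mode == "bar":
        slots = downbeats
    else:  # "four_bars"
        slots = downbeats[::4]
    i = bisect.bisect_right(slots, now)  # first slot strictly after `now`
    return slots[i] if i < len(slots) else None
```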
4.4 Panning
Panning refers to how audio is distributed across the channels of a stereo sound system (e.g., headphones). A sound that is panned to the left, for example, is played at a higher volume from the left speaker. Panning can be used in music-adaptive ringtones to make the sound stand out as coming from a different direction than the other sounds in the music mix. In our approach, the ringtone is panned to the side with lower intensity. The song shown in Figure 5, for example, begins with a sequence that is panned to the left side. A ringtone could be panned to the right to counterbalance the difference in volume and balance the stereo sound.
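A minimal sketch of this idea, assuming access to the raw samples of each channel for the current beat (the normalized slider and function name are our own):

```python
def pan_position(left_samples, right_samples, slider=1.0):
    """Pan the ringtone toward the quieter channel. Positive return values pan
    right (left channel louder), negative values pan left; `slider` scales the
    effect, mirroring the 0-100 control normalized to 0..1."""
    def rms(samples):
        return (sum(s * s for s in samples) / len(samples)) ** 0.5
    diff = rms(left_samples) - rms(right_samples)  # > 0: left channel is louder
    return max(-1.0, min(1.0, slider * diff))      # clamp to a stereo pan range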
4.5 Timbre
There exist various ways to manipulate the timbre of the ringtone, including instrument transfer, reverb dry/wet level, reverb decay time, and high-pass and low-pass thresholds.
• Instrument Transfer. Aside from the original sound of each ringtone, we allow the ringtone’s instrument to be replaced with either a piano, violin, or synthesizer, while keeping the same melody and rhythm as the original. Transferring the instrument tone used to play the ringtone can result in a more integrated mix depending on the instrumentation of the song. The original iPhone ringtone, for example, is played on the marimba, a type of mallet percussion with African origins. This ringtone might not blend well into synth-heavy dance tracks and would benefit from being transferred to a synthesizer, for example.
• Reverb. Reverb refers to the reflections of a sound from the environment it is played in. More spacious environments lead to more reverb signals and longer decay times. Adding simulated reverb to ringtones helps with music integration, as reverb makes the sound more natural by simulating its presence in a physical space.
• Frequency filtering. Frequency filtering is the removal of frequency content from a sound that is rich in frequencies. If there were a significant overlap in frequencies between the ringtone and music, the ringtone would be less noticeable. Removing part of the ringtone’s frequencies can benefit ringtone integration because it can “free up” room for the song without significantly impacting the ringtone’s volume.
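As a much simpler stand-in for the high-pass and low-pass mixer effects used in practice, frequency filtering can be illustrated with a first-order low-pass filter:

```python
def one_pole_lowpass(samples, alpha):
    """First-order low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).
    A small alpha removes more high-frequency content from the signal,
    freeing up spectral room for the other source in the mix."""
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)  # smooth toward the current input sample
        out.append(y)
    return out
```

A constant (low-frequency) signal passes through almost unchanged, while a rapidly alternating (high-frequency) signal is strongly attenuated.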
4.6 Volume
Lastly, we can manipulate the volume of the ringtone and music clips in several ways.
• Ringtone volume. The volume of a ringtone directly affects how noticeable it is compared to the background music. Lower-volume ringtones tend to blend more subtly than high-volume ones.
• Fade-in. Fade-in refers to starting at a low volume and gradually transitioning to a stable volume level. This transition eases the introduction of the ringtone and makes integration smoother.
• Adaptive volume. Rather than setting a fixed ringtone volume, we support configuring it to automatically adapt based on the input music: increasing for louder songs and decreasing for softer ones.
• Track-specific attenuation. In addition to adjusting the ringtone volume, we also support making volume adjustments to specific tracks within the background music. For instance, we can temporarily lower the vocals’ volume in a song while maintaining full accompaniment volume. This creates additional room for blending in the ringtone without interrupting the music flow. We note that individual tracks can be acquired either directly from the original mix or automatically extracted using source separation technology (e.g., Spleeter [28]).
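The fade-in and adaptive volume behaviors above can be sketched as simple gain functions (the reference loudness constant is a hypothetical value of ours, not from the system):

```python
def fade_in_gain(elapsed, fade_duration):
    """Linear fade-in: gain ramps from 0 to 1 over `fade_duration` seconds."""
    if fade_duration <= 0:
        return 1.0
    return min(1.0, max(0.0, elapsed / fade_duration))

def adaptive_gain(base_gain, music_rms, reference_rms=0.1):
    """Scale the ringtone gain with the music's loudness: louder music raises
    the ringtone volume, quieter music lowers it (reference_rms is illustrative)."""
    return base_gain * (music_rms / reference_rms)
```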
5 The MARingBA System
The goal of MARingBA is to create music-adaptive ringtones that seamlessly blend with and adjust to any music a user listens to in a contextually sensitive manner. Notification adaptations should account for the ambient musical context and consider various usage scenarios, particularly the urgency of their contents, which arguably determines their desirable level of noticeability. To achieve this, we implemented the adaptation parameters described in section 4. Content creators can utilize these diverse manipulations to specify how and to what degree ringtones should adapt to and blend with any given song, consequently influencing their noticeability. We demonstrate this in our initial expert study and validate the designed ringtones in the preliminary user study.
Our current implementation includes features for extracting music information from the ringtone and background music, and for performing real-time notification feature manipulation. An overview of the system is illustrated in Figure 6. Apart from information extraction and audio manipulations, MARingBA provides content creators with mechanisms to select target notifications and songs for testing the parameters, playback controls, and manual triggers to simulate the arrival of notifications. Once content creators are satisfied with their parameter configurations, they can save this information as presets. In Table 1, we showcase an example collection of presets gathered for various urgency scenarios. Presets can subsequently be used at run-time to adapt any ringtone to any background music automatically.
Input. The system takes audio files (mp3, wav) of ringtones and user music as inputs. It does not require manual processing or annotation by content creators. One exception is that we manually generated the alternative instrument timbres of the ringtones for our current implementation of timbre transfer. This feature can, however, be automated in future implementations, for example by incorporating learning-based timbre transfer methods [53].
5.1 Music Information Extraction
We pre-processed the input songs and ringtones using Python 3.10 along with several established Music Information Retrieval (MIR) libraries, extracting the musical features on which our adaptation parameter manipulations operate. The pre-processing steps are executed once per song and take about one minute in our current implementation. Once all features are extracted, manipulations are performed in real-time.
We extracted the beat onsets (i.e., the beat start times) using ESSENTIA [14] and downbeats (i.e., the start of bars) using Madmom [12] to enable beat matching and scheduling. To support key matching, we estimated the key of the input audio using ESSENTIA [14]. To support fine-grained volume adjustments, we isolated the input songs into vocal and instrument tracks with Spleeter [28]. Lastly, we computed the left and right channel intensities of the input songs and notifications with Librosa [43] to support panning.
5.2 Real-time Ringtone Feature Manipulation
Our system interface for performing ringtone feature manipulations is implemented in Unity 2021, shown in Figure 7. The individual controls described in section 5 are implemented as part of a Unity Component exposed in the Editor. Unity provided a versatile platform for real-time audio feature processing: leveraging its built-in audio engine, we executed essential sound modifications, including play scheduling, volume attenuation, panning adjustments, and pitch stretching. More advanced features such as frequency filtering, reverb dry/wet level, and decay time control were implemented by automating Unity mixer channel settings via C# scripts.
5.2.1 Beat matching.
MARingBA estimates the average tempo of the song and time-stretches the ringtone to match that tempo based on the beat onset information. Unity does not have a dedicated time stretch feature; any alteration in speed via Unity’s AudioSource.Pitch feature also changes the pitch of the audio clip (e.g., speeding up a track by a factor of 1.2 also raises its pitch by the same factor). To control speed independently of pitch, we first apply the time stretch using AudioSource.Pitch and then counteract the unintended pitch alteration using a separate Unity mixer channel setting that adjusts only pitch without affecting speed. Specifically, we configure the pitch of the mixer channel to the inverse of the time stretch factor, calculated as 1/AudioSource.Pitch. Since some songs that are performed live fluctuate slightly in tempo, we dynamically re-align with the beat onsets every time the ringtone repeats itself. All calculations are automated; content creators can toggle this feature on or off.
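The compensation described above reduces to two reciprocal factors; a minimal sketch of the arithmetic (Unity itself is not involved here, and the function name is ours):

```python
def beat_match_settings(music_bpm, ringtone_bpm):
    """Playback rate for the audio source (changes speed and pitch together)
    and the mixer pitch-shift that cancels the unintended pitch change."""
    rate = music_bpm / ringtone_bpm   # e.g., 1.2 means 20% faster and 20% higher
    pitch_correction = 1.0 / rate     # mixer shifts pitch back by the inverse factor
    return rate, pitch_correction
```

Multiplying the two factors yields exactly 1, i.e., the perceived pitch is restored while the speed change remains.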
5.2.2 Key matching.
Using the key estimation, MARingBA finds the smallest pitch shift required to avoid dissonance and then adjusts the pitch value of the Unity mixer channel to the target pitch. Because ringtones are short melodies rather than fully orchestrated songs, the pitches used in a ringtone may fit into multiple different keys without causing dissonance. We exploit this quality in our implementation of key matching: instead of pitch-shifting to match the exact key the song is in, we iterate through all possible target keys and select the one requiring the smallest pitch shift that avoids dissonance. All calculations are automated; content creators can toggle this feature on or off.
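This search can be sketched over pitch classes 0-11 (C = 0), with the song's key given as a set of in-key pitch classes; the representation is ours, and dissonance is approximated as "outside the scale":

```python
def minimal_key_shift(ringtone_pitch_classes, song_scale):
    """Smallest absolute semitone shift (ties favor shifting down) that moves
    every ringtone pitch class into the song's scale; None if no shift fits."""
    for shift in sorted(range(-6, 7), key=abs):  # try 0, -1, +1, -2, +2, ...
        if all((p + shift) % 12 in song_scale for p in ringtone_pitch_classes):
            return shift
    return None
```

For example, a ringtone using only B and F# fits C major after a single upward semitone shift (becoming C and G), so no larger shift is needed.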
5.2.3 Scheduling.
Our system enables content creators to define whether a ringtone should be scheduled to (1) the closest start of a beat, (2) the start of a bar, or (3) the start of four bars. Timestamps of beats and downbeats (bar starts) are estimated using MIR libraries. We assume that the first downbeat of the song is the start of a section and keep an internal counter to label every fourth downbeat after that as the start of four bars. While this assumption worked well for the approximately two dozen popular songs we tested, it might not hold for all music. We hope to implement techniques to overcome this in the future. Scheduling is automated, and content creators choose between the different modes through a drop-down menu.
5.2.4 Panning.
We leverage a beat-by-beat comparison of root-mean-square (RMS) values between the left and right channels to determine if one has a louder volume. Positive values indicate a higher volume on the left channel, and negative values indicate a higher volume on the right channel. MARingBA pans the ringtone sound using Unity’s AudioSource.panStereo, setting it to the value of the volume difference on the current beat. In our system, content creators control the panning of notifications using a simple slider with values between 0 and 100. The slider value acts as a multiplier on the balance data of the current song. A value of 0 keeps the sound of the ringtone centered. If one side is louder at the time a notification is played, MARingBA plays the notification on the other side, multiplying the difference in RMS value with the slider value to determine the pan position of the ringtone.
5.2.5 Timbre.
In our current implementation, instrument transfer is enabled by pre-rendering the ringtone in different instrument tones (piano, violin, or synthesizer), i.e., through manual transcription by the content creator, and selectively unmuting only one instrument track at a time based on the user-selected instrument. As mentioned above, this manual approach could be automated through learning-based timbre transfer techniques [53]. Reverb and filtering are both based on Unity’s mixer channel effects. The Unity Audio SFX Reverb effect is used for reverb; highpass and lowpass effects are used for frequency filtering. These effects are added to the same mixer channel that the notification is routed to.
In addition to instrument transfer, content creators can control the decay time and dry/wet level of the ringtone. Decay time controls how long it takes for the reflected sound to decrease in intensity, which dictates the perceived space of the reverb simulation. A longer decay time simulates a larger or more reflective space, whereas a shorter decay time simulates a smaller or less reverberant space. Dry/wet level balances the volume between the simulated reflected sound and the original non-reverberant sound.
As a final timbre manipulation, content creators can filter frequencies by setting high-pass and low-pass cutoffs.
5.2.6 Volume.
Notification, accompaniment, and vocals are each routed to their own mixer channel, and the volume parameter of each channel is manipulated separately. Track-specific attenuation settings decrease the volume of specific tracks while the ringtone is played. Content creators can toggle whether vocals and accompaniment are audible when the ringtone is played. We can further control the rate of the gradual fade-in of both the ringtone and the background music by setting a fade-in speed value. When toggled on, the fade-in speed slider (0.01 to 0.2 seconds) controls how long it takes until the ringtone reaches its original volume. The volume of the ringtone is controlled using sliders and can be generalized across different songs by enabling the adaptive volume setting, which boosts the volume of the ringtone when the music is loud.
6 Notification Elicitation Study
We first conducted a study investigating if our manipulation parameters enable the creation of tailored notification adaptations for diverse usage scenarios. We recruited six musicians and music enthusiasts to use MARingBA to integrate two notifications into three songs for three scenarios. Each scenario required a different level of urgency in response, ideally leading to designs with a corresponding attentional draw or noticeability, as suggested in prior research [26]. The goals of this study were to elicit qualitative perspectives on our approach and to collect an initial example set of parameter values for different urgency scenarios (Table 1). The latter served as the basis for our second study with end users (section 7), which preliminarily examined its effects on user behavior.
6.1 Procedure
Participants were asked to create notification adaptation parameters for three scenarios with different levels of urgency:
• Low urgency: A user receives a notification about the temperature on the following day.
• Medium urgency: A user receives a reminder about an upcoming meeting in two hours.
• High urgency: A user receives an email from their supervisor that requires an immediate response.
Participants authored parameters for two notifications for each scenario, resulting in six parameter sets per participant. Each parameter set had to integrate the target notification into three songs per the scenario requirements.
The notifications were randomly selected from six ringtones curated from popular mobile devices and applications (e.g., iPhone marimba ringtone, Skype). While we did not explicitly search for notifications that differed in character, the final set represented a range of tempos (130–180 bpm, M = 158 bpm, SD = 22) and keys (4 represented).
Similarly, the songs were randomly selected from a set of 12 popular songs with at least 52M views on YouTube, spanning a range of years (1967–2020) and genres (e.g., R&B, hip hop, alternative rock, funk). The songs also represented a diversity of tempos (85–161 bpm, M = 126 bpm, SD = 20) and keys (9 represented).
Participants first completed a consent form and a demographic questionnaire for our study. Following this, they were introduced to our system and the adaptation parameters and given time to experiment with the application controls to familiarize themselves. Afterward, the notification scenarios were introduced in a counter-balanced order, and participants were asked to create notification adaptation parameters accordingly. During the tasks, the participants were instructed to follow a think-aloud protocol facilitated by the experimenter. Upon completion of all tasks, we conducted a semi-structured interview to gather their insights on their (1) approach and experience, (2) impressions of the concept of adaptively embedding notifications in music, and (3) suggestions for additional parameters.
6.2 Participants and apparatus
We invited six participants from a local university (6 male, age: M = 26 years, SD = 1). All participants had substantial musical experience (M = 13 years, SD = 5). One participant is a freelance performer, composer, and instrument manufacturer. Two participants had experience DJing and composing electronic music. One participant is a part-time jazz pianist. Two self-reported as music hobbyists. Participants received a $30 gift card as compensation.
The study was conducted using our MARingBA system implemented in a Unity 2021 Editor running on a MacBook Pro (macOS Ventura 13.4, 2.4 GHz Quad-Core Intel Core i5, stereo speakers with high dynamic range). All sessions were audio and video recorded. We recorded all final parameters for each condition.
6.3 Qualitative Feedback
Overall, participants recognized the potential benefits of blending notifications with music, especially in low-urgency and medium-urgency scenarios. They emphasized the importance of achieving a balanced integration, blending the notifications well while ensuring they remain perceivable amidst the music. While exploring various parameters, participants reported that beat matching, key matching, fade-in, and volume control for notifications and music are the most prioritized factors for achieving the desired effect.
While timbre transfer, reverb, and low/high pass filtering were also prioritized by multiple participants, they expressed concern that these parameters may fail to generalize across different songs: while a certain instrument timbre blends well with the instrumentation of one song, it may create vastly different effects when integrated into another, making it hard to control when defining settings for a certain level of urgency. Additionally, participants highlighted the significance of retaining the original timbre for notifications, particularly in high-urgency scenarios, as it was strongly associated with the familiarity of the ringtone.
The study demonstrated that our system provided sufficient coverage of parameters, allowing participants to fine-tune the integration of notifications based on different musical contexts and urgency levels. However, a few participants expressed challenges in generalizing parameters across different songs or sections of the same song, where tempo, key, instrumentation, and volume can vary significantly. This revealed a potential need for more adaptive parameters that can respond to various musical contexts.
Another noteworthy advantage of our system was its added choice and customization options. Before using MARingBA, many participants mentioned that they often turned off notifications altogether for low-urgency scenarios to avoid disruptions. However, with the introduction of the blending notifications approach, content creators were open to enabling notifications for low-urgency situations without the fear of being startled by disruptive sounds. We believe this points to a good balance between staying informed and preserving the listening experience, leading to increased overall satisfaction with the notification system.
7 Initial Usage Study
Results from the notification elicitation study suggest that MARingBA shows promise in enabling the integration of notifications into music with varying degrees of urgency. To further understand the benefits and limitations of adapting notifications to a background musical context, as well as to preliminarily validate that different notification adaptations may have a tangible effect on end-user task performance and experience, we enlisted 12 participants to test notifications while performing a typing task. The notifications were adapted using the parameters obtained from our design study. Our study investigates the following research questions: (RQ1) to what extent do different adaptations modulate noticeability? (RQ2) to what extent do our notification adaptations affect a user’s task performance? (RQ3) what aspects do users like and dislike about our approach to adapting notifications?
7.1 Design
We used a single-variable within-subjects design with four adaptation methods (standard, low urgency, medium urgency, high urgency). Inspired by previous research on interruptions (e.g., [19, 20]), we adopted a dual-task paradigm in which participants performed a primary task of typing while listening and responding to audio notifications manipulated with our adaptation methods as a secondary task. Participants experienced notifications adjusted using each adaptation method twice, each time for a different notification and song (i.e., 4 adaptation methods × 2 notification-song pairs = 8 repetitions). Notifications and songs were randomly selected from the same pool as in the initial design study. We removed two songs: one was shorter than 3 minutes, and one had a section with very drastic tempo changes, which could lead to unexpected adaptation results. We counterbalanced the order of adaptation methods using a Latin square.
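For an even number of conditions, a balanced Latin square gives each condition every ordinal position and has each condition precede every other equally often. As an illustrative sketch (the function and its construction are ours, not part of MARingBA):

```python
def balanced_latin_square(conditions):
    """Return one condition order per row; for an even number of conditions,
    each condition appears once per column and precedes every other equally
    often across rows."""
    n = len(conditions)
    # First row follows the standard 1, 2, n, 3, n-1, ... zigzag pattern.
    first, lo, hi = [0, 1], 2, n - 1
    while len(first) < n:
        first.append(hi)
        hi -= 1
        if len(first) < n:
            first.append(lo)
            lo += 1
    return [[conditions[(i + shift) % n] for i in first] for shift in range(n)]

orders = balanced_latin_square(["standard", "low", "medium", "high"])
# Participant k would follow row k % 4.
```

With four conditions, only four distinct orders are needed; participants beyond the fourth cycle through the rows again.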
7.1.1 Adaptation method. To generate the parameters (Table 1) for the low urgency, medium urgency, and high urgency conditions, we used the mean of the parameter settings generated in our design study (i.e., from the six participants) for continuous values and the mode for categorical values. Note that these parameters do not necessarily represent optimal settings for each condition, but rather a first example set. For the standard condition, we set the notification to mute the background music, following conventional delivery approaches (section 3.2).
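The aggregation rule (mean for continuous parameters, mode for categorical ones) can be sketched as follows; the parameter names and values here are hypothetical examples, not the settings from Table 1:

```python
from statistics import mean, mode

def aggregate_settings(settings, categorical):
    """Collapse per-designer parameter settings into one preset:
    mean for continuous parameters, mode for categorical ones."""
    out = {}
    for name in settings[0]:
        values = [s[name] for s in settings]
        out[name] = mode(values) if name in categorical else mean(values)
    return out

# Hypothetical settings from three designers for a "low urgency" preset.
designers = [
    {"volume_db": -12.0, "beat_match": True,  "timbre": "piano"},
    {"volume_db": -9.0,  "beat_match": True,  "timbre": "piano"},
    {"volume_db": -15.0, "beat_match": False, "timbre": "strings"},
]
preset = aggregate_settings(designers, categorical={"beat_match", "timbre"})
# preset == {"volume_db": -12.0, "beat_match": True, "timbre": "piano"}
```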
7.1.2 Primary task.
Participants were instructed to transcribe articles from Wikipedia as quickly and accurately as possible. The interface is shown in Figure 8. As visual feedback for the task, completed words are highlighted in green, typed letters of the current word are highlighted in yellow, and incorrectly typed characters turn the current letter red. Participants performed the typing task for three minutes in each condition.
7.1.3 Secondary task.
While participants performed the typing task, they were asked to simultaneously monitor for audio notifications. Participants were instructed to click a button (see Figure 8, top right) as soon as they noticed a notification. The button temporarily turned black to provide visual feedback for the response. In each condition, three notifications were delivered at randomized intervals at least 40 seconds apart, with the earliest and latest possible onsets at 0:10 and 2:50 minutes, respectively.
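This schedule (three onsets per three-minute condition, at least 40 s apart, within 0:10-2:50) can be drawn by simple rejection sampling; this is an illustrative sketch, not necessarily how our study software implemented it:

```python
import random

def schedule_notifications(n=3, earliest=10.0, latest=170.0, min_gap=40.0,
                           rng=random):
    """Sample n notification onsets (in seconds) uniformly within
    [earliest, latest], re-drawing until all consecutive onsets are
    at least min_gap seconds apart."""
    while True:
        onsets = sorted(rng.uniform(earliest, latest) for _ in range(n))
        if all(b - a >= min_gap for a, b in zip(onsets, onsets[1:])):
            return onsets

onsets = schedule_notifications(rng=random.Random(7))
```

With these defaults the constraints leave enough slack (two 40 s gaps inside a 160 s window) that the rejection loop terminates after a handful of draws.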
We designed our study for a scenario where end users listen to their personal music collection. We therefore asked all users to rank the familiarity of the twelve songs used in the first expert study and used songs that were familiar to them. We also made sure to cover a wide range of songs, and thus did not always select the most-familiar songs.
7.2 Procedure
After participants completed a consent form and a demographic questionnaire, they were introduced to the study tasks. They then completed the study conditions, which were structured into two blocks. Each block consisted of four repetitions of the dual-task paradigm for a randomly selected song-notification pair, with a different adaptation method applied in each repetition. Participants responded to questionnaires at the end of each condition and ranked the conditions by preference at the end of each block. After completing all conditions, participants reported their overall experience with the different notification delivery mechanisms.
7.3 Participants and apparatus
We recruited twelve participants (7 male, 5 female; age: M = 24.67 years, SD = 3.75) from a local university via convenience sampling. Participants listened to music daily (M = 3 hours, SD = 1) during activities like work (N = 10) and exercise (N = 8), as well as during their commute (N = 10). Participants received a $15 gift card as compensation.
We integrated the typing and notification response tasks into our MARingBA system, implemented in Unity 2021. The study was conducted on a MacBook Pro (macOS Ventura 13.4, 2.4 GHz Quad-Core Intel Core i5) with a pair of AKG K240 Studio over-ear headphones.
7.4 Measures
We captured participants’ primary and secondary task performance and subjective experience as dependent variables.
• Primary task performance: We measured typing errors, i.e., the number of incorrectly typed keys, and resumption lag, i.e., the time between notification detection and resuming a regular typing speed.
• Secondary task performance: We measured reaction time as the elapsed time between the start of the notification and the participant's response, as well as the number of missed notifications.
• Self-reported metrics: At the end of each condition, participants reported their confidence and immediacy in detecting the presented notifications, and characterized the notifications in terms of their noticeability and distraction, all on a scale from 1 (low) to 7 (high). At the end of each block, participants provided a preference ranking of the adaptation methods.
7.5 Results
For effect analysis, ordinal data (questionnaire ratings and rankings) were analyzed using Friedman tests, with Wilcoxon signed-rank tests for post-hoc analysis when needed. Interval data (typing errors, resumption lag, reaction time, and number of missed notifications) were analyzed using a series of repeated-measures ANOVAs. In cases where the normality assumption was violated (Shapiro-Wilk test, p < .05), we applied an Aligned Rank Transform (ART) before performing our analysis [52]. When needed, pairwise post-hoc tests (Bonferroni-adjusted p-values) were performed. For each variable, participant was treated as a random factor and adaptation method as a within-subject factor. The statistical analysis was performed in IBM SPSS Statistics 29.
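For illustration, the Friedman statistic on ordinal ratings can be computed in a few lines. This is a minimal sketch using average ranks, with the tie correction omitted for brevity; our actual analysis was run in SPSS:

```python
def average_ranks(row):
    """1-based ranks of one participant's ratings, with ties averaged."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(row):
        j = i
        while j + 1 < len(row) and row[order[j + 1]] == row[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average of positions i+1..j+1
        i = j + 1
    return ranks

def friedman_chi2(data):
    """Friedman chi-square for data[participant][condition] (no tie correction)."""
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        for cond, r in enumerate(average_ranks(row)):
            rank_sums[cond] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)

# Perfect agreement across 4 raters and 3 conditions yields the maximum n*(k-1):
assert friedman_chi2([[1, 2, 3]] * 4) == 8.0
```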
7.5.1 Primary task performance.
We did not observe significant main effects on typing errors (p = 0.452) or resumption lag (p = 0.861). Participants made M = 46.94 typos, SD = 21.18, and exhibited a resumption lag of M = 8.68 s, SD = 1.90. Individual conditions were within ∼2% of the mean for typos and ∼4% of the mean for resumption lag.
7.5.2 Secondary task performance.
Questionnaire responses are shown in Figure 9. Across participants, a total of 288 notifications were delivered during the experiment. We found a main effect of adaptation method on reaction time (F(3, 33) = 11.612, p < .001). On average, participants responded significantly faster to notifications in the standard (M = 16.29 s, SD = 3.69, p < 0.001) and high urgency (M = 20.29 s, SD = 4.30, p < 0.012) conditions than in the low urgency condition (M = 34.33 s, SD = 2.314). Participants missed three notifications in the low urgency condition and one notification in the medium urgency condition. This highlights that all adaptation methods deliver notifications that are noticed, albeit with longer reaction times for lower-urgency ones.
7.5.3 Self-reported metrics.
We found a main effect of adaptation method on confidence (χ²(3) = 14.234, p = 0.003), immediacy (χ²(3) = 24.786, p < 0.001), noticeability (χ²(3) = 24.677, p < 0.001), distraction (χ²(3) = 24.636, p < 0.001), and ranking (χ²(3) = 13.622, p = 0.003).
Compared to the standard adaptation method, participants regarded both the low urgency and medium urgency adaptation methods as less noticeable (low urgency: p = 0.03, medium urgency: p = 0.042) and less distracting (low urgency: p = 0.012, medium urgency: p = 0.042). They also reported responding to the standard adaptation method more immediately than to the low urgency adaptation method (p = 0.03). Compared to the high urgency adaptation method, participants regarded the low urgency adaptation method as less noticeable (p = 0.03) and less distracting (p = 0.03). They also reported responding to the high urgency adaptation method more immediately than to the low urgency adaptation method (p = 0.03). Lastly, participants reported preferring the medium urgency adaptation method over the standard adaptation method (p = 0.042).
7.5.4 Discussion.
Within the context of the songs and notifications we evaluated, participants were generally highly successful in detecting notifications. While this can partly be attributed to the fact that they were aware notifications would occur (i.e., cued detection), it nevertheless confirms that even notifications designed for low urgency are generally noticeable. Participants' recognition speed and subjective ratings of perceived noticeability largely align, with standard and high urgency being rated more noticeable than the other two conditions. The ratings of distraction are directly related to noticeability, with standard and high urgency being perceived as more distracting. Qualitative comments, however, also point in a different direction, where low noticeability can itself be perceived as distracting, as noted by one participant: "I liked the way that the earlier styles [low urgency] were woven into the music rather than causing the music to stop or grow very faint, but depending on the song this might be more annoying and could be rather hard to detect." (P9) Participants generally preferred a balance between noticeable and distracting, seemingly best fulfilled by the medium-urgency condition, as reflected in comments such as "While the more loud ones definitely caught my attention, I think that the more in-between notification sounds were ideal." (P5)
Participants could also identify, or at least appreciate, certain manipulations MARingBA performed. One participant reflected on how they enjoyed volume adjustments, including track-specific volume attenuation and fading: “I like it when the notifications come in, the music volume will be slightly reduced, and the notification volume gradually increases. I don’t like the places where the notifications just kick in without easing.” (P7) Finally, even though some participants could not identify certain features of our approach, they perceived beat and key matching as desirable, as noted by one participant: “I do not know if it was an accident, but one of the notification sounds came up exactly on the same beat of the Happy song, and I felt that it was actually nice for the notification to enter the song in the same tempo.” (P12). These results suggest that our approach may enable a better balance between noticeability and distraction and that participants preferred our music-adaptive notifications over standard delivery.
9 Discussion
We contribute a novel music-adaptive approach to delivering audio notifications such as ringtones. Our work explores the design space of possible audio manipulations, such as beat matching, key matching, and timbre adjustments. We integrate those manipulations into MARingBA, a novel system that enables content creators to design adaptive audio notifications. We explored our system in an elicitation study with experts who designed a variety of ringtones for different songs. Insights from the study indicate that MARingBA enables them to explore the parameter space efficiently. We used the parameters obtained in the design study in a preliminary evaluation with end users. Results indicate that, in the context of the songs and notifications we tested, our example set of music-adaptive audio notifications provided users with noticeable signals with varying reaction times and was preferred over a standard delivery baseline. We believe that music-adaptive notifications are a feasible complement, or even replacement, for current notification delivery methods. There remain, however, unexplored areas of potential challenges and opportunities that we hope to investigate in the future.
9.1 Generalizability
In our studies, we generated and tested an initial set of parameters for integrating notifications into different musical contexts. While the studies indicate that our method may be promising for modulating the noticeability of audio notifications, future work remains to identify optimal parameters for various urgency contexts and to ensure our approach's generalizability.
First, to obtain the parameters for the various urgency conditions in our second study, we computed the mean for continuous values and determined the mode for categorical values. This process yielded an initial set of parameters that appeared to have the desired effect on users in a preliminary evaluation. Whether these parameters are the right values for the various urgency conditions, however, remains an open question. Additional empirical studies that test the parameters in a more controlled manner are needed to tease out their individual effects and interactions.
Second, while we aimed to curate a diverse set of notifications and songs, further experimentation is needed to ensure the generalizability of different parameter settings. It is important to note that the songs we used in our evaluation followed conventional Western tuning, consistent rhythm, and a standard key signature. Whether our findings and approach generalize to songs that do not follow these conventions therefore remains unclear. To explore this challenge, we tested our system with input songs that included complex and uncommon harmonic structures, key changes, and time signatures. MARingBA synchronized ringtones to the examples with complex harmonic structures and key changes with some success, although the notifications may still be perceived as out of place due to their specific musical qualities, such as instrument timbre and harmonic consistency. If the time signatures of the music are highly irregular, however, our approach cannot reliably extract the beat. This makes integration challenging and can lead to sub-optimal placement of notifications. Future work can address these challenges by developing better MIR approaches that detect unconventional time signatures and harmony on a chord-by-chord basis. Allowing content creators to customize parameter settings for specific genres, intensities, and time signatures would also make our approach more broadly applicable.
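To illustrate why irregular meters make beat extraction unreliable, consider a toy tempo estimator over beat-onset times. The onset values below are hypothetical, and MARingBA relies on proper MIR beat tracking rather than this heuristic:

```python
from statistics import median, pstdev

def estimate_tempo(onsets):
    """Estimate tempo (BPM) from beat-onset times and report the relative
    spread of inter-beat intervals; a large spread means the single-tempo
    assumption (and thus a regular beat grid) is unreliable."""
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    med = median(intervals)
    return 60.0 / med, pstdev(intervals) / med

# A steady 120 BPM grid vs. an irregular (e.g., mixed-meter) onset pattern.
steady_bpm, steady_spread = estimate_tempo([i * 0.5 for i in range(9)])
irregular_bpm, irregular_spread = estimate_tempo([0.0, 0.5, 0.8, 1.5, 1.9, 2.6])
# Both report ~120 BPM, but only the steady grid has near-zero spread;
# the irregular pattern's high spread flags an unusable beat grid.
```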
9.2 Pre-processing of input music
Our current implementation requires pre-processing every input song, which takes approximately one minute per song. Since pre-processing only needs to occur once per song, this information can be made available to systems such as MARingBA either by the on-device music library, which can process the upcoming song ahead of time, or by a streaming service, which can ship it with the song's metadata. This removes the challenges of real-time processing and makes the necessary information available in seconds. The amount of metadata required for each song may differ based on its tempo and duration, but we estimate it to be less than 10 kB per song. This metadata includes information such as beat and downbeat onsets, tempo, key, and panning. If source-separated stems are used, their size will be similar to that of the original song.
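The under-10 kB estimate can be sanity-checked with a back-of-the-envelope serialization; the field names and the 4/4, 120 BPM, 4-minute song below are our assumptions for illustration, not MARingBA's actual storage format:

```python
import json

tempo_bpm, duration_s = 120.0, 240.0  # assume a 4-minute song at 120 BPM
beats = [round(i * 60.0 / tempo_bpm, 3)
         for i in range(int(duration_s * tempo_bpm / 60.0))]  # 480 beat onsets
metadata = {
    "tempo_bpm": tempo_bpm,
    "key": "C major",
    "beats": beats,                    # beat onsets in seconds
    "downbeats": beats[::4],           # assuming 4/4: every fourth beat
    "panning": [0.0, -0.2, 0.2, 0.0],  # coarse per-section pan positions
}
size_kb = len(json.dumps(metadata).encode("utf-8")) / 1024
# Even as plain JSON this stays well under 10 kB; a binary encoding is smaller.
```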
9.3 Additional parameters
We provide an initial exploration of the design space of adaptive sound notifications and hope that our work serves as a foundation for future investigations. While the parameters we have examined offer significant coverage, there is still room for more. For instance, we have yet to explore common audio effects used by music producers, such as delay, distortion, and chorus. Additionally, we still need to investigate changing the ringtone to match the music's genre or manipulating the ringtone's original melody. We plan to dive deeper into this design space in the future and explore the benefits and limitations of the individual parameters.
9.4 Personalization
The settings we tested, i.e., the parameters of the audio notifications that the experts in the design study created, worked well across the different songs and led to desirable task performance. Low-urgency notifications were noticed later than high-urgency ones and perceived as less distracting. In the qualitative results, however, we saw that several participants preferred the low-urgency notifications, whereas others wanted more salient signals, like the medium-urgency settings. This hints at opportunities for personalization. Similar to current ringtones on phones, we believe that future versions of our approach should allow users to personalize their settings and give end users agency in the design of the notifications. We are eager to explore the granularity of such personalization, from a single "strength" parameter to control over individual parameters such as timbre or key matching.
9.5 Multi-modal notifications
Our current approach focuses on audio notifications. Current interactive systems, however, deliver content and information through a wide range of modalities, from visual notifications in desktop or Mixed Reality settings, to haptic notifications through vibrations on a smartphone, to smell and taste. We plan to pair music-adaptive audio adaptations with other modalities in the future to investigate how well they can be integrated into a multi-modal delivery mechanism. Additionally, we hope to explore applying our adaptive approach to other modalities. One could easily imagine an approach where the vibration of a haptic notification for a phone call is synchronized to the music a user is currently listening to, or where visual notifications in Mixed Reality appear in a style that matches the musical genre. We are excited to explore those combinations in the future and find out which modalities are best suited for musical adaptations to be less disruptive, or to potentially enhance the music listening experience for users.
9.6 In-the-wild studies
We evaluated our approach with end users in a highly controlled environment, where we controlled the space, the task, and the music participants were listening to. We hope to expand our evaluation to more users and more contexts in the future. Other tasks (e.g., running, shopping) and spaces (e.g., indoor, outdoor) will inevitably influence participants' ability to detect audio notifications such as ringtones. Additionally, we believe that the question of notification scheduling merits further investigation. For example, delaying notifications to opportune moments in later sections of the song, or even the next song, might be desirable during focused work to minimize switching costs. In other scenarios, such as phone calls, delivery cannot be delayed drastically, since the caller might hang up. We hope to explore this aspect in the future. Finally, we plan to expand our music-adaptive approach with physiological sensing [22] to balance noticeability, urgency, and distraction. Our music-adaptive content delivery is a first step towards context-aware delivery of audio notifications that are noticeable, not distracting, and ultimately beneficial for end users.