DOI: 10.1145/3613904.3642376
MARingBA: Music-Adaptive Ringtones for Blended Audio Notification Delivery

Published: 11 May 2024

Abstract

Audio notifications provide users with an efficient way to access information beyond their current focus of attention. Current notification delivery methods, like phone ringtones, are primarily optimized for high noticeability, enhancing situational awareness in some scenarios but causing disruption and annoyance in others. In this work, we build on the observation that music listening is now a commonplace practice and present MARingBA, a novel approach that blends ringtones into background music to modulate their noticeability. We contribute a design space exploration of music-adaptive manipulation parameters, including beat matching, key matching, and timbre modifications, to tailor ringtones to different songs. Through two studies, we demonstrate that MARingBA supports content creators in authoring audio notifications that fit low, medium, and high levels of urgency and noticeability. Additionally, end users prefer music-adaptive audio notifications over conventional delivery methods, such as volume fading.
Figure 1: We present MARingBA, an approach that automatically blends audio notifications into the music users are listening to. The user is reading a book when an audio notification arrives. Our system adapts the notification to match the beat and key of the background music, creating a less disruptive experience.

1 Introduction

Audio notifications, like ringtones, play a crucial role in how users receive information and are widely employed by modern digital devices, especially mobile phones, to relay time-sensitive updates, including incoming calls, upcoming calendar events, or messages. Presently, most auditory notifications are intentionally designed to be highly noticeable, ensuring users don’t miss them. While this design effectively captures the user’s attention, it can also lead to disruptions in various contexts.
Consider this scenario: a user is engrossed in a book while enjoying their favorite song (Figure 1). In the event of a notification, most digital devices will automatically lower the background music volume and deliver the alert audibly. While this approach guarantees that the user detects the notification, it often proves distracting and disruptive to the user’s music-listening experience. This is particularly true for non-urgent notifications that require a timely but not necessarily immediate response by end users.
Previously, researchers have addressed this challenge through methods like substituting notifications with audio effects [5, 9], integrating music snippets in pre-composed music soundscapes [16, 35], and embedding ringtone sounds in single-timbre music using timbre transfer techniques [53, 54]. While these prior approaches have shown promise in creating more musically integrated notifications, they are subject to two significant limitations. First, they often rely on completely replacing familiar notifications, which can diminish recognizability and user comfort. Second, many methods are customized for predefined musical contexts or require songs to be composed with notifications in mind, limiting their adaptability to the diverse music preferences of today’s users.
In our work, we introduce a new method that seamlessly incorporates audio notifications into users’ musical experiences. Inspired by digital music practices such as disc jockeying and remixing, along with concepts from music information retrieval (MIR), we provide an initial exploration of the parameter design space for auditory manipulations to create music-adaptive audio notifications. These parameters, including beat matching, key matching, and timbre modifications, are designed to facilitate the seamless integration of notifications into musical sequences while allowing for customizable degrees of blending. Our work lays the foundation for future, more exhaustive explorations of the design space of music-adaptive notifications.
To validate these parameters and investigate the user experience of a more music-adaptive approach to delivering audio notifications, we further developed MARingBA, an interactive system enabling real-time manipulation of notifications. MARingBA  is designed for content creators and designers of audio notifications. MARingBA incorporates a suite of automated mechanisms for extracting music information and serves as a prototype interface for experimenting with and creating music-adaptive notifications using our design space parameters. MARingBA  uses a novel combination of established techniques in music computing to enable the rapid exploration of music-adaptive notifications. With its various parameter settings, content creators and designers can quickly define ways of automatically integrating ringtones into multiple songs.
Through two studies, we gather insights into our approach, design space, and system from the perspective of two main stakeholders: (1) content creators responsible for designing audio notifications, and (2) end users who may receive these notifications in the future. Our initial study with six music experts revealed that our design space is highly expressive, enabling them to tailor notification designs to diverse contexts. They were notably able to use MARingBA to blend notifications with multiple songs and accommodate various noticeability and urgency requirements (e. g.,  designing for casual weather alerts versus work scenarios requiring an immediate response). In a second experiment with end users, we preliminarily evaluated whether our parameters could modulate the noticeability of audio notifications while producing a preferred user experience to standard notification delivery mechanisms.
In summary, we make the following contributions:
An initial design space exploration of parameters for adapting notifications to a background musical context in a harmonic manner, which also enables the modulation of their noticeability,
MARingBA, a system that implements our design space parameters for authoring music-adaptive notifications by leveraging a novel combination of techniques from music computing,
Insights from a study with content creators (n = 6) on the utility of our parameters and system for creating music-adaptive notifications,
Results from an initial usage study with end users (n = 12) showing that an example set of adaptation parameters yielded notifications that are preferred over a standard volume-fading baseline while exhibiting controllable detection rates.
In addition to detailing our contributions, we provide a background section of well-established fundamental music concepts in Section 3. Audio examples of ringtone blends generated by our system can be found at https://augmented-perception.org/publications/2024-maringba.html.

2 Related Work

Our work builds on relevant related work from the fields of digital notifications, audio notifications, and music information retrieval.

2.1 Digital Notifications

Notifications are a ubiquitous feature of modern digital devices and a longstanding topic of interest within HCI research [38, 47]. By proactively delivering visual, haptic, or auditory alerts, notifications serve as an efficient way to convey information to users from sources outside their primary focus of attention [44, 47]. Early research has shown that notifications may benefit users’ informational awareness [44]; however, if presented at an inopportune time or at an inappropriate frequency, notifications become a source of disruption and annoyance [1, 8, 13]. Disruptions from such notifications lead to errors [1], anxiety [8], and productivity loss [13]. As the proliferation of digital interfaces generates an ever-increasing volume of notifications, researchers have continued to study and improve notifications in various devices (e. g.,  multi- and cross-device ecosystems [18, 51], smart homes and intelligent living environments [37, 50], virtual reality [25]).
One substantial subset of the existing literature has investigated when notifications should optimally be delivered. For instance, early work by Czerwinski et al. [20] and Horvitz [29] found that the disruptiveness of notifications depends on their contents, the nature of the task that the user is engaged in, as well as the user’s level of engagement. Related research also found that scheduling notifications at natural task breakpoints reduces the cost of interruption [2, 7]. Consequently, Bailey and Konstan [7] and Iqbal and Bailey [32] have suggested that it may be valuable to imbue devices with some level of awareness of the user’s task structures and attention. Iqbal and Bailey [33] implemented a computational approach that operationalized these insights in the desktop domain. Hudson et al.’s Wizard of Oz study demonstrated the potential value of making such predictions about interruptibility with sensors [30].
There is also a complementary line of prior work investigating how notifications should be designed. Arroyo et al. [6], for instance, found an interaction effect between the modality in which a notification was presented and participants’ prior experiences on its effectiveness. In the desktop domain, Müller et al. [45] found that users’ ability to detect a notification depends on its background and placement on the screen. Prior research generally agrees that the noticeability or attentional draw of notifications should be proportional to their utility [26, 40, 41, 42, 46].
Our work primarily aims to innovate how notifications can be delivered. Drawing inspiration from Gluck et al. [26] in particular, we introduce an approach to modulating the attentional draw of notifications by embedding them within background music to varying degrees. Our approach intends to serve as a foundation for adaptively curating notifications based on their potential utility to the end user in future applications.

2.2 Audio Notifications

In the realm of auditory information presentation, two prominent methods have emerged: auditory icons, as proposed by Gaver [24], and earcons, as introduced by Blattner et al. [11]. Auditory icons leverage real-world sounds corresponding to their virtual function, further conveying multi-dimensional data by modulating various sound qualities. In contrast, earcons consist of composed sequences with no inherent association to their representation, requiring users to learn the connection.
Both earcons and auditory icons have been explored to deliver notifications to users via mobile devices [23]. They have demonstrated efficacy in critical contexts [27] and in enhancing task performance [39]. Nevertheless, while audio notifications can effectively alert users, they may also introduce irritations and disruptions [15]. We aim to alleviate the disruption induced by audio notifications through their integration with musical elements.
Prior work aimed to make audio notifications less intrusive with various methods. Jung and Butz [16, 35] modified pre-composed music to alert users about incoming information. Their modification included adding or omitting specific instruments to deliver target notifications. Their approach requires music to be composed with the notifications in mind, while ours generalizes to more music. Ananthabhotla and Paradiso [5] substituted audio notification content with audio manipulations on the music itself in their SoundSignaling system to deliver audio notifications. They introduced subtle signals that rely on users’ familiarity with a song to detect a notification. Since the manipulations are very subtle (e. g.,  temporarily changing the rhythm of a song slightly), they are limited in the types of notifications they can deliver. Finally, Yang et al. [53, 54] used timbre transfer to embed ringtones into music. However, their approach still follows a conventional delivery format where the music track itself is faded out, or muted, during the ringtone notification. Their approach does not consider other essential manipulations such as key matching or beat matching to integrate a notification better into music. We take timbre transfer as one building block in the larger design space of music-adaptive audio notifications.
Barrington et al. [9] leveraged audio notifications as standalone ambient displays, manipulating an existing music track to communicate human affect. Kari et al. [36] modified songs to align with the affordances of car rides. The manipulations of both approaches inspired how audio is processed in MARingBA.

2.3 Sound and Music Computing

Our work relies on Music Information Retrieval (MIR) techniques, particularly for deriving quantifiable features from audio signals. Within MIR, audio manipulation is used to create DJ systems [34], mash-up systems [21], and automatic music mixing systems [48] that aim to support human experts in their respective domains. By finding the beat, key, downbeat, and structural segments of each individual song, these automatic systems are capable of making the necessary audio manipulations and selection of pre-existing music tracks to create seamless mixes. MARingBA shares its underlying techniques, such as beat or key matching, with commercial DJ software and digital audio workstations (e. g.,  Algoriddim [4], FL Studio [31]). While some DJ software offers automatic song transitions (e. g.,  Algoriddim AutomixAI [3], VirtualDJ [49]), they focus on transitions from the current song to the next one, rather than maintaining a continuous blend throughout a ringtone’s duration and smoothly reverting to the original track. Although similar effects can ultimately be achieved with commercial software through manual composition, MARingBA significantly simplifies the process through the parameterization of automated modifications. This makes it more convenient for designers to prototype and test settings to match their desired usage, such as varying notification noticeability and timeliness, and re-use settings across songs. MARingBA differs from automated DJ features in terms of application (live performance versus end-user experience) and musical output (song transitions versus integration of a monophonic melody across multiple songs).

3 Background

We first provide a summary of background information on several well-established fundamental music concepts relevant to our work. This section can be skipped by knowledgeable readers. For a more comprehensive introduction to music theory and computer music, we refer interested readers to Blatter [10] and Collins [17] respectively. Additionally, we describe the limitations of current audio notification delivery approaches.

3.1 Relevant music concepts

Music concepts relevant to our work include tempo, pitch, note, and key. Throughout the paper, we illustrate these parameters as in Figure 2. Horizontal bars represent individual notes, played at a specific pitch (y-axis). The notes follow a certain tempo (i. e.,  beat), indicated by the grid cells.
Tempo refers to the speed or pace at which a piece of music is performed, typically measured in beats per minute (BPM). This intuitively corresponds to the rate at which people naturally tap their feet when they listen to music.
Pitch refers to a sound’s perceived highness or lowness, which is determined by the frequency of its vibrations and measured typically in hertz (Hz). Higher frequencies correspond to higher pitches, and lower frequencies correspond to lower pitches. In music composition, sounds of different pitches are referred to as notes.
A key refers to a set of pitches or notes. Songs and musical compositions typically adhere to notes belonging to a single key (e. g.,  C major). Introducing additional off-key notes typically results in undesirable dissonance (i. e.,  notes that are not in harmony), with notable exceptions in experimental music and jazz, for example, in which dissonance is carefully used as a design element.
Timbre is a broad term used to describe the unique characteristics of a sound beyond its pitch and is colloquially described as the “quality of a musical note or sound.” It is determined by the combination of overtone frequencies of a sound. An effective way to grasp timbre is by considering, for example, how a guitar and a violin can play the same music at the same pitch and intensity yet sound distinct from each other.
Figure 2: Non-adaptive integration of ringtones through muting and overlaying. Top: Integrating a ringtone (yellow) into music (green) by muting the song. The blacked-out area indicates a volume decrease. Bottom: Integrating a ringtone into music without muting the song. Unaligned tempo and notes of dissonance are annotated in red.

3.2 Audio notifications

On current devices, audio notifications are typically delivered in one of two ways: either they mute whatever the user is listening to or they are directly overlaid. If the user was previously listening to music, the muting approach ensures that the notification is noticed but fully interrupts the user’s listening experience.
Alternatively, if the notification is overlaid, users can still hear the music, albeit quieter if its volume is decreased. The sounds from the two sources, however, may clash and result in an unpleasant experience for the human ear. From a musical perspective, this dissonance can be attributed to several factors, such as misalignments in tempo or key, as illustrated in Figure 2. Most music is composed at a consistent tempo, so introducing a rhythmic sound that doesn’t align with this tempo, e. g.,  because it is faster or slower, can disturb the listener. Similarly, when the pitch of the notification doesn’t match the music’s key, it will be perceived as out of place. Overall, unless an immediate user response is required, both conventional notification delivery mechanisms—mute and overlay—may lead to a sub-optimal experience, as they do not consider the musical context in which the user is situated.

4 The MARingBA Parameter Space

Our goal is to automatically generate ringtone-music blends that resemble the quality of manually crafted mixes by human mash-up artists. To achieve this, we propose an approach centered around defining an initial set of distinct music feature modification parameters. These parameters are grounded in music theory, and inspired by music practices that involve blending multiple musical audio sequences together, such as DJing or sampling. Although our design space exploration is not exhaustive, it offers foundational insights into our approach and points toward potential areas that warrant further investigation. In the following, we provide an in-depth description of these parameters, their conceptual implementation, and their role in achieving effective ringtone-music adaptation.

4.1 Beat matching

Beat matching refers to slowing down or speeding up one or both of the clips until their tempos match. This technique is used by DJs and mash-up artists to align the tempo of two different songs and create a synchronized mix that listeners can dance to. Beat matching also involves synchronizing the onsets of beats so that they align across songs.
Assuming the timestamp of every beat in a piece of music is known, e. g.,  by using rhythm extraction software [14], we first calculate the average interval (in seconds) between consecutive beats in a song. The average tempo in beats-per-minute (BPM) is then calculated as
\begin{equation} \text{average tempo (BPM)} = \frac{60}{\text{average interval}} \qquad (1) \end{equation}
We can then use the tempo estimation to beat-match the notification audio to synchronize with the music by calculating
\begin{equation} \text{time-stretch amount} = \frac{\text{average tempo music}}{\text{average tempo ringtone}} \qquad (2) \end{equation}
Figure 3: Top: a naive implementation. The red lines here represent the beat timings of the ringtone (yellow) and are unaligned with the music (green). Bottom: the ringtone is beat-matched by stretching the audio to be played at a slower rate. This perfectly aligns the beats with the music.
The time stretch amount is then applied to the ringtone to match the tempo of the music. Furthermore, the beat onset of the ringtone is aligned with the beat of the music. Figure 3 illustrates a naive implementation with misaligned beats and a beat-matched version.
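As a minimal illustration, the following Python sketch computes Equations (1) and (2) from extracted beat timestamps. The function names and the synthetic beat grids are illustrative assumptions, not code from our system:

```python
# A minimal sketch of Equations (1) and (2), assuming beat timestamps
# (in seconds) have already been extracted, e. g., with a rhythm
# extraction library.
import numpy as np

def average_tempo_bpm(beat_times: np.ndarray) -> float:
    """Equation (1): average tempo from consecutive beat intervals."""
    intervals = np.diff(beat_times)          # seconds between beats
    return 60.0 / intervals.mean()

def time_stretch_amount(music_beats: np.ndarray, ringtone_beats: np.ndarray) -> float:
    """Equation (2): factor by which to time-stretch the ringtone."""
    return average_tempo_bpm(music_beats) / average_tempo_bpm(ringtone_beats)

# Example: a 126 BPM song and a 158 BPM ringtone.
music = np.arange(0, 10, 60 / 126)           # synthetic, perfectly regular beat grid
ringtone = np.arange(0, 4, 60 / 158)
print(time_stretch_amount(music, ringtone))  # ~0.797: play the ringtone slower
```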

4.2 Key matching

When pitches outside of the music’s key are introduced, dissonant clashes can occur. Key matching, sometimes also referred to as harmonic mixing, ensures that two clips being blended share a compatible key. One simple way to ensure harmony is to shift the pitch (i. e.,  frequencies) of the ringtone to match the key of the song. In Western music theory, frequencies are split into 12 equidistant notes, which are the main building blocks for all keys. The distance between two consecutive notes is referred to as a semitone and is represented by a frequency ratio of the twelfth root of two. Semitones are the smallest interval in Western music and can be used as a metric to define other intervals in terms of the number of semitones between them. Key matching through pitch shifting thus amounts to moving notes by shifting their frequencies by the correct number of semitones.
To achieve this, the key-matched frequency is calculated as
\begin{equation} \text{key-matched freq} = \text{original freq} \times \left(\sqrt [12]{2}\right)^{\text{semitones}} \qquad (3) \end{equation}
For example, to match a ringtone in the key of B to a song in the key of C, i. e., a difference of one semitone, the ringtone should be pitch-shifted upwards by 1 semitone (multiplying the frequency of B by the twelfth root of two yields the frequency of C). As shown in Figure 4, the ringtone does not use pitches that fit into the key of the music. We rectify this by pitch-shifting the ringtone up by one semitone, avoiding dissonant notes and staying in harmony with the music.
Figure 4: The horizontal red lines represent the pitches that do not fit in the music’s key. Top: a naive implementation. Here, the ringtone pitches (yellow) do not match the music (green) and create dissonance as a result. Bottom: the ringtone is key-matched by pitch-shifting the audio to be played at a higher pitch, where the note pitches are perfectly aligned with the key of the music.
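As a worked illustration of Equation (3), the following Python sketch shifts a frequency by a signed number of semitones; the example frequencies are standard equal-temperament values, not measurements from our system:

```python
# A minimal sketch of Equation (3): shifting a frequency by a signed
# number of semitones using the equal-temperament ratio 2^(1/12).
def key_matched_freq(original_freq: float, semitones: int) -> float:
    return original_freq * (2 ** (1 / 12)) ** semitones

# Example from the text: B4 (~493.88 Hz) shifted up one semitone gives C5.
print(key_matched_freq(493.88, +1))   # ~523.25 Hz (C5)
print(key_matched_freq(523.25, -1))   # ~493.88 Hz (back down to B4)
```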

4.3 Scheduling

Conventional notification delivery mechanisms typically alert the user as soon as notifications are received. In a musical context, this may coincide with an undesirable temporal placement where the notification is not aligned with the background rhythm. There are several ways to mitigate this challenge, including delaying playback until the next beat, the next bar, or the start of the next four bars. In music theory, a bar, or measure, is a segment of time that groups multiple beats, usually four in mainstream music. Structurally, music often has repetitions and variations that happen at the start of every four bars, making those positions suitable candidates for blending ringtones naturally into the structure of the music.
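The following Python sketch illustrates these three scheduling modes, assuming beat and downbeat timestamps (in seconds) from MIR pre-processing; treating every fourth downbeat after the first as a phrase start mirrors the heuristic described later in section 5.2.3. All names are illustrative:

```python
# A minimal sketch of the scheduling modes, assuming beat and downbeat
# timestamps (in seconds) are available from MIR pre-processing.
import bisect

def next_onset(arrival: float, onsets: list[float]) -> float:
    """First onset at or after the notification's arrival time."""
    i = bisect.bisect_left(onsets, arrival)
    return onsets[i] if i < len(onsets) else onsets[-1]  # clamp near song end

def schedule(arrival: float, beats: list[float],
             downbeats: list[float], mode: str) -> float:
    if mode == "next_beat":
        return next_onset(arrival, beats)
    if mode == "next_bar":
        return next_onset(arrival, downbeats)
    if mode == "next_four_bars":
        # Every fourth downbeat after the first is treated as a phrase start.
        return next_onset(arrival, downbeats[::4])
    raise ValueError(f"unknown mode: {mode}")
```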

4.4 Panning

Panning refers to how the audio is distributed across different channels in a stereo sound system (e. g., headphones). A sound that is panned to the left will result in more volume from the left speaker, for example. Panning can be used in music-adaptive ringtones to make the sound stand out as coming from a different direction than other sounds presented in the music mix. In our approach, the ringtone sound is panned toward the side with less intensity. The song shown in Figure 5, for example, begins with a sequence that is panned to the left side. A ringtone could be panned to the right to counterbalance the difference in volume and balance the stereo sound.
Figure 5: A plot of the stereo balance of a song. Values above 0.0 relate to the left stereo channel. Values below 0.0 relate to the right channel. If a song is around 0.0, the volume between the two channels is balanced, i. e., both have roughly the same volume. Initially, the song is predominantly panned to the left side, which can be used to integrate a ringtone on the right channel, for example.
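The per-beat balance analysis behind Figure 5 can be sketched as follows in Python; the RMS-difference convention (positive = left louder) matches the figure, while the function names and the slider mapping are illustrative assumptions:

```python
# A minimal sketch of per-beat stereo balance and counter-panning,
# assuming separate left/right sample arrays and beat boundaries
# expressed as sample indices.
import numpy as np

def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(x ** 2)))

def beat_balance(left: np.ndarray, right: np.ndarray, beat_samples: list[int]) -> np.ndarray:
    """RMS(left) - RMS(right) per beat segment; positive = left louder."""
    return np.array([rms(left[a:b]) - rms(right[a:b])
                     for a, b in zip(beat_samples[:-1], beat_samples[1:])])

def ringtone_pan(balance_now: float, slider: float) -> float:
    """Pan opposite the louder side; slider in [0, 100] scales the effect."""
    return float(np.clip(-balance_now * (slider / 100.0), -1.0, 1.0))
```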

4.5 Timbre

There exist various ways to manipulate the timbre of the ringtone, including instrument transfer, reverb dry/wet level, reverb decay time, and high-pass and low-pass filter thresholds.
Instrument Transfer. Aside from the original sound of each ringtone, we allow the ringtone’s instrument to be replaced with either a piano, violin, or synthesizer, while keeping the same melody and rhythm as the original. Transferring the instrument tone used to play the ringtone can result in a more integrated mix depending on the instrumentation of the song. The original iPhone ringtone, for example, is played on the Marimba, a type of mallet percussion with African origins. This ringtone might not blend well into synth-heavy dance tracks, and would benefit from being transferred to a synthesizer, for example.
Reverb. Reverb refers to the reflections of a sound from the environment it is played in. More spacious environments lead to more reverb signals and longer decay times. Adding simulated reverb on ringtones helps with music integration, as reverb makes the sound more natural by simulating its presence in a physical space.
Frequency filtering. Frequency filtering is the removal of frequency content from a sound that is rich in frequencies. If there were a significant overlap in frequencies between the ringtone and music, the ringtone would be less noticeable. Removing part of the ringtone’s frequencies can benefit ringtone integration because it can “free up” room for the song, without significantly impacting the ringtone’s volume.

4.6 Volume

Lastly, we can manipulate the volume of the ringtone and music clips in several ways.
Ringtone volume. The volume of a ringtone directly affects how noticeable it is compared to background music. Lower-volume ringtones tend to blend more subtly than high-volume ones.
Fade-in. Fade-in refers to starting at a low volume and gradually transitioning to a stable volume level. This transition eases the introduction of the ringtone and makes integration smoother.
Adaptive volume. Rather than setting a fixed ringtone volume, we support configuring it to automatically adapt based on the input music — increasing for louder songs and decreasing for softer ones.
Track-specific attenuation. In addition to adjusting the ringtone volume, we also support making volume adjustments to specific tracks within the background music. For instance, we can temporarily lower the vocals’ volume in a song while maintaining full accompaniment volume. This creates additional room for blending in the ringtone without interrupting the music flow. We note that individual tracks can be acquired either directly from the original mix or automatically extracted using source separation technology (e. g., Spleeter [28]).
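To make the volume manipulations concrete, the following Python sketch combines a linear fade-in envelope with an adaptive gain that tracks the music’s loudness; the parameter names and the specific gain mapping are assumptions for illustration, not our system’s exact behavior:

```python
# A minimal sketch of two volume manipulations: a fade-in envelope for
# the ringtone and an adaptive gain that boosts it for louder songs.
import numpy as np

def fade_in_envelope(n_samples: int, sr: int, fade_s: float) -> np.ndarray:
    """Ramp from 0 to 1 over fade_s seconds, then hold at 1."""
    n_fade = min(int(fade_s * sr), n_samples)
    env = np.ones(n_samples)
    env[:n_fade] = np.linspace(0.0, 1.0, n_fade)
    return env

def adaptive_gain(base_gain: float, music_rms: float,
                  reference_rms: float = 0.1, weight: float = 0.1) -> float:
    """Scale the ringtone gain up for loud music, down for soft music."""
    return base_gain * (1.0 + weight * (music_rms / reference_rms - 1.0))
```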
Figure 6: An overview of the MARingBA system. Music is processed with music information retrieval models to extract downbeat, key, tempo, and stereo balance. The data is then loaded into Unity to support audio modifications.

5 The MARingBA System

The goal of MARingBA  is to create music-adaptive ringtones that seamlessly blend with and adjust to any music a user listens to in a contextually sensitive manner. Notification adaptations should account for the ambient musical context and consider various usage scenarios, particularly the urgency of their contents, which arguably determines their desirable level of noticeability. To achieve this, we implemented the adaptation parameters described in section 4. Content creators can utilize these diverse manipulations to specify how and to what degree ringtones should adapt to and blend with any given song, consequently influencing their noticeability. We demonstrate this in our initial expert study and validate the designed ringtones in the preliminary user study.
Our current implementation includes features for extracting music information from the ringtone and background music, and performing real-time notification feature manipulation. An overview of the system is illustrated in Figure 6. Apart from information extraction and audio manipulations, MARingBA  provides content creators with mechanisms to select target notifications and songs for testing the parameters, playback controls, and manual triggers to simulate the arrival of notifications. Once content creators are satisfied with their parameter configurations, they can save this information as presets. In Table 1, we showcase an example collection of presets gathered for various urgency scenarios. Presets can subsequently be used at run-time to adapt any ringtone to any background music automatically.
Input. The system takes audio files (mp3, wav) of ringtones and user music as inputs. It does not require manual processing or annotation by content creators. One exception is that we manually generated the alternative instrument timbres of the ringtones for our current implementation of timbre transfer. This feature can, however, be automated in future implementations, for example by incorporating learning-based timbre transfer methods [53].

5.1 Music Information Extraction

We pre-processed the input songs and ringtones using Python 3.10 along with several established Music Information Retrieval (MIR) libraries to extract the musical features on which our adaptation parameter manipulations are applied. The pre-processing steps are executed once per song and take about one minute in our current implementation. Once all features are extracted, manipulations are performed in real-time.
We extracted the beat onsets (i. e.,  the beat start times) using ESSENTIA [14] and downbeats (i. e.,  the start of bars) using Madmom [12] to enable beat matching and scheduling. To support key matching, we estimated the key of the input audio using ESSENTIA [14]. To support fine-grained volume adjustments, we isolated the input songs into vocal and instrument tracks with Spleeter [28]. Lastly, we computed the left and right channel intensities of the input songs and notifications with Librosa [43] to support panning.
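A rough sketch of this pre-processing stage is shown below, using librosa as a single stand-in library; our actual pipeline uses ESSENTIA for beats and keys, Madmom for downbeats, and Spleeter for source separation, whose APIs are not reproduced here. A stereo input file is assumed:

```python
# A condensed, librosa-only approximation of the pre-processing stage;
# it extracts a beat grid and a per-frame stereo balance per song.
import librosa
import numpy as np

def preprocess(path: str) -> dict:
    y, sr = librosa.load(path, sr=None, mono=False)  # keep stereo, y: (2, n)
    mono = librosa.to_mono(y)
    tempo, beat_frames = librosa.beat.beat_track(y=mono, sr=sr)
    return {
        "tempo_bpm": float(tempo),
        "beat_times": librosa.frames_to_time(beat_frames, sr=sr),
        # Per-frame stereo balance: RMS(left) - RMS(right), cf. Figure 5.
        "balance": (librosa.feature.rms(y=y[0])[0]
                    - librosa.feature.rms(y=y[1])[0]),
    }
```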
Figure 7: The MARingBA system interface. The audio track area visualizes the song through a discretized spectrogram, with bass frequencies at the bottom. Color brightness represents intensity. Content creators can configure parameters using the control panel on the right.

5.2 Real-time Ringtone Feature Manipulation

Our system interface for performing ringtone feature manipulations is implemented in Unity 2021, shown in Figure 7. The individual controls described in section 4 are implemented as part of a Unity Component exposed in the Editor. Unity provided a versatile platform for real-time audio feature processing: leveraging its built-in audio engine, we executed essential sound modifications, including play scheduling, volume attenuation, panning adjustments, and pitch stretching. More advanced features such as frequency filtering, reverb dry/wet level, and decay time control were implemented by automating Unity mixer channel settings via C# scripts.

5.2.1 Beat matching.

MARingBA estimates the average tempo of the song and time-stretches the ringtone to match that tempo based on the beat onset information. Unity does not have a dedicated time-stretch feature; any alteration in speed thus leads to unintended changes in pitch (e. g., speeding up leads to a higher pitch). To control speed independently of pitch, we utilized Unity’s AudioSource.pitch property, which adjusts both the speed and the pitch of the audio clip (e. g., speeding up a track by a factor of 1.2 also raises its pitch by the same factor). We counteract this unintended pitch alteration with a separate Unity mixer channel setting that adjusts only pitch without affecting speed, configuring the mixer channel’s pitch to the inverse of the time-stretch factor, calculated as 1/AudioSource.pitch. Since some songs that are performed live fluctuate slightly in tempo, we dynamically re-align with the beat onsets every time the ringtone repeats. All calculations are automated; content creators can toggle this feature on or off.
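The arithmetic behind this workaround is simple enough to state as a small Python sketch (illustrative only; the actual implementation sets these two values on Unity’s AudioSource and mixer channel): playing a clip at speed factor s also raises its pitch by s, so a mixer pitch of 1/s restores the original pitch while keeping the new speed.

```python
# A minimal sketch of the speed/pitch compensation: the first value
# would drive the clip's playback speed (which also shifts pitch), the
# second is the compensating pitch applied on the mixer channel.
def beat_match_settings(music_bpm: float, ringtone_bpm: float) -> tuple[float, float]:
    speed = music_bpm / ringtone_bpm   # time-stretch amount, Equation (2)
    pitch_fix = 1.0 / speed            # cancels the pitch shift caused by speed
    return speed, pitch_fix

# Example: stretching a 158 BPM ringtone to a 126 BPM song.
speed, fix = beat_match_settings(126, 158)
print(speed, fix)  # ~0.797 and ~1.254; their product is 1.0 (net pitch unchanged)
```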

5.2.2 Key matching.

Using the key estimation, MARingBA finds the smallest pitch shift required to avoid dissonance and adjusts the pitch value of the Unity mixer channel to the target pitch. Because ringtones are short melodies rather than fully orchestrated songs, their pitches may fit into multiple different keys without causing dissonance. Our implementation of key matching exploits this quality by iterating through all possible target keys and selecting the smallest pitch shift that avoids dissonance, instead of pitch-shifting to match the exact key the song is in. All calculations are automated; content creators can toggle this feature on or off.
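One way to realize this minimal-shift search is sketched below in Python; the major-scale pitch-class table and the dissonance test are simplifying assumptions for illustration, not our system’s exact key model:

```python
# A minimal sketch of the minimal-shift key search: find the smallest
# absolute semitone shift that moves every ringtone pitch class into
# the song's (here: major) scale.
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}  # pitch classes relative to the key root

def fits_key(pitch_classes: set[int], key_root: int) -> bool:
    return all((pc - key_root) % 12 in MAJOR_SCALE for pc in pitch_classes)

def minimal_shift(ringtone_pcs: set[int], song_key_root: int) -> int:
    """Smallest |semitone shift| making the ringtone consonant with the song."""
    for shift in sorted(range(-6, 7), key=abs):  # try 0, then ±1, ±2, ...
        shifted = {(pc + shift) % 12 for pc in ringtone_pcs}
        if fits_key(shifted, song_key_root):
            return shift
    return 0  # fall back to no shift if nothing fits

# Example: a ringtone on {B, D#, F#} (11, 3, 6) against a song in C major (0).
print(minimal_shift({11, 3, 6}, 0))  # 1: shift up one semitone to {C, E, G}
```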

5.2.3 Scheduling.

Our system enables content creators to define whether a ringtone should be scheduled to (1) the next beat, (2) the next bar, or (3) the start of the next group of four bars. Timestamps of beats and downbeats (bar starts) are estimated using MIR libraries. We assume that the first downbeat of the song is the start of a section and keep an internal counter that labels every fourth downbeat after it as the start of a four-bar group. While this assumption worked well for the approximately two dozen popular songs we tested, it might not hold for all music; we hope to implement techniques to overcome this in the future. Scheduling is automated, and content creators choose between the different modes through a drop-down menu.

5.2.4 Panning.

We leverage a beat-by-beat comparison of root-mean-square (RMS) values between the left and right channels to determine if one has a louder volume. Positive values indicate a higher volume on the left channel, and negative values indicate a higher volume on the right channel. MARingBA pans the ringtone sound using Unity’s AudioSource.panStereo, setting it based on the volume difference on the current beat. In our system, content creators control the panning of notifications using a simple slider with values between 0 and 100. The slider value acts as a multiplier on the balance data of the current song: a value of 0 keeps the sound of the ringtone centered, and if one side is louder at the time a notification is played, MARingBA plays the notification on the other side, multiplying the difference in RMS value by the slider value to determine the pan position of the ringtone.

5.2.5 Timbre.

In our current implementation, instrument transfer is enabled by pre-rendering (i. e., manual transcription by the content creator) the ringtone in different instrument tones (piano, violin, or synthesizer) and selectively unmuting only one instrument track at a time based on the user-selected instrument. As mentioned above, this manual approach could be automated through learning-based timbre transfer techniques [53]. Reverb and filtering are both based on Unity’s mixer channel effects. The Unity Audio SFX Reverb effect is used for reverb; highpass and lowpass effects are used for frequency filtering. These effects are added to the same mixer channel that the notification is routed to.
In addition to instrument transfer, content creators can control the decay time and dry/wet level of the ringtone. Decay time controls how long it takes for the reflected sound to decrease in intensity, which dictates the perceived space of the reverb simulation. A longer decay time simulates a larger or more reflective space, whereas a shorter decay time simulates a smaller or less reverberant space. Dry/wet level balances the volume between the simulated reflected sound and the original non-reverberant sound.
As a final timbre manipulation, content creators can filter the ringtone’s frequencies by setting high-pass and low-pass cutoffs.

5.2.6 Volume.

Notification, accompaniment, and vocals are each routed to their own mixer channel, and the volume parameter of each channel is manipulated based on different settings. Track-specific attenuation decreases the volume of specific tracks (e. g., the vocals) while the ringtone is played. Content creators can toggle whether vocals and accompaniment are audible when the ringtone is played. We can further control the rate of the gradual fade-in of both the ringtone and the background music by setting a fade-in speed value. When toggled on, the fade-in speed slider (0.01 to 0.2 seconds) controls how long it takes until the ringtone reaches its full volume. The volume of the ringtone is controlled using sliders and can be generalized across different songs by enabling the adaptive volume setting, which boosts the volume of the ringtone when the music is loud.

6 Notification Elicitation Study

We first conducted a study investigating if our manipulation parameters enable the creation of tailored notification adaptations for diverse usage scenarios. We recruited six musicians and music enthusiasts to use MARingBA  to integrate two notifications into three songs for three scenarios. Each scenario required a different level of urgency in response, ideally leading to designs with a corresponding attentional draw or noticeability, as suggested in prior research [26]. The goals of this study were to elicit qualitative perspectives on our approach and to collect an initial example set of parameter values for different urgency scenarios (Table 1). The latter served as the basis for our second study with end users (section 7), which preliminarily examined its effects on user behavior.

6.1 Procedure

Participants were asked to create notification adaptation parameters for three scenarios with different levels of urgency:
Low urgency: A user receives a notification about the temperature on the following day.
Medium urgency: A user receives a reminder about an upcoming meeting in two hours.
High urgency: A user receives an email from their supervisor that requires an immediate response.
Participants authored parameters for two notifications for each scenario, resulting in six parameter sets per participant. Each parameter set had to integrate the target notification into three songs per the scenario requirements.
The notifications were randomly selected from six ringtones curated from popular mobile devices and applications (e. g.,  iPhone marimba ringtone, Skype). While we did not explicitly search for notifications that differed in character, the final set represented a range of tempos (130–180 bpm, M = 158 bpm, SD = 22) and keys (4 represented).
Similarly, the songs were randomly selected from a set of 12 popular songs with at least 52M views on YouTube, spanning a range of years (1967–2020) and genres (e. g.,  R&B, hip hop, alternative rock, funk). The songs also represented a diversity of tempos (85–161 bpm, M = 126 bpm, SD = 20) and keys (9 represented).
Participants first completed a consent form and a demographic questionnaire for our study. Following this, they were introduced to our system and the adaptation parameters and given time to experiment with the application controls to familiarize themselves. Afterward, the notification scenarios were introduced in a counter-balanced order, and participants were asked to create notification adaptation parameters accordingly. During the tasks, the participants were instructed to follow a think-aloud protocol facilitated by the experimenter. Upon completion of all tasks, we conducted a semi-structured interview to gather their insights on their (1) approach and experience, (2) impressions of the concept of adaptively embedding notifications in music, and (3) suggestions for additional parameters.

6.2 Participants and apparatus

We invited six participants from a local university (6 male, age: M = 26 years, SD = 1). All participants had substantial musical experience (M = 13 years, SD = 5). One participant is a freelance performer, composer, and instrument manufacturer. Two participants had experience DJing and composing electronic music. One participant is a part-time jazz pianist. Two self-reported as music hobbyists. Participants received a $30 gift card as compensation.
The study was conducted using our MARingBA system implemented in a Unity 2021 Editor running on a MacBook Pro (macOS Ventura 13.4, 2.4 GHz Quad-Core Intel Core i5, stereo speakers with high dynamic range). All sessions were audio and video recorded. We recorded all final parameters for each condition.

6.3 Qualitative Feedback

Overall, participants recognized the potential benefits of blending notifications with music, especially in low-urgency and medium-urgency scenarios. They emphasized the importance of achieving a balanced integration, blending the notifications well while ensuring they remain perceivable amidst the music. While exploring various parameters, participants reported that beat matching, key matching, fade-in, and volume control for notifications and music are the most prioritized factors for achieving the desired effect.
While timbre transfer, reverb, and low/high pass were also prioritized by multiple participants, they also expressed concern that these parameters may fail to generalize across different songs. While a certain instrument timbre blends well with the instrumentation of one song, it may create vastly different effects when integrated into another song, making it hard to control when defining settings for a certain level of urgency. Additionally, participants highlighted the significance of retaining the original timbre for notifications, particularly in high-urgency scenarios, as it was strongly associated with the familiarity of the ringtone.
The study demonstrated that our system provided sufficient coverage of parameters, allowing participants to fine-tune the integration of notifications based on different musical contexts and urgency levels. However, a few participants expressed challenges in generalizing parameters across different songs or sections of the same song, where tempo, key, instrumentation, and volume can vary significantly. This revealed a potential need for more adaptive parameters that can respond to various musical contexts.
Another noteworthy advantage of our system was its added choice and customization options. Before using MARingBA, many participants mentioned that they often turned off notifications altogether for low-urgency scenarios to avoid disruptions. With the blended-notification approach, however, participants were open to enabling notifications for low-urgency situations without the fear of being startled by disruptive sounds. We believe this points to a good balance between staying informed and preserving the listening experience, leading to increased overall satisfaction with the notification system.

7 Initial Usage Study

Results from the notification elicitation study suggest that MARingBA  shows promise in enabling the integration of notifications into music with varying degrees of urgency. To further understand the benefits and limitations of adapting notifications to a background musical context, as well as to preliminarily validate that different notification adaptations may have a tangible effect on end-user task performance and experience, we enlisted 12 participants to test notifications while performing a typing task. The notifications were adapted using the parameters obtained from our design study. Our study investigates the following research questions: (RQ1) to what extent do different adaptations modulate noticeability? (RQ2) to what extent do our notification adaptations affect a user’s task performance? (RQ3) what aspects do users like and dislike about our approach to adapting notifications?

7.1 Design

We used a single-variable within-subject design with four adaptation methods (standard, low urgency, medium urgency, high urgency). Inspired by previous research on interruptions (e. g.,  [19, 20]), we adopted a dual-task paradigm where participants performed a primary task of typing while listening and responding to audio notifications manipulated with our adaptation methods as a secondary task. Participants experienced notifications adjusted using each adaptation method twice, each time for a different notification and song (i. e., 4 adaptation methods × 2 notification-song pairs = 8 repetitions). Notifications and songs were randomly selected from the same pool as in the initial design study. We removed two songs, since one was shorter than 3 minutes and one had a section with very drastic tempo changes, which may lead to unexpected adaptation results. We counterbalanced the order of adaptation methods using a Latin square.

7.1.1 Adaptation method.

To generate the parameters (Table 1) for the low urgency, medium urgency, and high urgency adaptation method conditions, we used the mean of the parameter settings generated from our design study (i. e.,  from the six participants) for continuous values and the mode for categorical values. Note that the parameters do not necessarily represent optimal settings for each condition, but rather a first example set. For the standard condition, we set the notification to mute the background music following conventional delivery approaches (section 3.2).
| Parameter | Low Urgency | Medium Urgency | High Urgency | Standard |
|---|---|---|---|---|
| Beat-match | True (11) | True (8) | True (6) | False |
| Key-match | True (10) | True (10) | True (6) | False |
| Panning Weight | 22.71 (28.64) | 25.72 (20.94) | 35.47 (30.07) | 0 |
| Adaptive Volume Weight | 10.71 (10.54) | 12.25 (9.69) | 13.92 (11.46) | 0 |
| Ringtone Volume | 23.0 (7.31) | 26.09 (4.03) | 28.55 (1.83) | 30 |
| Timbre Transfer | Original (7, 1, 1, 2) | Original (7, 3, 0, 1) | Original (8, 2, 1, 0) | Original |
| Fade-In | True (9) | False (9) | False (7) | False |
| Fade-In Speed | 0.13 (0.08) | Not Used | Not Used | Not Used |
| Reverb Decay | 10.43 (6.21) | 8.1 (5.05) | 3.94 (3.00) | 0 |
| Reverb Dry/Wet | 476.18 (319.80) | 351.45 (246.20) | 141.36 (156.41) | 0 |
| Lowpass | 17994.91 (3717.25) | 17994.91 (3717.25) | 16364.91 (6037.04) | 20000 |
| Highpass | 411.0 (422.58) | 151.09 (245.90) | 82.91 (154.66) | 10 |
| Replace Vocal | False (6) | True (9) | True (10) | True |
| Replace Song | False (6) | False (6) | True (11) | True |
| Song Volume Decrease | Not Used | -8.88 (4.67) | -15.9 (4.54) | -30 |
| Schedule Mode | Next Four Bar (11, 0, 0) | Next Four Bar (5, 4, 2) | Next Beat (4, 3, 4) | Not Used |
Table 1: An initial example set of parameters for each delivery method, derived from our notification elicitation study (section 6) and used as delivery methods in the initial usage study (section 7). Interval values are summarized as M (SD). Categorical values are summarized with a majority vote (votes per category). The authors designed the parameter settings for standard notification delivery (column 5) to best replicate the volume reduction of commercial smartphones. The Timbre Transfer votes are listed in the following order: Original, Piano, Violin, Synth. The Schedule Mode votes are listed in the following order: Next Four Bar, Next Bar, Next Beat.

7.1.2 Primary task.

Participants were instructed to transcribe articles from Wikipedia as quickly and accurately as possible. The interface is shown in Figure 8. As visual feedback for the task, completed words are highlighted in green, typed letters of the current word are highlighted in yellow, and incorrectly typed characters turn the current letter red. Participants performed the typing task for three minutes in each condition.
Figure 8: Interface for the empirical study. Participants transcribed articles from Wikipedia. Correctly typed words are highlighted in green. The typed letters of the current word are highlighted in yellow. Upon noticing a notification, participants clicked the notification detected button on the top right.

7.1.3 Secondary task.

While participants performed the typing task, they were asked to monitor for audio notifications simultaneously. Participants were instructed to click a button (see Figure 8, top right) when they noticed a notification as soon as possible. The button temporarily turned black to provide visual feedback for the response. In each condition, three notifications were delivered at randomized intervals at least 40 seconds apart, with the earliest and latest appearance at 0:10 minutes and 2:50 minutes, respectively.
We designed our study for a scenario where end users listen to their personal music collection. We therefore asked all users to rank the familiarity of the twelve songs used in the first expert study and used songs that were familiar to them. We also made sure to cover a wide range of songs, and thus did not always select the most-familiar songs.

7.2 Procedure

After participants completed a consent form and a demographic questionnaire, they were introduced to the study tasks. Subsequently, they completed the study conditions, which were structured into two blocks. Each block consisted of four repetitions of the dual-task paradigm, each for a randomly selected song-notification pair. In each repetition, a different adaptation method was applied. Participants responded to questionnaires at the end of each condition. They also ranked the conditions by preference at the end of each block. After completing all conditions, participants reported their overall experience with the different notification delivery mechanisms.

7.3 Participants and apparatus

We recruited twelve participants from a local university and via convenience sampling (7 male, 5 female, age: M = 24.67 years, SD = 3.75). Participants listened to music daily (M = 3 hours, SD = 1) during activities like work (N = 10) and exercise (N = 8), as well as during their commute (N = 10). Participants received a $15 gift card as compensation.
We integrated the typing and notification response tasks into our MARingBA  system, implemented in Unity 2021. The study was conducted on a MacBook Pro (macOS Ventura 13.4, 2.4 GHz Quad-Core Intel Core i5) with a pair of AKG K240 Studio over-ear headphones.

7.4 Measures

We captured participants’ primary and secondary task performance and subjective experience as dependent variables.
Primary task performance: We measured typing errors, i. e., the number of incorrectly typed keys, and resumption lag, i. e., the time between notification detection and resuming a regular typing speed.
Secondary task performance: We measured reaction time as the elapsed time between the start of the notification and the participant’s response, as well as the number of missed notifications.
Self-reported metrics: At the end of each condition, participants reported their confidence and immediacy in detecting the presented notifications, and characterized the notifications in terms of their noticeability and distraction, all on a scale from 1 (low) to 7 (high). At the end of each block, participants provided a preference ranking of the adaptation methods.

7.5 Results

For effect analysis, ordinal data (questionnaire ratings and rankings) were analyzed using Friedman tests, with Wilcoxon signed-rank tests for post-hoc analysis when needed. Interval data (typing errors, resumption lag, reaction time, and number of missed notifications) were analyzed using a series of repeated-measures ANOVAs. In cases where the normality assumption was violated (Shapiro-Wilk test p  <  .05), we applied an Aligned Rank Transform (ART) before performing our analysis [52]. When needed, pairwise post-hoc tests (Bonferroni-adjusted p-values) were performed. For each variable, the participant was considered a random factor and the adaptation method a within-subject factor. The statistical analysis was performed in IBM SPSS Statistics 29.

7.5.1 Primary task performance.

We did not observe significant main effects on typing errors (p = 0.452) or resumption lag (p = 0.861). Participants made M = 46.94 typos, SD = 21.18, and exhibited a resumption lag of M = 8.68 s, SD = 1.90. Individual conditions were within ∼2% of the mean for typos and ∼4% of the mean for resumption lag.

7.5.2 Secondary task performance.

Questionnaire responses are shown in Figure 9. Across participants, a total of 288 notifications were delivered during the experiment. We found a main effect of adaptation method on reaction time (F(3, 33)  =  11.612, p  <  .001). On average, participants were significantly faster at responding to notifications in the standard (M = 16.29 s, SD = 3.69, p  <  0.001) and high urgency (M = 20.29 s, SD = 4.30, p  <  0.012) conditions compared to the low urgency condition (M = 34.33 s, SD = 2.314). Participants missed three notifications in the low urgency condition and one notification in the medium urgency condition. This highlights that all adaptation methods deliver notifications that are noticed, albeit with later reaction times for lower-urgency ones.
Figure 9: Participants’ subjective ratings across delivery methods on a scale from 1 (low) to 7 (high), and rankings of preference from 1st (most preferred) to 4th (least preferred). Horizontal bars indicate statistically significant differences (p < .05) between conditions.

7.5.3 Self-reported metrics.

We found a main effect of adaptation method on confidence (χ²(3)  =  14.234, p  =  0.003), immediacy (χ²(3)  =  24.786, p  <  0.001), noticeability (χ²(3)  =  24.677, p  <  0.001), distraction (χ²(3)  =  24.636, p  <  0.001), and ranking (χ²(3)  =  13.622, p  =  0.003).
Compared to the standard adaptation method, participants regarded both the low urgency and medium urgency adaptation methods as less noticeable (low urgency: p = 0.03, medium urgency: p = 0.042) and less distracting (low urgency: p = 0.012, medium urgency: p = 0.042). They also reported responding to the standard adaptation method more immediately than the low urgency adaptation method (p = 0.03). Compared to the high urgency adaptation method, participants regarded the low urgency adaptation method as less noticeable (p = 0.03) and less distracting (p = 0.03). They also reported responding to the high urgency adaptation method more immediately than the low urgency adaptation method (p = 0.03). Lastly, they reported preferring the medium urgency adaptation method over the standard adaptation method (p = 0.042).

7.5.4 Discussion.

Within the context of the songs and notifications we evaluated, participants were generally highly successful in detecting notifications. While this can partly be attributed to the fact that they were aware that notifications would happen (i. e., cued detection), it nevertheless confirms that even notifications designed for low urgency are generally noticeable. Participants’ recognition speed and subjective ratings on perceived noticeability largely align, with standard  and high urgency  being rated more noticeable than the other two conditions. The ratings on distraction are directly related to noticeability, with standard  and high urgency being perceived as more distracting. Qualitative comments, however, also point in a different direction, where low noticeability is also perceived as distracting, as noted by one participant “I liked the way that the earlier styles [low urgency] were woven into the music rather than causing the music to stop or grow very faint, but depending on the song this might be more annoying and could be rather hard to detect.” (P9) Participants generally preferred a balance between noticeable and distracting, seemingly best fulfilled by the medium-urgency condition, as reflected in comments such as “While the more loud ones definitely caught my attention, I think that the more in-between notification sounds were ideal.” (P5)
Participants could also identify, or at least appreciate, certain manipulations MARingBA  performed. One participant reflected on how they enjoyed volume adjustments, including track-specific volume attenuation and fading: “I like it when the notifications come in, the music volume will be slightly reduced, and the notification volume gradually increases. I don’t like the places where the notifications just kick in without easing.” (P7) Finally, even though some participants could not identify certain features of our approach, they perceived beat and key matching as desirable, as noted by one participant: “I do not know if it was an accident, but one of the notification sounds came up exactly on the same beat of the Happy song, and I felt that it was actually nice for the notification to enter the song in the same tempo.” (P12). These results suggest that our approach may enable a better balance between noticeability and distraction and that participants preferred our music-adaptive notifications over standard delivery.

8 Scenarios of Usage

In the following, we provide examples where we envision that music-adaptive notifications created with MARingBA provide a better experience than standard delivery methods. A key factor is that, with our system, audio notifications can be delivered to match varying levels of urgency.
Scenario 1: A user is jogging through the city, listening to the song “One” by U2. Their smartphone calendar app sends an audio reminder that they are meeting a friend in one hour. MARingBA modifies the timbre of the notification to sound like a piano, matching the song’s style, and aligns it with the song’s beat to integrate it better. Additionally, the notification plays at a lower volume since it does not require immediate attention.
Scenario 2: A user is deeply engrossed in their work, accompanied by a calming lofi hip hop playlist tailored for productive studying. Typically, they would mute all notifications to maintain their focus. However, since the user recently applied to multiple jobs, they might receive important text messages about interviews that require a timely response, a medium-urgency scenario. Nevertheless, they should have sufficient time to finish their current subtask before attending to the message. MARingBA blends the incoming audio notification of a text message seamlessly into the rhythmic drum beat of the lofi tunes. Slowly increasing in volume, the notification naturally captures the user’s attention. They address the message without disrupting their flow, underscoring how technology can facilitate a harmonious balance between productivity and responsiveness.
Scenario 3: The user and their friends groove to a continuous dance music mix at a lively party. Amidst the celebration, the user has an important task to attend to: taking the pizza out of the oven before it burns. To ensure they remember, they have set a timer. Instead of an abrupt alarm, the timer’s custom notification sound, recognizable only to the user, blends seamlessly into the dance mix. The user catches the subtle reminder, grabs the pizza, and rejoins the party without disrupting the rhythm of the dance floor.
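
The scenarios above all rely on mapping an urgency level to a bundle of adaptation parameters. The sketch below illustrates one possible encoding of such presets; the parameter names and values are hypothetical and do not correspond to the settings elicited in our studies.

```python
# Hypothetical urgency presets for a MARingBA-style system; parameter
# names and values are illustrative, not the elicited study settings.
from dataclasses import dataclass

@dataclass
class AdaptationPreset:
    gain_db: float        # notification volume relative to the music
    fade_in_beats: int    # ease-in length, in beats of the song
    beat_match: bool      # align the onset to the next (down)beat
    key_match: bool       # transpose the ringtone into the song's key
    timbre_match: bool    # re-synthesize with a song-appropriate timbre

PRESETS = {
    "low":    AdaptationPreset(-12.0, 4, True,  True,  True),
    "medium": AdaptationPreset(-6.0,  2, True,  True,  True),
    "high":   AdaptationPreset(0.0,   0, False, False, False),
}

def preset_for(urgency: str) -> AdaptationPreset:
    """Fall back to the most noticeable preset for unknown urgencies."""
    return PRESETS.get(urgency, PRESETS["high"])
```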

9 Discussion

We contribute a novel music-adaptive approach to delivering audio notifications such as ringtones. Our work explores the design space of possible audio manipulations such as beat matching, key matching, and timbre adjustments. We integrate those manipulations into MARingBA, a novel system that enables content creators to design adaptive audio notifications. We explore our system in an elicitation study with experts who designed a variety of ringtones for different songs. Insights from the study indicate that MARingBA enables them to explore the parameter space efficiently. We used the parameters obtained in the design study in a preliminary evaluation with end users. Results indicate that, in the context of the songs and notifications we tested, our example set of music-adaptive audio notifications provided users with noticeable signals with varying reaction times and was preferred over a standard delivery baseline. We believe that music-adaptive notifications are a feasible complement to, or even replacement for, current notification delivery methods. There remain, however, unexplored areas of potential challenges and opportunities that we hope to explore in the future.

9.1 Generalizability

In our studies, we generated and tested an initial set of parameters for integrating notifications into different musical contexts. While the studies indicate that the approach is promising for modulating the noticeability of audio notifications, future work remains to identify optimal parameters for various urgency contexts and to ensure our approach’s generalizability.
First, to obtain the parameters for the various urgency conditions in our second study, we computed the mean for continuous values and determined the mode for categorical values. This process yielded an initial set of parameters that appeared to have the desired effect on users in a preliminary evaluation. Whether the presented parameters are the right values for the various urgency conditions remains an open question. Additional empirical studies that test the parameter effects in a more controlled manner are needed to tease out their individual effects and interactions.
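
As an illustration of this aggregation step, the following sketch computes the mean for continuous parameters and the mode for categorical ones; the field names and values are illustrative, not the experts’ actual settings.

```python
# Sketch of the aggregation described above: mean for continuous
# parameters, mode for categorical ones. Field names are illustrative.
from statistics import mean, mode

expert_settings = [
    {"gain_db": -10.0, "timbre": "piano"},
    {"gain_db": -14.0, "timbre": "piano"},
    {"gain_db": -12.0, "timbre": "bell"},
]

aggregated = {
    "gain_db": mean(s["gain_db"] for s in expert_settings),
    "timbre": mode(s["timbre"] for s in expert_settings),
}
print(aggregated)  # {'gain_db': -12.0, 'timbre': 'piano'}
```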
Second, while we aimed to curate a diverse set of notifications and songs, further experimentation is needed to ensure the generalizability of different parameter settings. It is important to note that the songs we used in our evaluation followed conventional Western tuning, a consistent rhythm, and a standard key signature. Whether our findings and approach will generalize to songs that do not follow these conventions therefore remains unclear. To explore this challenge, we tested our system with input songs that included complex and uncommon harmony structures, key changes, and time signatures. MARingBA synchronized ringtones to the examples with complex harmony structures and key changes with moderate success; the notifications may still be perceived as out of place due to their specific musical qualities, such as instrument timbre and harmonic consistency. Additionally, if the time signature of the music is highly irregular, our approach cannot reliably extract the beat, which makes integration challenging and can lead to sub-optimal placement of notifications. Future work can address these challenges by developing better MIR approaches that detect unconventional time signatures and track harmony on a chord-by-chord basis. Allowing content creators to customize parameter settings for specific genres, intensities, and time signatures would also make our approach more broadly applicable.
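
To make the beat-extraction limitation concrete, the sketch below tracks beats with librosa [43] and flags songs whose inter-beat intervals vary strongly, a rough proxy for the irregular meters discussed above; the 10% spread threshold is an assumed heuristic rather than a parameter of our system.

```python
# Sketch: flag songs whose beat grid is too irregular for reliable
# beat matching. The 10% spread threshold is an assumed heuristic.
import librosa
import numpy as np

def beat_grid(path: str, max_spread: float = 0.10):
    y, sr = librosa.load(path, mono=True)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    intervals = np.diff(beat_times)
    if len(intervals) < 2:
        return None
    # Coefficient of variation of inter-beat intervals: large values
    # indicate tempo drift or an irregular meter.
    if intervals.std() / intervals.mean() > max_spread:
        return None  # fall back to non-beat-matched delivery
    return float(tempo), beat_times

result = beat_grid("song.wav")
print("regular grid" if result else "irregular; skip beat matching")
```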

9.2 Pre-processing of input music

Our current implementation requires pre-processing every input song, which takes approximately one minute per song. Since pre-processing only needs to occur once per song, this information can be made available to systems such as MARingBA in advance: an on-device music library can process the upcoming song, or a streaming service can ship the analysis results as metadata. This removes the pre-processing cost from real-time usage and makes the necessary information available within seconds. The amount of metadata required for each song varies with its tempo and duration, but we estimate it at less than 10 kB per song; it includes information such as beat and downbeat onsets, tempo, key, and panning. If source-separated stems are used, their size will be similar to that of the original song.
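
For illustration, such per-song metadata could be serialized as a small JSON blob along the following lines; the exact schema is an assumption based on the fields listed above.

```python
# Hypothetical per-song metadata schema covering the fields listed
# above (beat/downbeat onsets, tempo, key, panning); sizes vary with
# song length but stay in the low-kilobyte range.
import json

metadata = {
    "tempo_bpm": 120.0,
    "key": "C# minor",
    "beat_onsets_s": [0.50, 1.00, 1.50, 2.00],   # full list in practice
    "downbeat_onsets_s": [0.50, 2.50],
    "panning": {"vocals": 0.0, "drums": -0.2, "other": 0.1},
}

blob = json.dumps(metadata).encode("utf-8")
# A 4-minute song at 120 BPM has ~480 beats; at a few bytes per onset
# this stays well under the 10 kB estimate above.
print(f"{len(blob)} bytes for this toy example")
```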

9.3 Additional parameters

We provide an initial exploration of the design space of adaptive sound notifications and hope that our work serves as a foundation for future work. While the parameters we have explored offer significant coverage, there is still room for more: for instance, we have yet to investigate common audio effects used by music producers, such as delay, distortion, and chorus, or manipulations that change the ringtone to match the music’s genre or alter the ringtone’s original melody. We plan to dive deeper into this design space in the future and examine the benefits and limitations of the individual parameters.
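
To indicate how lightweight such effects are, the sketch below implements a basic feedback delay; the delay time, feedback, and mix values are arbitrary placeholders rather than settings from our system.

```python
# Sketch of a feedback delay, one of the producer effects mentioned
# above; delay time, feedback, and mix are arbitrary placeholders.
import numpy as np

def feedback_delay(x: np.ndarray, sr: int, delay_s: float = 0.25,
                   feedback: float = 0.4, mix: float = 0.5) -> np.ndarray:
    d = int(delay_s * sr)            # delay length in samples
    y = np.copy(x).astype(np.float64)
    for n in range(d, len(y)):
        y[n] += feedback * y[n - d]  # recirculate the delayed signal
    out = (1 - mix) * x + mix * y
    return out / max(np.max(np.abs(out)), 1e-9)  # avoid clipping

sr = 44100
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone
wet = feedback_delay(tone, sr)
```

Tying delay_s to the song’s beat period (60 / tempo in BPM) would keep the echoes on the musical grid, consistent with the beat matching MARingBA already performs.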

9.4 Personalization

The settings we tested and the parameters of the audio notifications that the experts in the design study created worked well across the different songs and led to desirable task performance. Low-urgency notifications were perceived later than high-urgency ones and were perceived as less distracting. In the qualitative results, however, we saw that several participants preferred the low-urgency notifications, whereas others wanted more salient signals like the medium-urgency settings. This hints at opportunities for personalization. Similar to current ringtones on phones, we believe that future versions of our approach should allow users to personalize their settings, giving end users agency in the design of their notifications. We are eager to explore the granularity of such personalization, from a single “strength” parameter to giving users control over individual parameters such as timbre or key matching.
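
As a sketch of the single-value end of this spectrum, one could interpolate all parameters between a least-salient and a most-salient preset; the parameter names and endpoint values below are hypothetical.

```python
# Sketch: derive all adaptation parameters from one "strength" slider
# by interpolating between two endpoint presets. Parameter names and
# endpoint values are hypothetical.
LOW  = {"gain_db": -12.0, "fade_in_beats": 4, "beat_match": 1.0}
HIGH = {"gain_db":   0.0, "fade_in_beats": 0, "beat_match": 0.0}

def personalize(strength: float) -> dict:
    """strength in [0, 1]: 0 = least salient, 1 = most salient."""
    s = min(max(strength, 0.0), 1.0)
    blended = {k: (1 - s) * LOW[k] + s * HIGH[k] for k in LOW}
    # Boolean-ish parameters are thresholded rather than interpolated.
    blended["beat_match"] = blended["beat_match"] >= 0.5
    blended["fade_in_beats"] = round(blended["fade_in_beats"])
    return blended

print(personalize(0.3))
# {'gain_db': -8.4, 'fade_in_beats': 3, 'beat_match': True}
```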

9.5 Multi-modal notifications

Our current approach focuses on audio notifications. Current interactive systems, however, deliver content and information through a wide range of modalities, from visual notifications in desktop or Mixed Reality settings, to haptic notifications through smartphone vibrations, to smell and taste. We plan to pair music-adaptive audio adaptations with other modalities in the future to investigate how well those can be integrated into a multi-modal delivery mechanism. Additionally, we hope to explore applying our adaptive approach to other modalities. One could easily imagine an approach where the vibration of a haptic notification for a phone call is synchronized to the music that a user is currently listening to, or where visual notifications in Mixed Reality appear in a style that matches the musical genre. We are excited to explore those combinations in the future and find out which modalities are best suited for musical adaptations that are less disruptive, or that potentially enhance the music listening experience for users.
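
As a minimal sketch of the haptic idea, the function below converts upcoming beat times into an off/on vibration pattern in milliseconds (the convention used, e.g., by Android’s vibrator APIs); the pulse width and number of pulses are assumed values.

```python
# Sketch: turn upcoming beat times into an off/on vibration pattern in
# milliseconds. The 80 ms pulse width is an assumed value.
def beat_vibration_pattern(beat_times_s, now_s, pulse_ms=80, n_pulses=4):
    upcoming = [t for t in beat_times_s if t > now_s][:n_pulses]
    pattern, cursor = [], now_s * 1000.0
    for t in upcoming:
        onset_ms = t * 1000.0
        pattern.append(max(int(onset_ms - cursor), 0))  # wait
        pattern.append(pulse_ms)                        # vibrate
        cursor = onset_ms + pulse_ms
    return pattern

# Beats every 0.5 s (120 BPM), call arrives at t = 0.1 s:
print(beat_vibration_pattern([0.5, 1.0, 1.5, 2.0], 0.1))
# -> [400, 80, 420, 80, 420, 80, 420, 80]
```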

9.6 In-the-wild studies

We evaluated our approach with end users in a highly controlled environment, where we controlled the space, the task, and the music participants were listening to. We hope to expand our evaluation to more users and more contexts in the future. Other tasks (e.g., running, shopping) and spaces (e.g., indoor, outdoor) will inevitably influence participants’ ability to detect audio notifications such as ringtones. Additionally, we believe that the question of notification scheduling merits further investigation. For example, delaying notifications to opportune moments in later sections of the song, or even the next song, might be desirable in scenarios of focused work to minimize switching costs. In other scenarios, such as phone calls, delivery cannot be delayed drastically, since the caller might hang up. We hope to explore this aspect in the future. Finally, we plan to expand our music-adaptive approach with physiological sensing [22] to balance noticeability, urgency, and distraction. Our music-adaptive content delivery is a first step towards context-aware delivery of audio notifications that are noticeable, not distracting, and ultimately beneficial for end users.
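
The scheduling idea can be prototyped directly on a pre-computed beat grid: given a notification’s arrival time and an urgency-dependent maximum delay, deliver on the next downbeat within that window and fall back to immediate delivery otherwise. The window lengths below are assumptions, not empirically derived values.

```python
# Sketch of urgency-aware scheduling: delay a notification to the next
# downbeat, but never beyond an urgency-dependent deadline. The window
# lengths are assumed values.
import bisect

MAX_DELAY_S = {"low": 30.0, "medium": 8.0, "high": 0.5}

def schedule(arrival_s: float, downbeats_s: list[float],
             urgency: str) -> float:
    deadline = arrival_s + MAX_DELAY_S.get(urgency, 0.0)
    i = bisect.bisect_right(downbeats_s, arrival_s)
    if i < len(downbeats_s) and downbeats_s[i] <= deadline:
        return downbeats_s[i]   # land on the next downbeat
    return arrival_s            # deliver immediately (e.g., phone call)

downbeats = [2.0, 4.0, 6.0, 8.0]
print(schedule(3.1, downbeats, "medium"))  # -> 4.0
print(schedule(3.1, downbeats, "high"))    # -> 3.1
```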

10 Conclusion

We contribute a novel approach to creating music-adaptive audio notifications by blending them into songs, leveraging a range of parameters such as beat, key, and timbre. An expert study confirms that our approach is valuable for designers of notifications. A preliminary evaluation demonstrates that an initial set of parameters derived from experts was able to modulate noticeability and perceived distraction and was preferred by end users. We believe that music-adaptive audio notifications have the potential to complement or replace current standard delivery methods, such as volume fading, and to provide users with timely access to information without being disruptive. We have started exploring a large parameter space that can be used for a range of application scenarios. Our work lays the groundwork for future context-aware interactive systems that adapt audio notifications and non-visual digital content based on users’ current music, surroundings, and tasks.

Acknowledgments

We thank all involved peers, participants, and anonymous reviewers for their support. This work was partially supported by the Croucher Foundation.

Supplemental Material

MP4 File - Video Preview (with transcript)
MP4 File - Video Presentation (with transcript)

References

[1]
Piotr D. Adamczyk and Brian P. Bailey. 2004. If Not Now, When? The Effects of Interruption at Different Moments within Task Execution. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Vienna, Austria) (CHI ’04). Association for Computing Machinery, New York, NY, USA, 271–278. https://doi.org/10.1145/985692.985727
[2]
Piotr D Adamczyk and Brian P Bailey. 2004. If not now, when? The effects of interruption at different moments within task execution. In Proceedings of the SIGCHI conference on Human factors in computing systems. Association for Computing Machinery, New York, NY, USA, 271–278.
[3]
Algoriddim. 2021. Automix AI - The Most Advanced Automatic Music Mixing. https://www.youtube.com/watch?v=0kDSpkkaar8
[4]
Algoriddim. 2023. Algoriddim. https://www.algoriddim.com/, Last accessed on 2023-12-10.
[5]
Ishwarya Ananthabhotla and Joseph A. Paradiso. 2018. SoundSignaling: Realtime, Stylistic Modification of a Personal Music Corpus for Information Delivery. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2, 4, Article 154 (dec 2018), 23 pages. https://doi.org/10.1145/3287032
[6]
E. Arroyo, T. Selker, and A. Stouffs. 2002. Interruptions as multimodal outputs: which are the less disruptive?. In Proceedings. Fourth IEEE International Conference on Multimodal Interfaces. IEEE, 479–482. https://doi.org/10.1109/ICMI.2002.1167043
[7]
Brian P. Bailey and Joseph A. Konstan. 2006. On the need for attention-aware systems: Measuring effects of interruption on task performance, error rate, and affective state. Computers in Human Behavior 22, 4 (2006), 685–708. https://doi.org/10.1016/j.chb.2005.12.009 Attention aware systems.
[8]
Brian P Bailey, Joseph A Konstan, and John V Carlis. 2001. The Effects of Interruptions on Task Performance, Annoyance, and Anxiety in the User Interface. In IFIP TC13 International Conference on Human-Computer Interaction, Vol. 1. 593–601. https://api.semanticscholar.org/CorpusID:10423207
[9]
Luke Barrington, Michael J. Lyons, Dominique Diegmann, and Shinji Abe. 2006. Ambient Display Using Musical Effects. In Proceedings of the 11th International Conference on Intelligent User Interfaces (Sydney, Australia) (IUI ’06). Association for Computing Machinery, New York, NY, USA, 372–374. https://doi.org/10.1145/1111449.1111541
[10]
Alfred Blatter. 2016. Revisiting music theory: basic principles. Taylor & Francis.
[11]
Meera M. Blattner, Denise A. Sumikawa, and Robert M. Greenberg. 1989. Earcons and Icons: Their Structure and Common Design Principles. Human–Computer Interaction 4, 1 (1989), 11–44. https://doi.org/10.1207/s15327051hci0401_1
[12]
Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. 2016. Madmom: A New Python Audio and Music Signal Processing Library. In Proceedings of the 24th ACM International Conference on Multimedia (Amsterdam, The Netherlands) (MM ’16). Association for Computing Machinery, New York, NY, USA, 1174–1178. https://doi.org/10.1145/2964284.2973795
[13]
Deborah A. Boehm-Davis and Roger Remington. 2009. Reducing the disruptive effects of interruption: A cognitive framework for analysing the costs and benefits of intervention strategies. Accident Analysis & Prevention 41, 5 (2009), 1124–1129. https://doi.org/10.1016/j.aap.2009.06.029
[14]
Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, Oscar Mayor, Gerard Roma, Justin Salamon, José Zapata, and Xavier Serra. 2013. ESSENTIA: An Open-Source Library for Sound and Music Analysis. In Proceedings of the 21st ACM International Conference on Multimedia (Barcelona, Spain) (MM ’13). Association for Computing Machinery, New York, NY, USA, 855–858. https://doi.org/10.1145/2502081.2502229
[15]
Stephen Brewster. 2007. Nonspeech auditory output. In The human-computer interaction handbook. CRC Press, 273–290.
[16]
Andreas Butz and Ralf Jung. 2005. Seamless User Notification in Ambient Soundscapes. In Proceedings of the 10th International Conference on Intelligent User Interfaces (San Diego, California, USA) (IUI ’05). Association for Computing Machinery, New York, NY, USA, 320–322. https://doi.org/10.1145/1040830.1040914
[17]
Nick Collins. 2010. Introduction to computer music. John Wiley & Sons.
[18]
Fulvio Corno, Luigi De Russis, and Teodoro Montanaro. 2017. XDN: Cross-Device Framework for Custom Notifications Management. In Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems (Lisbon, Portugal) (EICS ’17). Association for Computing Machinery, New York, NY, USA, 57–62. https://doi.org/10.1145/3102113.3102127
[19]
Edward B. Cutrell, Mary Czerwinski, and Eric Horvitz. 2000. Effects of Instant Messaging Interruptions on Computing Tasks. In CHI ’00 Extended Abstracts on Human Factors in Computing Systems (The Hague, The Netherlands) (CHI EA ’00). Association for Computing Machinery, New York, NY, USA, 99–100. https://doi.org/10.1145/633292.633351
[20]
Mary Czerwinski, Edward Cutrell, and Eric Horvitz. 2000. Instant messaging and interruption: Influence of task type on performance. In OZCHI 2000 conference proceedings, Vol. 356. 361–367.
[21]
Matthew E. P. Davies, Philippe Hamel, Kazuyoshi Yoshii, and Masataka Goto. 2014. AutoMashUpper: Automatic Creation of Multi-Song Music Mashups. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 12 (2014), 1726–1737. https://doi.org/10.1109/TASLP.2014.2347135
[22]
Pascal E Fortin, Elisabeth Sulmont, and Jeremy Cooperstock. 2019. Detecting perception of smartphone notifications using skin conductance responses. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, New York, NY, USA, 1–9. https://doi.org/10.1145/3290605.3300420
[23]
Stavros Garzonis, Simon Jones, Tim Jay, and Eamonn O’Neill. 2009. Auditory icon and earcon mobile service notifications: intuitiveness, learnability, memorability and preference. In Proceedings of the SIGCHI conference on human factors in computing systems. Association for Computing Machinery, New York, NY, USA, 1513–1522.
[24]
William W. Gaver. 1993. Synthesizing Auditory Icons. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems (Amsterdam, The Netherlands) (CHI ’93). Association for Computing Machinery, New York, NY, USA, 228–235. https://doi.org/10.1145/169059.169184
[25]
Sarthak Ghosh, Lauren Winston, Nishant Panchal, Philippe Kimura-Thollander, Jeff Hotnog, Douglas Cheong, Gabriel Reyes, and Gregory D. Abowd. 2018. NotifiVR: Exploring Interruptions and Notifications in Virtual Reality. IEEE Transactions on Visualization and Computer Graphics 24, 4 (2018), 1447–1456. https://doi.org/10.1109/TVCG.2018.2793698
[26]
Jennifer Gluck, Andrea Bunt, and Joanna McGrenere. 2007. Matching Attentional Draw with Utility in Interruption. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (San Jose, California, USA) (CHI ’07). Association for Computing Machinery, New York, NY, USA, 41–50. https://doi.org/10.1145/1240624.1240631
[27]
Robert Graham. 1999. Use of auditory icons as emergency warnings: evaluation within a vehicle collision avoidance application. Ergonomics 42, 9 (1999), 1233–1248.
[28]
Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam. 2020. Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software 5, 50 (2020), 2154.
[29]
Edward Cutrell, Mary Czerwinski, and Eric Horvitz. 2001. Notification, disruption, and memory: Effects of messaging interruptions on memory and performance. In Human-Computer Interaction: INTERACT, Vol. 1. 263.
[30]
Scott Hudson, James Fogarty, Christopher Atkeson, Daniel Avrahami, Jodi Forlizzi, Sara Kiesler, Johnny Lee, and Jie Yang. 2003. Predicting Human Interruptibility with Sensors: A Wizard of Oz Feasibility Study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Ft. Lauderdale, Florida, USA) (CHI ’03). ACM, New York, NY, USA, 257–264. https://doi.org/10.1145/642611.642657
[31]
Image-Line. 2023. FL Studio. https://www.image-line.com/, Last accessed on 2023-12-10.
[32]
Shamsi T. Iqbal and Brian P. Bailey. 2006. Leveraging Characteristics of Task Structure to Predict the Cost of Interruption. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Montréal, Québec, Canada) (CHI ’06). Association for Computing Machinery, New York, NY, USA, 741–750. https://doi.org/10.1145/1124772.1124882
[33]
Shamsi T. Iqbal and Brian P. Bailey. 2011. Oasis: A Framework for Linking Notification Delivery to the Perceptual Structure of Goal-Directed Tasks. ACM Trans. Comput.-Hum. Interact. 17, 4, Article 15 (dec 2011), 28 pages. https://doi.org/10.1145/1879831.1879833
[34]
Hiromi Ishizaki, Keiichiro Hoashi, and Yasuhiro Takishima. 2009. Full-Automatic DJ Mixing System with Optimal Tempo Adjustment based on Measurement Function of User Discomfort. In Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR 2009. 135–140. https://api.semanticscholar.org/CorpusID:6179832
[35]
Ralf Jung. 2008. Ambience for auditory displays: Embedded musical instruments as peripheral audio cues. In Proc. ICAD.
[36]
Mohamed Kari, Tobias Grosse-Puppendahl, Alexander Jagaciak, David Bethge, Reinhard Schütte, and Christian Holz. 2021. SoundsRide: Affordance-synchronized music mixing for in-car audio augmented reality. In The 34th Annual ACM Symposium on User Interface Software and Technology (Virtual Event USA). ACM, New York, NY, USA. https://doi.org/10.1145/3472749.3474739
[37]
Thomas Kubitza, Alexandra Voit, Dominik Weber, and Albrecht Schmidt. 2016. An IoT Infrastructure for Ubiquitous Notifications in Intelligent Living Environments. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct (Heidelberg, Germany) (UbiComp ’16). Association for Computing Machinery, New York, NY, USA, 1536–1541. https://doi.org/10.1145/2968219.2968545
[38]
Uichin Lee, Joonwon Lee, Minsam Ko, Changhun Lee, Yuhwan Kim, Subin Yang, Koji Yatani, Gahgene Gweon, Kyong-Mee Chung, and Junehwa Song. 2014. Hooked on Smartphones: An Exploratory Study on Smartphone Overuse among College Students. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Toronto, Ontario, Canada) (CHI ’14). Association for Computing Machinery, New York, NY, USA, 2327–2336. https://doi.org/10.1145/2556288.2557366
[39]
Paul MC Lemmens, Myra P Bussemakers, and Abraham De Haan. 2001. Effects of auditory icons and earcons on visual categorization: the bigger picture. In Proceedings of the International Conference on Auditory Display. 117–125.
[40]
Aristides Mairena, Carl Gutwin, and Andy Cockburn. 2019. Peripheral Notifications in Large Displays: Effects of Feature Combination and Task Interference. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300870
[41]
Tara Matthews, Anind K. Dey, Jennifer Mankoff, Scott Carter, and Tye Rattenbury. 2004. A Toolkit for Managing User Attention in Peripheral Displays. In Proceedings of the 17th Annual ACM Symposium on User Interface Software and Technology (Santa Fe, NM, USA) (UIST ’04). Association for Computing Machinery, New York, NY, USA, 247–256. https://doi.org/10.1145/1029632.1029676
[42]
D. Scott McCrickard and C. M. Chewar. 2003. Attuning Notification Design to User Goals and Attention Costs. Commun. ACM 46, 3 (mar 2003), 67–72. https://doi.org/10.1145/636772.636800
[43]
Brian McFee, Colin Raffel, Dawen Liang, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference (SciPy 2015).
[44]
Abhinav Mehrotra, Veljko Pejovic, Jo Vermeulen, Robert Hendley, and Mirco Musolesi. 2016. My phone and me: understanding people’s receptivity to mobile notifications. In Proceedings of the 2016 CHI conference on human factors in computing systems. 1021–1032.
[45]
Philipp Müller, Sander Staal, Mihai Bâce, and Andreas Bulling. 2022. Designing for Noticeability: Understanding the Impact of Visual Importance on Desktop Notifications. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 472, 13 pages. https://doi.org/10.1145/3491102.3501954
[46]
Richard W. Obermayer and William A. Nugent. 2000. Human-computer interaction for alert warning and attention allocation systems of the multimodal watchstation. In Integrated Command Environments, Patricia Hamburger (Ed.). Vol. 4126. International Society for Optics and Photonics, SPIE, 14 – 22. https://doi.org/10.1117/12.407536
[47]
Martin Pielot, Karen Church, and Rodrigo de Oliveira. 2014. An In-Situ Study of Mobile Phone Notifications. In Proceedings of the 16th International Conference on Human-Computer Interaction with Mobile Devices & Services (Toronto, ON, Canada) (MobileHCI ’14). Association for Computing Machinery, New York, NY, USA, 233–242. https://doi.org/10.1145/2628363.2628364
[48]
Marco A Martinez Ramirez, Weihsiang Liao, Chihiro Nagashima, Giorgio Fabbro, Stefan Uhlich, and Yuki Mitsufuji. 2022. Automatic music mixing with deep learning and out-of-domain data. In Proceedings of the 23rd International Society for Music Information Retrieval Conference. ISMIR, Bengaluru, India, 411–418. https://doi.org/10.5281/zenodo.7316688
[49]
VirtualDJ. 2023. VirtualDJ - User Manual - Interface - Browser - SideView - Automix. https://www.virtualdj.com/manuals/virtualdj/interface/browser/sideview/automix.html, Last accessed on 2023-12-10.
[50]
Alexandra Voit, Tonja Machulla, Dominik Weber, Valentin Schwind, Stefan Schneegass, and Niels Henze. 2016. Exploring Notifications in Smart Home Environments. In Proceedings of the 18th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct (Florence, Italy) (MobileHCI ’16). Association for Computing Machinery, New York, NY, USA, 942–947. https://doi.org/10.1145/2957265.2962661
[51]
Dominik Weber, Alireza Sahami Shirazi, and Niels Henze. 2015. Towards Smart Notifications Using Research in the Large. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct (Copenhagen, Denmark) (MobileHCI ’15). Association for Computing Machinery, New York, NY, USA, 1117–1122. https://doi.org/10.1145/2786567.2794334
[52]
Jacob O. Wobbrock, Leah Findlater, Darren Gergle, and James J. Higgins. 2011. The Aligned Rank Transform for Nonparametric Factorial Analyses Using Only Anova Procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Vancouver, BC, Canada) (CHI ’11). Association for Computing Machinery, New York, NY, USA, 143–146. https://doi.org/10.1145/1978942.1978963
[53]
Jing Yang, Tristan Cinquin, and Gábor Sörös. 2021. Unsupervised Musical Timbre Transfer for Notification Sounds. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 3735–3739. https://doi.org/10.1109/ICASSP39728.2021.9414760
[54]
Jing Yang and Andreas Roth. 2021. Musical Features Modification for Less Intrusive Delivery of Popular Notification Sounds. Proceedings of the 26th International Conference on Auditory Display (ICAD 2021) (2021). https://api.semanticscholar.org/CorpusID:236204585
