1. Introduction

Visual aspects of music performances are often important. In live concerts, performers use various kinds of body movements to express their emotions and to impress audiences (; ). In music ensembles, visual interactions among musicians are important for coordinating timing and dynamics. In pop music, creative visual performances give artists a substantial competitive advantage. The inclusion of videos in music albums has been shown to provide an eight-percent boost, on average, in purchase intent and improved perception (as measured by Nielsen Holdings). Even in prestigious classical music performances, research has shown that the body movements and facial expressions of performers exert strong influences on judgments of performance quality, for expert and novice audiences alike ().

On the technical side, the rapid expansion of digital storage and Internet bandwidth in the past decades has not only popularized video streaming services like YouTube but also significantly changed the way people enjoy music. With the surge of Virtual Reality (VR) and Augmented Reality (AR) technologies and their adoption in music entertainment, visual aspects of music performances will further gain importance in innovative music enjoyment experiences.

While Music Information Retrieval (MIR) based on the audio signal and symbolic score (e.g., MIDI) has been widely studied, only limited explorations have been conducted on the interplay of visual and acoustic aspects of music performances. The auditory and visual modalities are intimately related in music performances. Sounds from acoustic instruments are invariably mediated by the instrument players’ movements and characteristics of the movements are reflected in the resulting sounds. For example, the amplitude envelope and spectral evolution of a violin note are directly related to the velocity and pressure of a bowing motion () and fingering force (); the timing of a clarinet note is often correlated to the fingering movements; the loudness of a drum hit is strongly related to the drumstick’s preparatory height and striking velocity (). These characteristics have been utilized to solve traditional MIR problems such as multi-pitch analysis (), music transcription (), score alignment (), and source separation (). An overview of related literature is available ().

Classical chamber music is performed by a small ensemble of instrumentalists, with one player per score part (). In this paper, we study the relationship between the instrument players’ body movements and sound events in classical chamber ensemble performances. The aim is to solve the source association problem, i.e., identifying the bijection between score parts (MIDI or MusicXML format) and players in the video. The bijection, together with a score-informed audio source separation technique (), can allow users to separate the audio source for each particular player in the video.

Exploiting information in the video about instrument players’ movements for source association is challenging because many body movements (e.g., head movement) are irrelevant to sound articulation () and relevant movements (e.g., maneuvers with the fingers) can be subtle. In music ensembles, similar body movements can be observed among different musicians when they have similar rhythmic patterns. These challenges are especially pronounced when the video clip is short (e.g., from online streams) and when the ensemble is large. For a quintet, possible associations can be enumerated as 120 permutations, but only one is correct.

Source association enables novel research and applications. It is essential for leveraging the visual information to analyze individual sound sources in music performances. The related techniques include multi-pitch analysis (), performance expressiveness analysis (), and source separation (). By exploiting source associations, one can envision an augmented video streaming service that allows users to click on a player in the video and isolate or enhance the corresponding source of the audio (). Based on SLAVE (), a music exploration system that manages multimedia music collections, one can envision an augmented sheet music display interface where for each score part, the visual performance of the specific player is retrieved and demonstrated. For music production, source association can help enable remixing of audio sources along with automatic video scene recomposition. An online source association system, which does not need to “look into the future”, can be further useful in online video streaming of live concerts. For example, it enables an auto-whirling camcorder to focus on the soloist.

In this paper, we build upon our previous work on source association for string instruments using bowing motion () and vibrato motion (), and propose the first universal system to address the problem for common melodic instruments in Western chamber ensembles, including string, woodwind, and brass instruments (barring polyphonic instruments such as piano and harp). This system does not require prior knowledge of the instrumentation of the piece or pre-training of audio-visual correspondence. The system input is the audio and video of the performance and the corresponding music score in a pianoroll representation, and the output is the association between audio or score tracks and players in the video, assuming audio and video tracks are synchronized and audio and score tracks are associated. After temporally aligning the score with the live performance from auditory cues, the system uses three modules to analyze different visual motion types that may be present in the performance, as shown in Figure 1. Because many performed motions are related to note onsets, the first two modules focus on the motion-onset correspondence. The first module extracts large-scale body motion, which mainly captures the bowing motion of string instruments. The second module extracts subtle fingering motion and correlates it with note onsets. This correlation aids associations for woodwind and brass instruments, as pitch changes are mostly controlled by finger-operated keys. In addition to note onsets, variations of acoustic features over the course of certain tone articulations also show correspondence with specific motions, for instance, the vibrato articulation in string instruments. Therefore, the third module is designed to detect periodic fingering motion (if any) and to correlate it with the periodic pitch fluctuation estimated from the audio. This module is primarily directed at string instruments, where vibrato articulations can be characterized from the visual modality. Note that the first and third modules are adapted from previously proposed systems by Li et al. (, ), respectively, and the second module is proposed in this paper as the first solution for woodwind and brass instruments. Finally, we also propose to integrate the output of the three modules through weighted voting according to motion salience. It is noted that the system does not need to detect the instrument type; it simply extracts the three kinds of motion (if any) for each player and jointly integrates their correspondences with the score and audio tracks.

Figure 1 

Outline of the proposed universal source association system for chamber ensemble performances. Three types of motion are modeled and correlated with the audio and score events in three modules.

The proposed system works in an online fashion: the audio-score alignment, the correlation between motion and audio or score, and the association output are all updated frame by frame without “looking into the future”. Associations in each frame are updated using the Hungarian algorithm (), with minimal computational cost. Experiments on 17,574 audio-visual clips generated from 44 chamber music pieces in the URMP dataset (), which spans a polyphony range from duets to quintets, show that: 1) Different modules are helpful for different instruments, and the system is able to integrate them automatically to achieve a high overall accuracy; 2) Accuracy increases as longer video streams become available, reaching an average accuracy of 90% for 5-second video excerpts of string instruments and for 30-second excerpts of woodwind and brass instruments. In summary, the proposed system for audio-visual source association:

  • works universally for all instruments common in Western chamber ensemble performances,
  • does not require prior knowledge of instrumentation, and
  • relies purely on motion information for association without modeling instrument characteristics, which allows it to also work for ensembles of the same instrument type, e.g., violin duets.

In the following, we first review existing work on multi-modal modeling in Section 2, and highlight challenges involved in source association in music performances. We then describe our proposed method in three modules for the different motion cues for associations in Section 3. In Section 4, we conduct systematic experiments to evaluate the proposed system. Finally, we conclude the paper in Section 5.

2. Related Work

2.1. Source Localization

When there is at most one active sound source at a time, the problem of audio-visual source association is also known as source localization, i.e., indicating the location of the sound source in the video. For audio-visual speech, source localization is helpful for speaker face segmentation (). Early work on speaker localization correlates audio energy changes with pixel motion via non-linear diffusion () or with semantic regions via video segmentation and tracking (). Other methods include time-delayed neural networks (), probabilistic multi-modal generative models (), and Canonical Correlation Analysis (CCA) (; ).

More recent work proposes to localize semantic objects in unconstrained videos by learning deep multi-modal representations. Owens and Efros () propose a fused multi-sensory network to learn an audio-visual representation, which further localizes the sound objects on the video frames. Senocak et al. () employ a similar two-stream network structure, where an attention mechanism is developed for sound source localization. A similar idea is adopted by Arandjelović and Zisserman () for cross-modal retrieval and source localization, and by Tian et al. () for both spatial and temporal localization.

2.2. Source Association for Separation

Other work deals with mixtures of active sources, where cross-modal association can be applied to isolate sounds that correspond to each visual object. Barzelay and Schechner (, ) detect drastic changes (i.e., onsets of events) in audio and video and then use their coincidence to associate audio-visual components that belong to the same source of harmonic sounds. Sigg et al. () reformulate CCA by introducing non-negativity and sparsity constraints on the coefficients of the projection directions to locate and separate sound sources in movies. In (), the auditory and visual modalities are decomposed into relevant structures using redundant representations for source localization. Segments where only one source is active are used to learn a timbre model for the separation of that source. Ephrat et al. () propose a deep network-based model to isolate a single speech signal from a mixture of sounds given the target speaker in the video. Gao et al. () map audio frequency bases to individual visual objects via an audio-visual object model, which further guides audio source separation. Most of these methods, however, either deal with mixtures with at most two active sources or only focus on isolating one source from multiple active sources (e.g., background noises). The association problem for each individual source is not addressed.

2.3. Source Association for Chamber Ensembles

The source association problem for music ensembles is more challenging since all the available sound sources (the players) are active almost all the time, and the difficulty increases dramatically as the number of sources increases. Although each score part is performed by one player in chamber music, the same kind of instrument is often used for different score parts (e.g., a violin duet). Therefore, approaches aiming at learning deep representations that map acoustic features to visual appearances to localize each source (; ; ) are not applicable. Instead, one needs to recognize the distinct motion of different players and correlate it with the music content to achieve association.

Bazzica et al. () first propose to distinguish playing from non-playing conditions for each player in an orchestra, which are compared with each score part to solve the temporal alignment. In our previous work (), we propose an approach to solving the association problem for string ensembles with up to five simultaneously active sources in a score-informed fashion. The approach analyzes the bowing motion and correlates it with note onsets in the score parts. The assumptions are that many note onsets correspond to the beginning of bowing strokes and that different instrumental parts often have different rhythmic patterns. When these assumptions are invalid, for example, when multiple notes are played within a single bow stroke (i.e., legato bowing) or when different parts show a similar rhythmic pattern, the approach becomes less robust. Later we propose a complementary approach () that correlates the rolling motion of the fingering hand with pitch fluctuations of vibrato notes for the association of string instruments. However, the method only works when vibrato notes are played. To the best of our knowledge, there is neither existing work on integrating bowing motion and vibrato motion for source association for string instruments, nor any extension of the concept to non-string instruments.

3. Method

The proposed system takes data in three modalities as the input: the audio recordings, the video recordings, and the music scores of the chamber music performances. As illustrated in Figure 1, the system uses three parallel modules to model three types of temporal correspondence between motions detected in the video and music events captured in other modalities for different instrumentalists. In this section, we present the system in detail.

3.1. Performance-Score Alignment

As the proposed approach is score informed, a preliminary step for the system is to temporally align the music score with the dynamic timing of the audio-visual ensemble performance (assuming audio and video are pre-synchronized). The temporal alignment is achieved through audio-score alignment on the harmonic content (). To do so, the audio is first converted to short-time Fourier spectral magnitudes with a 42.7 ms frame length (2048 samples at a 48 kHz sampling rate), a 10 ms hop size, a Hamming window, and zero padding to 4 times the original length. The short-time Fourier spectral magnitudes are then mapped to 12-dimensional chroma vectors, where each element represents a pitch class. Each chroma vector is normalized by its root mean square (RMS) value. A similar operation is applied to the score, which is segmented into non-overlapping frames of the same duration using the default tempo notated in the score. A 12-D binary chroma vector is calculated for each frame to indicate the presence (occupying more than 50% of the frame) or absence of each pitch class. The chroma vector is then normalized by its RMS value.
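
To make this preprocessing concrete, the following is a minimal sketch of the audio chroma computation described above (pure NumPy; the function name, the 27.5 Hz lower cutoff, and the bin-to-pitch-class mapping are our own illustrative choices):

```python
import numpy as np

def chroma_from_audio(x, sr=48000, frame_len=2048, hop=480, pad_factor=4):
    """RMS-normalized 12-D chroma vectors from short-time Fourier magnitudes."""
    n_fft = frame_len * pad_factor                     # zero padding to 4x the frame length
    window = np.hamming(frame_len)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    valid = freqs > 27.5                               # ignore DC and very low bins (assumption)
    midi = 69 + 12 * np.log2(freqs[valid] / 440.0)
    pitch_class = np.mod(np.round(midi), 12).astype(int)
    n_frames = 1 + (len(x) - frame_len) // hop
    chroma = np.zeros((n_frames, 12))
    for t in range(n_frames):
        frame = x[t * hop:t * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame, n=n_fft))      # zero-padded spectrum
        for pc in range(12):
            chroma[t, pc] = mag[valid][pitch_class == pc].sum()
        chroma[t] /= np.sqrt(np.mean(chroma[t] ** 2)) + 1e-12   # RMS normalization
    return chroma
```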

In offline scenarios where the entire performance is available beforehand, the alignment can be obtained by the dynamic time warping (DTW) algorithm (), which is robust and efficient (). In online scenarios where the performance data arrives as a live stream, one commonly used framework is an online DTW algorithm (), which provides options such as the “forward-backward strategy” to reconsider past decisions () or incorporating a tempo model () for robustness. An alternative framework employs a stochastic model (; ), where the score position hypotheses are represented by a probability density function. In this paper, to deal with online video streaming scenarios, we apply the online method proposed by Duan and Pardo (), which is based on a hidden Markov model with a 2-D continuous state space representing the score position and tempo. This framework was previously evaluated on the Bach10 dataset (), showing decent results. A further qualitative check confirms good alignment performance on the URMP dataset used in our experiments.
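
For reference, a minimal offline DTW alignment over chroma sequences could look like the sketch below (cosine distance; this illustrates the offline option only, not the online HMM-based follower actually used in the system):

```python
import numpy as np

def dtw_align(audio_chroma, score_chroma):
    """Return (audio_frame, score_frame) index pairs along the optimal DTW path."""
    A = audio_chroma / (np.linalg.norm(audio_chroma, axis=1, keepdims=True) + 1e-12)
    S = score_chroma / (np.linalg.norm(score_chroma, axis=1, keepdims=True) + 1e-12)
    D = 1.0 - A @ S.T                                  # pairwise cosine distances
    T, U = D.shape
    C = np.full((T + 1, U + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, T + 1):                          # accumulated cost matrix
        for j in range(1, U + 1):
            C[i, j] = D[i - 1, j - 1] + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    path, i, j = [], T, U                              # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([C[i - 1, j - 1], C[i - 1, j], C[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```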

3.2. Onset Correspondence with Body Motion

3.2.1. Body Motion Extraction

In music performances, body motion of performers conveys important musical expressions and ideas, e.g., the head nodding at leading notes. For some instruments, body motion directly articulates notes (e.g., strings, drums) or controls the pitch (e.g., trombones). To capture body motion from video recordings, one approach is optical flow estimation. In our previous approach () we apply optical flow estimation to extract bowing motion of string players. However, we argue that this pixel-level analysis may not be ideal for semantic-level understanding of body gestures and movements, and can be less robust to occlusions and camera viewpoint changes.

In this paper, we propose to apply OpenPose (), a multi-person pose estimation approach, on each frame to extract body skeleton coordinates for all the players on stage without pre-segmentation of the video recording. A skeleton in each frame is represented as a 20-D vector y(t) corresponding to the horizontal and vertical coordinates of the 10 upper body joints, including the nose, neck, shoulders, elbows, wrists, and hips. We do not include lower body joints as they are usually less relevant to music events. Figure 2 shows video frames of several instrumentalists with the extracted body skeletons. To form a continuous skeleton sequence across time, we eliminate joint coordinates if the confidence score from OpenPose is smaller than 0.2 and if the L2 distance between a joint in consecutive frames is larger than 10% of the head-hip distance, which is considered the maximal regular movement in a ≈30-FPS video without shot transition. We also temporally smooth the joint coordinates using a moving average with a 5-frame window. These post-processing steps follow (), where the same approach is applied to extract skeletons of pianists. We then take the two hips as reference coordinates to align the body position across frames. Finally, we calculate motion velocities z(t) as the derivative of y(t) w.r.t. time. Compared to optical flow estimation, this gesture-based motion analysis approach is semantically more meaningful, less computationally expensive, and more robust to occlusions and camera viewpoint changes such as camera zooming or panning.
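
A sketch of these post-processing steps is given below. The carry-forward fill for rejected joints and the way the two rejection criteria are combined are our own assumptions, as the text above only states which detections are eliminated:

```python
import numpy as np

def clean_skeletons(joints, conf, head_hip_dist, win=5, conf_th=0.2, jump_ratio=0.1):
    """joints: (T, 10, 2) per-frame upper-body joints; conf: (T, 10) OpenPose confidences.
    Rejected joints are filled from the previous frame, then a 5-frame moving average is applied."""
    joints = joints.astype(float).copy()
    for t in range(1, len(joints)):
        jump = np.linalg.norm(joints[t] - joints[t - 1], axis=-1)
        bad = (conf[t] < conf_th) | (jump > jump_ratio * head_hip_dist)  # unreliable detections
        joints[t][bad] = joints[t - 1][bad]            # carry forward the last accepted estimate
    kernel = np.ones(win) / win
    flat = joints.reshape(len(joints), -1)
    smoothed = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, flat)
    return smoothed.reshape(joints.shape)              # y(t); velocities z(t) via np.diff(..., axis=0)
```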

Figure 2 

Body motion extraction. Upper body skeletons (second row) are extracted with OpenPose () in each video frame (first row) followed by temporal smoothing.

To extract motion related to note onsets in each video frame, for each player we denote the motion velocities of the n past frames as Z ∈ ℝ^{n×20} and apply principal component analysis (PCA) via the eigenvalue decomposition Z^T Z = VΣV^T, where V and Σ represent the matrix of eigenvectors and the diagonal matrix of corresponding eigenvalues, respectively. We then project the motion velocity z(t) onto the principal component direction (the first column of V) and take its absolute value as the motion salience s(t). Choosing the salient motion discards the direction information of the motion (e.g., up/down-bow for violinists), which is less relevant to timing than the amplitude information. We set n to 150 frames, i.e., 5 seconds in time, assuming a player’s pose stays consistent in this range. To reduce the computational cost, we update V every 1 second (assuming consistent motion patterns within a short period).
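
The per-frame motion salience can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def motion_salience(Z, z_t):
    """Z: (n, 20) velocities of the past n frames; z_t: (20,) current velocity.
    Returns s(t), the absolute projection onto the first principal direction."""
    eigvals, V = np.linalg.eigh(Z.T @ Z)       # eigen-decomposition of Z^T Z
    principal = V[:, np.argmax(eigvals)]       # direction of the largest eigenvalue
    return float(abs(z_t @ principal))
```

In an online implementation, the eigenvectors V would only be recomputed once per second, as noted above.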

3.2.2. Onset Likelihood

From the motion salience s(t), we infer the timings of the motion strokes that are potentially related to score note onsets. As a note onset often corresponds to the beginning or ending of a sound articulation motion (e.g., a bowing stroke for string instruments), the motion speed at the onset is often small. Therefore, local minima of the motion salience s(t) are often indicative of note onsets. Let Ω be the set of all the local minima throughout a piece. For each local minimum τ ∈ Ω, we represent the likelihood of a note onset as a(τ) = max_{γ∈[τ, τ+30]} s(γ) − s(τ), which is determined by the maximum speed of the motion stroke within the following 30 frames: the larger the value of a(τ), the more likely that a note onset is activated by the motion stroke. Here 30 frames are considered to span the high-energy part of most notes. Therefore, we can define an onset likelihood curve φb(t) derived from body motion analysis as

(1)
\phi_b(t) = \Big( \sum_{\tau \in \Omega} a(\tau)\, \delta(t - \tau) \Big) * N(t),

where δ(t) is the Dirac delta function, * is the convolution operation, and N(t) is a Gaussian function to give each predicted onset time a tolerance (width) with a standard deviation of 3 frames (100 ms) (considering some slight non-synchronization between different modalities in the recording file). It is noted that φb(t) can be calculated in an online fashion, with a delay of up to 1 second due to the search for the local maximum after each local minimum. Figure 3 plots the onset likelihood curve φb(t) along with the associated and temporally aligned score part as piano-roll, where the note onset timings are marked as red circles. We find that many of the note onsets can be associated with peaks of φb(t). The correspondence between the notes and peaks sets the basis for the association between score and motion, as described below.
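
A minimal sketch of Eq. (1) follows; the Gaussian smoothing is implemented up to a scale factor, and the bookkeeping is simplified relative to a truly online implementation:

```python
import numpy as np
from scipy.signal import argrelmin
from scipy.ndimage import gaussian_filter1d

def onset_likelihood_body(s, lookahead=30, sigma=3):
    """Impulses a(tau) at local minima of the motion salience s(t), smoothed by a Gaussian."""
    s = np.asarray(s, dtype=float)
    phi = np.zeros_like(s)
    for tau in argrelmin(s)[0]:
        phi[tau] = s[tau:tau + lookahead + 1].max() - s[tau]   # peak speed of the following stroke
    return gaussian_filter1d(phi, sigma=sigma)                 # 3-frame (100 ms) tolerance window
```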

Figure 3 

Example correspondence between body motion and note onsets. Top: temporally aligned score part with onsets marked by red circles. Middle: extracted motion salience (primarily bowing motion) from the visual performance of a violin player. Bottom: derived onset likelihood curve from the motion salience.

3.2.3. Pair-wise Correspondence

We extract the motion-based onset likelihood curve for each player from the video performance as ϕb[p](t), where p is the player index. From each part of the temporally aligned score, we use a binary impulse train ψ[q](t) to represent the note onsets, where q is the part index, ψ[q](t) = 1 if there is a note onset in the t-th frame of the q-th score part and ψ[q](t) = 0 otherwise. Then the pair-wise matching score between the p-th player and the q-th score part, up to the t-th frame, can be calculated through inner product:

(2)
M_b[p, q](t) = \sum_{\tau=0}^{t} \phi_b[p](\tau)\, \psi[q](\tau).

This can be updated in an online fashion as new temporal frames arrive.
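
Eq. (2) amounts to a running inner product, e.g. (shapes and names are ours):

```python
import numpy as np

def onset_matching_scores(phi, psi):
    """phi: (P, T) onset likelihoods per player; psi: (Q, T) binary onset trains per score part.
    Returns M of shape (P, Q, T), where M[p, q, t] is the matching score up to frame t."""
    return np.cumsum(phi[:, None, :] * psi[None, :, :], axis=-1)
```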

3.3. Onset Correspondence with Finger Motion

3.3.1. Finger Motion Extraction

While note articulation is visible in the body movements of string instrumentalists, this is generally not the case for woodwind and brass players, where notes are articulated by blowing into the reed or mouthpiece, showing little visible motion around the mouth. However, pitch changes of these instruments are mostly controlled by finger-operated keys, resulting in synchronized events between finger movements and note onsets (). Compared to body motion, finger motion is more subtle and more prone to occlusion. In this section, we propose to extract finger motion and correlate it with note onsets.

We apply OpenPose again to extract the positions of all the finger joints of each player. Due to the limited video resolution and occlusions, the result alone is not robust enough for motion estimation. Inspired by our previous work (), we use optical flow estimation () to capture this subtle motion at the pixel level. To reduce the computational cost, we set a region of interest (ROI) around the detected finger joints from OpenPose for optical flow estimation. The ROI is centered at the median of all the finger joints of each hand and spans to cover all the joints. Similar to the body skeletons, we smooth the joint coordinates using a moving average filter with a window size of 5 frames. Then we compute the optical flow estimation inside the ROI. Again, to eliminate rigid and affine motion, each optical flow vector is reduced by the average motion vector of the ROI, resulting in a motion vector u_(i,j)(t) at pixel (i, j) and frame t. Figure 4 takes one flute player and one clarinet player as examples to visualize the optical flow estimation of one-hand finger motion in five consecutive frames, where the estimated finger joint positions are overlaid on the first video frames.
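
The ROI-based flow computation might be sketched as below; we use OpenCV's Farneback dense flow as a stand-in for the optical flow estimator cited above, and the ROI margin is an illustrative parameter:

```python
import numpy as np
import cv2

def finger_motion(prev_gray, curr_gray, hand_joints, margin=10):
    """Dense optical flow inside a hand ROI with the ROI-mean (rigid/affine) flow removed.
    hand_joints: (K, 2) pixel coordinates of the detected finger joints of one hand."""
    x0 = max(int(hand_joints[:, 0].min()) - margin, 0)
    x1 = int(hand_joints[:, 0].max()) + margin
    y0 = max(int(hand_joints[:, 1].min()) - margin, 0)
    y1 = int(hand_joints[:, 1].max()) + margin
    flow = cv2.calcOpticalFlowFarneback(prev_gray[y0:y1, x0:x1], curr_gray[y0:y1, x0:x1],
                                        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return flow - flow.reshape(-1, 2).mean(axis=0)     # u_(i,j)(t) with the mean motion removed
```

The per-frame motion flux used in Section 3.3.2 is then simply the maximum magnitude of the returned vectors, e.g., np.linalg.norm(flow, axis=-1).max().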

Figure 4 

Optical flow visualization of finger motion in five consecutive frames corresponding to note changes. The color encoding scheme is adopted from Baker et al. ().

3.3.2. Onset Likelihood

In each frame, we take the maximum pixel-wise motion magnitude |u_(i,j)(t)| across all pixels in the ROI as the motion flux, which captures the finger movements corresponding to pitch changes and is directly taken as the onset likelihood φf(t) from finger motion. Figure 5 plots an onset likelihood curve φf(t) along with the associated and temporally aligned score part as a piano-roll. We can observe salient motion flux around most note onset frames. Compared to Figure 3, the correspondence of note onsets to fingering motion for woodwind and brass instruments is not as robust as that to body motion for string instruments. This observation can be attributed to the fact that fine-grained motion is more sensitive to irrelevant motion. In addition, repeated notes for wind instruments are usually not reflected by finger maneuvers on the keys.

Figure 5 

Example correspondence between finger motion and note onsets of a flute player. Top: temporally aligned score part with onsets marked by red circles. Bottom: extracted motion flux from finger movements.

Analogous to Eq. (2), the pair-wise matching score from finger motion can be calculated as:

(3)
M_f[p, q](t) = \sum_{\tau=0}^{t} \phi_f[p](\tau)\, \psi[q](\tau).

3.4. Pitch Correspondence with Vibrato Motion

In addition to the onset time, variations of acoustic features throughout the entire articulation of some notes show correspondence with specific motion. Vibrato is one such feature: a commonly used artistic note articulation that colors a tone and expresses emotion in music performances. Physically, vibrato is generated by periodic pitch modulation of a note. For string instruments, vibrato is often visible as a rolling motion of the left hand on the fingerboard. The relationship between this visible motion and vibrato motivates us to follow our previous work () to extract the fine motion and find its correspondence with pitch contours extracted from the audio modality.

3.4.1. Vibrato Motion Extraction

We retrieve the finger motion u_(i,j)(t) as computed in the previous section. Although the vibrato motion is mostly a rigid motion (the fingers move together with little relative movement), it is periodic and very fast (usually about 4–7.5 Hz ()), and hence it is not removed as slower rigid/affine motions are. Figure 6 illustrates several frames of the optical flow estimation of the vibrato hand motion for two example players. For each frame t, we take the average motion vector across all pixels within the ROI as u(t) = [u_x(t), u_y(t)]^T, where the motion direction is preserved for vibrato detection.

Figure 6 

Optical flow visualization of hand motion corresponding to vibrato articulation. The color encoding scheme is adopted from Baker et al. ().

The vibrato detection module works as a binary classifier as proposed and trained by Li et al. (). The classifier is implemented as a support vector machine (SVM) that takes as input an 8-D feature extracted from each sample, including the zero crossing rate of the x- and y- motion velocities and their auto-correlations, the energy in the 3–9 Hz frequency range, and the auto-correlation peaks. According to Li et al. (), this method achieves a vibrato detection accuracy of over 90%, regardless of the polyphony number and instrument type within the string instrument family. Here each input sample is a 1-second segment of u(t) (again introducing an average 0.5-second delay of the association system).

For each detected vibrato segment, we perform PCA on u(t) within this 1-second segment to obtain the 1-D principal motion velocity curve v(t). We then integrate v(t) over time to calculate a motion displacement curve, d(t), which corresponds to the length fluctuation of the vibrating string, and hence the pitch fluctuation of the note. We normalize each vibrato segment of d(t) to zero mean and unit variance. We set the non-vibrato segments of d(t) to zero.
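
A sketch of this per-segment processing (PCA projection, integration, and normalization) is given below; names are ours:

```python
import numpy as np

def vibrato_displacement(u):
    """u: (T, 2) mean ROI motion vectors u(t) over a detected 1-second vibrato segment.
    Returns the normalized motion displacement curve d(t)."""
    uc = u - u.mean(axis=0)
    _, _, Vt = np.linalg.svd(uc, full_matrices=False)  # PCA via SVD
    v = uc @ Vt[0]                                     # 1-D principal motion velocity v(t)
    d = np.cumsum(v)                                   # displacement = integral of velocity
    return (d - d.mean()) / (d.std() + 1e-12)          # zero mean, unit variance
```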

3.4.2. Pitch Contour Extraction

Utilizing the score information, we apply Soundprism (), an online score-informed source separation system, to separate the polyphonic audio mixture into individual sources. Note that although audio recordings of individual instrumental tracks are available in the dataset, we do not use them as they are not generally available in real concert scenarios. To extract the pitch contour, we perform a score-informed pitch estimation step on each separated audio source, as described by our previous work (). The pitch contour of each note segment is normalized to have zero mean and unit variance, and is denoted as f(t). The normalization operation discards the original pitch height information, and only preserves the pitch changes from the central frequency within each note. Figure 7 plots a 1-second segment of the normalized pitch contour overlaid with a motion displacement curve from the associated track (left) and a random track (right). Similar to Eqs. (2) and (3), we calculate the vibrato correspondence as:

(4)
M_v[p, q](t) = \sum_{\tau=0}^{t} d[p](\tau)\, f[q](\tau).
Figure 7 

The same segment of normalized pitch contour f(t) (green) overlaid with the motion displacement curve d(t) (black) from the associated track (left) and another random track (right).

3.5. Integrating All Correspondences

We integrate the three modules to calculate the pair-wise correspondence between visual motion and score or audio events considering both onset timing and the entire note articulation process. The calculation is presented as

(5)
M[p, q](t) = w_b(t)\, \bar{M}_b[p, q](t) + w_f(t)\, \bar{M}_f[p, q](t) + w_v(t)\, \bar{M}_v[p, q](t),

where M¯b[p,q](t), M¯f[p,q](t), and M¯v[p,q](t) represent the normalized correspondence across all of the pair-wise combinations between N players and N tracks as:

(6)
\bar{M}_i[p, q](t) = \frac{M_i[p, q](t)}{\sum_{p,q=1}^{N} M_i[p, q](t)}, \quad i \in \{b, f, v\},

and wb, wf, wv represent the weighting parameters to re-scale the normalized correspondences from different modules. Weight wv is set as 2wf, to place greater emphasis on finger motion with vibrato patterns. Weights wb and wf are linearly related to their motion salience/flux in the past frames as

(7)
\frac{w_b(t)}{w_f(t)} = \frac{\sum_{\tau=0}^{t} s(\tau)}{\sum_{\tau=0}^{t} \phi_f(\tau)}.

The linear relationship recovers the original scale of body and finger motion to weight the correspondences Mb(t) and Mf(t). It allows the system to focus on the part with stronger motion cues, such as body motion for string instrumentalists and finger motion for wind instrumentalists. In Section 4, we test the components in isolation as well as some combinations of them.
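
A compact sketch of Eqs. (5)–(7) is shown below. Eq. (7) only fixes the ratio between wb and wf, so we take the accumulated salience and flux directly as the two weights (one possible reading) and set wv = 2 wf:

```python
import numpy as np

def integrate_correspondences(Mb, Mf, Mv, cum_salience, cum_flux):
    """Mb, Mf, Mv: (N, N) module correspondences at the current frame.
    cum_salience / cum_flux: accumulated body salience and finger flux up to now."""
    norm = lambda M: M / (M.sum() + 1e-12)    # Eq. (6): normalize over all N x N pairs
    w_b, w_f = cum_salience, cum_flux         # Eq. (7): w_b / w_f = sum s / sum phi_f
    w_v = 2.0 * w_f                           # emphasize finger motion with vibrato patterns
    return w_b * norm(Mb) + w_f * norm(Mf) + w_v * norm(Mv)
```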

For an ensemble with N players, the number of possible associations is the factorial of N. Let σ(·) be a permutation function from p ∈ [1, N] to q ∈ [1, N] that represents one association candidate, where the p-th player is associated with the σ(p)-th track. For each association candidate σ, we calculate an overall association score as the product of the N pair-wise correspondence values. The final association solution σ̂ is the one that maximizes the association score:

(8)
\hat{\sigma} = \arg\max_{\sigma} \prod_{p=1}^{N} M[p, \sigma(p)] = \arg\min_{\sigma} \sum_{p=1}^{N} -\log M[p, \sigma(p)].

The replacement of product with sum of negative logarithms makes the efficient Hungarian algorithm () directly applicable for finding the best association.
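
With the pairwise matrix M in hand, the assignment of Eq. (8) reduces to a standard linear assignment problem, e.g. with SciPy (a sketch; the epsilon guards against zero scores):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_association(M):
    """M: (N, N) integrated correspondences M[p, q](t). Returns {player p: track sigma(p)}."""
    cost = -np.log(M + 1e-12)                    # product of scores -> sum of negative logs
    players, tracks = linear_sum_assignment(cost)
    return dict(zip(players.tolist(), tracks.tolist()))
```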

4. Experiments

4.1. Dataset

The proposed source association system is evaluated on the URMP dataset (). To the best of our knowledge, this is the only publicly available multi-track audio-visual music performance dataset that is suitable for our evaluations. It contains 44 classical chamber ensemble pieces ranging from duets to quintets, assembled from 149 individually recorded tracks. Each piece comes with an audio recording (48 kHz, 24 bits) of the ensemble performance along with the audio recording of each individual instrument track, an assembled video recording (1080p, 29.97 FPS) of all instrumentalists as a whole, pitch and note annotations for each track, and the corresponding MIDI file as the music score. In the assembled video recording, players are arranged horizontally from left to right, with the right-front side exposed to the camera. The video has a static view without camera panning, zooming, or shot transitions during the whole performance. The whole dataset is accessible from ().

We further expand the dataset by creating all possible track combinations within each piece. In the expanded set, audio is remixed from the provided individual audio tracks. For video, we directly use the estimated pose of each player from the original assembled videos for the augmented track combinations. This gives results equivalent to first creating the assembled videos of the augmented instrumental combinations and then running OpenPose on them, but requires less computation in the experiments. For the example of a quartet, we further generate 6 duets and 4 trios from the 4 original tracks. Note that we do not combine tracks across pieces, to ensure the naturalness of the expanded set. The total expanded dataset comprises 171 duets, 126 trios, 47 quartets, and 7 quintets. The number of pieces for different instrument arrangements is listed in Table 1.
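
The expansion is a simple enumeration of sub-ensembles per piece, e.g. (the helper name is ours):

```python
from itertools import combinations

def expand_piece(track_ids):
    """All sub-ensembles with at least two tracks, including the original ensemble itself;
    a quartet, for instance, yields 6 duets, 4 trios, and the quartet."""
    return [combo for k in range(2, len(track_ids) + 1)
            for combo in combinations(track_ids, k)]
```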

Table 1

The number of pieces for different instrument arrangements from the original and expanded URMP dataset.

            String   Wind   Mixed   Total

Original
  Duet          2       6      3      11
  Trio          2       6      4      12
  Quartet       5       6      3      14
  Quintet       2       4      1       7

Expanded
  Duet         57      91     23     171
  Trio         41      65     20     126
  Quartet      15      25      7      47
  Quintet       2       4      1       7

To further understand the dataset, we calculate the onset overlap rate for each original piece. This statistic is defined as the percentage of onset positions that are shared by two or more tracks of the piece. It is relevant to the performance of the proposed source association approach, as two out of the three motion analysis modules rely on onset patterns to associate players with tracks. Figure 8 plots this statistic for all of the original 44 pieces. While the rate varies considerably from one piece to another, we see a general increasing trend as the polyphony increases.
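
The statistic can be computed as sketched below, assuming onsets are quantized to common score frames (a tolerance window could be used instead):

```python
from collections import Counter

def onset_overlap_rate(onsets_per_track):
    """onsets_per_track: list of per-track onset frame indices for one piece.
    Returns the fraction of distinct onset positions shared by two or more tracks."""
    counts = Counter(f for track in onsets_per_track for f in set(track))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c >= 2) / len(counts)
```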

Figure 8 

Onset overlap rate for each piece from the original URMP dataset.

4.2. System Setup

For implementation, the audio is processed with a frame length of 42.7 ms and a hop size of 10 ms for score following and pitch contour extraction. When calculating the vibrato correspondence, the motion curve extracted from the 29.97-frame-per-second (FPS) video is up-sampled to 100 FPS, enabling a synchronized time resolution between the audio and video. As vibrato detection is performed on 1-second segments and the onset likelihood curve from body motion is derived from a local maximum within future 30 video frames (≈1 second), the system has a 1-second inherent delay when it runs for real-time applications. The past 5 seconds of body and finger motion velocities are stored in memory to apply PCA (described in Section 3.2.1) and to calculate the weighting parameters in Eq. (7).

For evaluation, we first address each track independently to investigate the quality of the extracted onset likelihood features, using traditional onset detection measures. Then we evaluate the association performance on the expanded set of ensemble pieces. The results are grouped by ensemble type and size, from duets to quintets, which directly correlates with the difficulty level. Note that regardless of the number of tracks present in the performance, only one association is correct. We do not include a quantitative evaluation of the score following and vibrato detection modules in this paper, since they have been fully evaluated in previous work.

4.3. Onset Detection Evaluation

As two modules of the proposed system rely on the synchronization cues of onset timing between different modalities, we evaluate the quality of the proposed onset likelihood curves extracted from body motion and finger motion. To do so, we set up an onset detection task. We take the onset likelihood curve as the onset detection function () and perform peak-picking to retrieve the onsets. A true positive is counted when a detected onset is within a tolerance window of 3 video frames (100 ms). This is wider than the standard 50 ms in the literature, since precise timing is not the main focus of the source association system.
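
The evaluation boils down to tolerance-based matching of detected and ground-truth onsets; a sketch with a simple greedy one-to-one matching (one possible protocol) is:

```python
def onset_precision_recall(detected, reference, tol=3):
    """detected / reference: onset positions in video frames; tol: tolerance in frames (3 = 100 ms)."""
    ref = sorted(reference)
    matched = [False] * len(ref)
    tp = 0
    for d in sorted(detected):
        for i, r in enumerate(ref):
            if not matched[i] and abs(d - r) <= tol:   # greedy one-to-one match
                matched[i] = True
                tp += 1
                break
    precision = tp / max(len(detected), 1)
    recall = tp / max(len(ref), 1)
    return precision, recall
```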

Figure 9 plots precision versus recall obtained by varying the peak-picking threshold on the onset likelihood curves extracted from body motion and finger motion, respectively. Precision and recall are calculated for each instrument across all pieces in the original dataset. Figure 9 reveals that the onset likelihood curve extracted from body motion shows better correlation with the ground-truth onset timings for string instruments, while that from finger motion shows better correlation for woodwind and brass instruments. An exception is the trombone, where the onset likelihood curve extracted from body motion shows better correlation than that from finger motion. This observation is not surprising as trombone pitch changes (hence note transitions) are mainly performed by moving the slide with the right arm (body motion).

Another interesting observation is that although the onset likelihood curve φb(t) in Figure 3 is visually less noisy than φf(t) in Figure 5, the recall calculated from φb(t) for string instruments cannot reach as high a value as that calculated from φf(t) for wind instruments. We argue that this is because legato bowing (i.e., articulating a sequence of notes within one sustained bowing action) is widely used in string instrument performances, so onset detection from bow motion misses some true positives. This explains the upper bound of recall rates (around 80%, as in Figure 9(a)) for string instruments. For wind instruments, there are also onsets that are not visible, such as repeated notes, but their number is much smaller, which explains why the recall rates can approach 100% in Figure 9(b).

Figure 9 

Onset detection evaluation results from: (a) body motion, and (b) finger motion, for different instruments.

4.4. Source Association Evaluation

In this section we evaluate the source association performance, first for each module (corresponding to each component of Eq. (5)) independently, and then for the fully integrated approach. We use association accuracy as the evaluation measure, defined as the percentage of correctly associated pieces among all testing pieces. A piece is considered correctly associated if the exactly correct bijection between players and score or audio tracks is retrieved. Note that the difficulty of source association increases dramatically from small to large ensembles. In a quintet ensemble, there are in total 5! = 120 bijection candidates, and only one is considered correct. Therefore, we divide our evaluation based on the size of ensembles.

Besides the ensemble size, the length of the performance also affects the difficulty of the association problem, as longer pieces provide richer cues. In an online setting, we hope that the proposed system can retrieve the correct association as quickly as possible. Therefore, in the experiments, we segment the testing pieces into non-overlapping excerpts of each of the following lengths: 5, 10, 15, 20, 25, and 30 seconds. When doing so, we first remove the first and last 5 seconds of each piece, as the performance may not cover the entire length of those segments. This segmentation further expands the testing pieces to a large number of evaluation samples, totaling 17,574, as presented in Table 2.

Table 2

The number of evaluation samples with different length and instrumentation for source association.


String     Excerpt duration (sec)
               5     10     15     20     25     30

  Duet      1323    642    420    303    236    200
  Trio      1044    506    333    240    189    158
  Quartet    355    172    114     82     65     54
  Quintet     64     31     21     15     12     10

Wind       Excerpt duration (sec)
               5     10     15     20     25     30

  Duet      1809    887    557    435    323    266
  Trio      1275    626    391    309    229    187
  Quartet    474    232    145    115     86     68
  Quintet     66     32     20     16     12      9

Mixed      Excerpt duration (sec)
               5     10     15     20     25     30

  Duet       441    203    141     96     82     60
  Trio       380    174    121     82     70     51
  Quartet    199     92     64     44     37     28
  Quintet     22     10      7      5      4      3

4.4.1. Body Motion

We first evaluate the source association performance using the normalized onset correspondence M¯b between score parts and body motion (the first component of Eq. (5)). Figure 10(a)–(c) shows the association accuracy for ensembles consisting of string, wind, and mixed instruments with various levels of polyphony. Note that the “All Ensembles” evaluated in Figure 10(c) and (f) contain all the instrument categories from Table 2, i.e., String+Wind+Mixed. For each piece, we plot how the association accuracy varies as the duration of the input stream increases from 5 to 30 seconds. Each marker in the figure is the association accuracy calculated from the number of excerpts shown in Table 2.

Figure 10 

(a)–(c): Source association accuracy only using onset correspondence between score parts and body motion (the first component M¯b in Eq. (5)). (d)–(f): Source association accuracy only using onset correspondence between score parts and finger motion (the second component M¯f in Eq. (5)).

Comparing different ensemble sizes, the association accuracy decreases as the number of players or tracks increases. From Figure 10(a), we find that correlating onsets with body motion is beneficial for string instruments. Note that this evaluation is reproduced from our previous work () as one baseline system here. The accuracy increases as the duration of video stream increases, which provides more cues to solve the association. The accuracy reaches around 90% for all ensemble sizes when the video stream duration reaches 30 seconds. This strategy based on onset correspondence from body motion, however, is not effective for wind instruments, where the association accuracy remains around random guess accuracy as shown in Figure 10(b), e.g., 1/6 for trios. This observation is consistent with our expectations and the onset detection evaluations in Figure 9.

4.4.2. Finger Motion

We then evaluate the source association performance using the normalized onset correspondence M¯f between score parts and finger motion (the second component of Eq. (5)). The association accuracy is plotted in Figure 10(d)–(f), with the same set of pieces used for the evaluations plotted in Figure 10(a)–(c). From Figure 10(d)–(f) we can observe that finger motion is a more prominent cue for note onsets for wind instruments (except for trombone). When a 30-second video excerpt is available, the association accuracy reaches about 90% for all sizes of wind ensembles. These observations are also consistent with our onset detection evaluations in Figure 9. For string instruments, however, the extracted finger motion is mostly vibrato motion, which is not relevant to note onsets.

Figure 10 also reveals some limitations of the source association solution based on onset-motion correspondence. First, there are many note onsets not revealed by body or finger motion, such as notes played with legato bowing for string instruments and repeated notes from wind instruments, as analyzed in Section 4.3. Second, as note synchronization between players is fundamental to ensemble performance, note onsets between tracks have high chances to overlap with each other, as shown in Figure 8. These limitations restrict the association performance for approaches that only rely on onset-motion correspondence, especially from short video excerpts.

4.4.3. Vibrato Motion

The correspondence between pitch fluctuations and vibrato motion (denoted as M¯v, the third component of Eq. (5)) helps to retrieve the source association on a finer level for string instrumentalists. The evaluation result is plotted in Figure 11(a) for the same set of pieces performed by string ensembles used for evaluations plotted in Figure 10(a). Note that this baseline is the same system as proposed in our previous work (). We do not include the wind instrument group here since no vibrato pattern can be detected from finger motion. We find that the source association reaches a high accuracy from shorter video clips, i.e., 90% after 10 seconds. The limitation of this approach is that vibrato articulation is not guaranteed to be always present in the performance. We thus combine this module with the onset correspondence from body motion, the two dominant cues to solve association for string instruments, to evaluate the association accuracy as shown in Figure 11(b). The two components from M¯b and M¯v work together to reach a high association accuracy from a short video stream.

Figure 11 

Source association accuracy of string ensembles by (a) only using vibrato correspondence between pitch fluctuation and hand motion (M¯v in Eq. (5)), and (b) combining vibrato correspondence with onset correspondence from body motion (M¯b and M¯v in Eq. (5)).

4.4.4. The Integrated System

Finally, we evaluate the complete source association system with all the modules integrated, as presented in Eq. (5). The evaluated pieces are the same as those plotted in Figure 10. This constitutes a universal source association system for common melodic instruments. Overall, wind ensembles are less likely to be correctly associated than string ensembles, since only the subtle finger motion contributes to the correspondence with onset events. This correspondence is often inaccessible due to overlapping onsets across tracks or repeated notes, as analyzed in Section 4.4.2. Comparing Figure 12(a) with Figure 11(b), or Figure 12(b) with Figure 10(e), we observe that adding components with irrelevant association cues does not harm the system, thanks to the weighting strategy over the different modules in Eq. (5). Comparing Figure 12(c) with Figure 10(c) and (f), the integrated system greatly improves the association accuracy for pieces with mixed types of instruments. The association accuracy for mixed ensembles lies between that of pure string and pure wind ensembles.

Figure 12 

Source association accuracy of ensembles with different instrumentation using all of the three modules: onset correspondence from body motion, onset correspondence from finger motion, and vibrato correspondence from hand motion (Eq. (5)).

4.5. Discussion

The proposed source association system is designed and evaluated for the online scenario, where no system component relies on performance data after the current time instant. Note that due to the limitations of the dataset, we have not systematically evaluated the robustness of the system against camera viewpoint changes. However, we argue that this will not be a major problem for the proposed system, as rigid/affine motions are easy to eliminate by setting up reference points (e.g., players’ hips) after extracting the skeleton data for each player. Another challenge in real-world applications is introduced by camera shot transitions in music video post-production. One suggested strategy is to clear the accumulated association scores and re-register the players when a shot transition is detected, although further experiments are needed to validate this strategy. Another limitation of the experiments is that all the players in the dataset have their front-right side facing the camera, with most finger motion visible. If this is not satisfied in real scenarios, only the first computation module (correspondence between body motion and note onsets) provides useful information, making the system effective only for string ensembles.

5. Conclusion

In this paper, we propose an online source association system for Western chamber ensembles, which aims to retrieve the association between players in the video and the audio or score tracks through the analysis of cross-modal temporal correspondences. We design three modules to model different correspondences between 1) body motion and note onsets, 2) finger motion and note onsets, and 3) vibrato motion and pitch fluctuations. Although these correspondences apply to different kinds of instruments, the proposed system automatically integrates them in an adaptive fashion, without the need to know the instrument types. This makes the system a universal framework for common instruments in Western chamber ensembles, including string, woodwind, and brass instruments. In addition, the system runs in an online fashion to update association results as the video stream progresses. Experiments with audio-visual recordings of performances with different levels of polyphony and instrumentation demonstrate that the accuracy of the proposed system increases with the length of the video stream, and that high accuracy is achieved within a relatively short interval. The accuracy for string ensembles is generally better than that for woodwind, brass, and mixed-instrument ensembles because more correspondences are modeled for string instruments.