Music segmentation#


Introduction#

Music segmentation can be framed as a change point detection task and can therefore be carried out with ruptures. Roughly, it consists of finding the temporal boundaries of meaningful sections, e.g. the intro, verse, chorus and outro of a song. This is an important task in the field of music information retrieval.

The adopted approach is summarized as follows:

the original sound is transformed into an informative (multivariate) representation;
mean shifts are detected in this new representation using a dynamic programming approach.
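The second step can be illustrated on a toy example. The following is a minimal sketch, in plain Python, of dynamic programming for mean-shift detection on a univariate signal. It is not the implementation used in ruptures, and the names `l2_cost`, `detect_mean_shifts` and `toy_signal` are hypothetical. As in ruptures, the returned list ends with the signal length.

```python
def l2_cost(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the mean on signal[start:end]."""
    n = end - start
    s = prefix[end] - prefix[start]
    sq = prefix_sq[end] - prefix_sq[start]
    return sq - s * s / n


def detect_mean_shifts(signal, n_bkps):
    """Return the end indices of the n_bkps + 1 segments with minimal total L2 cost."""
    n = len(signal)
    # Prefix sums make each segment cost an O(1) computation.
    prefix, prefix_sq = [0.0], [0.0]
    for x in signal:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    inf = float("inf")
    # best[k][t]: minimal cost of splitting signal[:t] into k + 1 segments.
    best = [[inf] * (n + 1) for _ in range(n_bkps + 1)]
    # last[k][t]: position of the last change point in that optimal split.
    last = [[0] * (n + 1) for _ in range(n_bkps + 1)]
    for t in range(1, n + 1):
        best[0][t] = l2_cost(prefix, prefix_sq, 0, t)
    for k in range(1, n_bkps + 1):
        for t in range(k + 1, n + 1):
            for s in range(k, t):
                candidate = best[k - 1][s] + l2_cost(prefix, prefix_sq, s, t)
                if candidate < best[k][t]:
                    best[k][t] = candidate
                    last[k][t] = s

    # Backtrack from the end of the signal to recover the change points.
    bkps, t = [n], n
    for k in range(n_bkps, 0, -1):
        t = last[k][t]
        bkps.append(t)
    return sorted(bkps)


# Toy piecewise constant signal with mean shifts at indices 5 and 10.
toy_signal = [0.0] * 5 + [5.0] * 5 + [1.0] * 5
toy_bkps = detect_mean_shifts(toy_signal, n_bkps=2)
print(toy_bkps)
```

The table `best` is filled row by row, so the optimal costs for fewer change points are obtained along the way.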

In this example, we use the well-known tempogram representation, which is based on the onset strength envelope of the input signal, and captures tempo information [Grosche2010].

To load and manipulate sound data, we use the librosa package [McFee2015].

Setup#

First, we make the necessary imports.

import matplotlib.pyplot as plt
import librosa
import librosa.display  # explicit import, needed by older librosa versions for specshow
import numpy as np
from IPython.display import Audio, display

import ruptures as rpt  # our package

We can also define a utility function.

def fig_ax(figsize=(15, 5), dpi=150):
    """Return a (matplotlib) figure and ax objects with given size."""
    return plt.subplots(figsize=figsize, dpi=dpi)

Load the data#

A number of example recordings are available in librosa; the librosa documentation gives the complete list. In this example, we choose the Dance of the Sugar Plum Fairy from The Nutcracker by Tchaikovsky.

We can listen to the music as well as display the sound envelope.

duration = 30  # in seconds
signal, sampling_rate = librosa.load(librosa.ex("nutcracker"), duration=duration)

# listen to the music
display(Audio(data=signal, rate=sampling_rate))

# look at the envelope
fig, ax = fig_ax()
ax.plot(np.arange(signal.size) / sampling_rate, signal)
ax.set_xlim(0, signal.size / sampling_rate)
ax.set_xlabel("Time (s)")
_ = ax.set(title="Sound envelope")



Signal segmentation#

Transform the signal into a tempogram#

The tempogram describes how the tempo profile (in beats per minute, BPM) evolves along the time axis.

# Compute the onset strength
hop_length_tempo = 256
oenv = librosa.onset.onset_strength(
    y=signal, sr=sampling_rate, hop_length=hop_length_tempo
)
# Compute the tempogram
tempogram = librosa.feature.tempogram(
    onset_envelope=oenv,
    sr=sampling_rate,
    hop_length=hop_length_tempo,
)
# Display the tempogram
fig, ax = fig_ax()
_ = librosa.display.specshow(
    tempogram,
    ax=ax,
    hop_length=hop_length_tempo,
    sr=sampling_rate,
    x_axis="s",
    y_axis="tempo",
)
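To give a rough idea of what the tempogram encodes, here is a simplified sketch in plain Python: it autocorrelates a toy onset envelope and converts the strongest lag into a tempo. The actual tempogram computes such autocorrelations locally, over sliding windows; this global version, the helper names and the toy values (a pulse every 43 frames, a 22050 Hz sampling rate) are assumptions made for illustration only.

```python
def autocorrelation(envelope, lag):
    """Unnormalized autocorrelation of the envelope at a given lag (in frames)."""
    return sum(a * b for a, b in zip(envelope, envelope[lag:]))


def lag_to_bpm(lag, sr, hop_length):
    """Convert a lag in frames to a tempo in beats per minute."""
    return 60.0 * sr / (hop_length * lag)


toy_sr, toy_hop = 22050, 256  # illustrative values
period = 43  # hypothetical: one onset every 43 frames
toy_envelope = [1.0 if i % period == 0 else 0.0 for i in range(430)]

# Search the lag with the strongest self-similarity.
candidate_lags = range(20, 101)
best_lag = max(candidate_lags, key=lambda lag: autocorrelation(toy_envelope, lag))
estimated_bpm = lag_to_bpm(best_lag, toy_sr, toy_hop)
print(best_lag, round(estimated_bpm, 1))
```

With these toy values the strongest lag is the pulse period, which corresponds to a tempo of about 120 BPM.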

Detection algorithm#

We choose to detect changes in the mean of the tempogram, which is a multivariate signal. This amounts to selecting the \(L_2\) cost function (see CostL2). To that end, two methods are available in ruptures:

rpt.Dynp(model="l2")
rpt.KernelCPD(kernel="linear")

Both will return the same results, but the latter is implemented in C and is therefore significantly faster.
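The reason both methods agree is that, on any segment, the cost induced by a linear kernel coincides with the \(L_2\) cost (sum of squared distances to the segment mean). The following sketch checks this identity numerically on a small multivariate segment; the function names and data are hypothetical, and this is not how ruptures computes the costs internally.

```python
def dot(u, v):
    """Euclidean inner product, i.e. the linear kernel."""
    return sum(a * b for a, b in zip(u, v))


def l2_segment_cost(segment):
    """Sum of squared Euclidean distances to the segment mean."""
    n, dim = len(segment), len(segment[0])
    mean = [sum(x[d] for x in segment) / n for d in range(dim)]
    total = 0.0
    for x in segment:
        centered = [a - m for a, m in zip(x, mean)]
        total += dot(centered, centered)
    return total


def linear_kernel_segment_cost(segment):
    """Kernel cost: sum of k(x, x) minus the mean-scaled sum of all k(x, y)."""
    n = len(segment)
    diagonal = sum(dot(x, x) for x in segment)
    gram_total = sum(dot(x, y) for x in segment for y in segment)
    return diagonal - gram_total / n


segment = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [0.0, 2.0]]
cost_l2 = l2_segment_cost(segment)
cost_kernel = linear_kernel_segment_cost(segment)
print(cost_l2, cost_kernel)
```

Since both costs are equal on every segment, minimizing their sum over segmentations yields the same change points.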

Number of changes#

In order to choose the number of change points, we use the elbow method. In the change point detection setting, this heuristic consists of:

plotting the sum of costs for 1, 2,...,\(K_{\text{max}}\) change points,
picking the number of changes at the "elbow" of the curve.

Intuitively, adding change points beyond the "elbow" only provides a marginal decrease of the sum of costs.
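In this example the elbow is picked visually. For readers who prefer an automatic rule, one common alternative (an assumption here, not the method used in this notebook) is to take the point of the curve farthest from the straight line joining its first and last points. The sketch below applies it to made-up cost values.

```python
def elbow_index(costs):
    """Index of the point farthest from the line joining the first and last points."""
    n = len(costs)
    x0, y0 = 0, costs[0]
    x1, y1 = n - 1, costs[-1]

    def distance_to_chord(i):
        # Proportional to the distance between (i, costs[i]) and the chord;
        # the constant normalization factor does not change the argmax.
        return abs((y1 - y0) * i - (x1 - x0) * costs[i] + x1 * y0 - y1 * x0)

    return max(range(n), key=distance_to_chord)


# Made-up sums of costs for 1, 2, ..., 10 change points (illustration only).
toy_costs = [100.0, 70.0, 45.0, 25.0, 10.0, 9.0, 8.0, 7.5, 7.0, 6.5]
n_bkps_elbow = elbow_index(toy_costs) + 1  # costs[0] corresponds to 1 change point
print(n_bkps_elbow)
```

Such a rule is only a heuristic; checking the curve by eye remains a sensible safeguard.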

Here, we set \(K_{\text{max}} := 20\).

Note

In rpt.Dynp and rpt.KernelCPD, whenever a segmentation with \(K\) changes is computed, all segmentations with 1, 2,..., \(K-1\) changes are also computed and stored. Indeed, thanks to the dynamic programming approach, segmentations with fewer changes are available for free as intermediate calculations. Therefore, users who need to compute segmentations with several numbers of changes should start with the one with the most changes.

In addition, note that, in ruptures, the sum of costs of a segmentation defined by a set of change points bkps can easily be computed using:

algo = rpt.KernelCPD(kernel="linear").fit(signal)
algo.cost.sum_of_costs(bkps)

(Replace rpt.KernelCPD by the algorithm you are actually using, if different.)
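To make the quantity concrete, here is a small plain-Python sketch of what a sum of costs means for the \(L_2\) cost on a univariate signal. The helpers `segment_cost` and `sum_of_costs` are hypothetical stand-ins written for this illustration, not the ruptures implementation; as in ruptures, `bkps` lists the end index of each segment, the last one being the signal length.

```python
def segment_cost(segment):
    """L2 cost of one segment: squared deviations from the segment mean."""
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)


def sum_of_costs(signal, bkps):
    """Total cost of the segmentation whose segments end at the indices in bkps."""
    starts = [0] + bkps[:-1]
    return sum(segment_cost(signal[start:end]) for start, end in zip(starts, bkps))


toy_signal = [0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0]
cost_no_change = sum_of_costs(toy_signal, [9])
cost_one_change = sum_of_costs(toy_signal, [3, 9])
cost_two_changes = sum_of_costs(toy_signal, [3, 6, 9])
print(cost_no_change, cost_one_change, cost_two_changes)
```

On this toy signal the sum of costs decreases as change points are added and reaches zero once every mean shift is captured.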

# Choose detection method
algo = rpt.KernelCPD(kernel="linear").fit(tempogram.T)

# Choose the number of changes (elbow heuristic)
n_bkps_max = 20  # K_max
# Start by computing the segmentation with most changes.
# After this call, all segmentations with 1, 2,..., K_max-1 changes are also available for free.
_ = algo.predict(n_bkps_max)

array_of_n_bkps = np.arange(1, n_bkps_max + 1)


def get_sum_of_cost(algo, n_bkps) -> float:
    """Return the sum of costs of the segmentation with `n_bkps` change points."""
    bkps = algo.predict(n_bkps=n_bkps)
    return algo.cost.sum_of_costs(bkps)


fig, ax = fig_ax((7, 4))
ax.plot(
    array_of_n_bkps,
    [get_sum_of_cost(algo=algo, n_bkps=n_bkps) for n_bkps in array_of_n_bkps],
    "-*",
    alpha=0.5,
)
ax.set_xticks(array_of_n_bkps)
ax.set_xlabel("Number of change points")
ax.set_title("Sum of costs")
ax.grid(axis="x")
ax.set_xlim(0, n_bkps_max + 1)

# Visually we choose n_bkps=5 (highlighted in red on the elbow plot)
n_bkps = 5
_ = ax.scatter([n_bkps], [get_sum_of_cost(algo=algo, n_bkps=n_bkps)], color="r", s=100)

Visually, we choose 5 change points (highlighted in red on the elbow plot).

Results#

The tempogram can now be segmented into homogeneous (from a tempo standpoint) portions. The results are shown in the following figure.

# Segmentation
bkps = algo.predict(n_bkps=n_bkps)
# Convert the estimated change points (frame counts) to actual timestamps
bkps_times = librosa.frames_to_time(bkps, sr=sampling_rate, hop_length=hop_length_tempo)

# Displaying results
fig, ax = fig_ax()
_ = librosa.display.specshow(
    tempogram,
    ax=ax,
    x_axis="s",
    y_axis="tempo",
    hop_length=hop_length_tempo,
    sr=sampling_rate,
)

for b in bkps_times[:-1]:
    ax.axvline(b, ls="--", color="white", lw=4)

Visually, the estimated change points indeed separate portions of the signal with a relatively constant tempo profile. Going back to the original music signal, this intuition can be verified by listening to the individual segments defined by the change points.

# Convert the change point timestamps to sample indices in the original signal
bkps_time_indexes = (sampling_rate * bkps_times).astype(int).tolist()

for segment_number, (start, end) in enumerate(
    rpt.utils.pairwise([0] + bkps_time_indexes), start=1
):
    segment = signal[start:end]
    print(f"Segment n°{segment_number} (duration: {segment.size/sampling_rate:.2f} s)")
    display(Audio(data=segment, rate=sampling_rate))

Segment n°1 (duration: 1.76 s)

Segment n°2 (duration: 8.03 s)

Segment n°3 (duration: 8.46 s)

Segment n°4 (duration: 2.73 s)

Segment n°5 (duration: 5.97 s)

Segment n°6 (duration: 3.04 s)

The first segment corresponds to the silent part of the signal (visible on the plot of the signal envelope). The following segments correspond to different rhythmic portions, and the associated change points occur when various instruments start or stop playing.
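The segment durations follow from a simple unit conversion: a change point expressed in tempogram frames is multiplied by the hop length to obtain a sample index, then divided by the sampling rate to obtain a time in seconds. To the best of our understanding, this is what librosa.frames_to_time computes with its default arguments. The sketch below reproduces the arithmetic in plain Python; the helper names, the frame indices and the 22050 Hz sampling rate are assumptions chosen for illustration, not the values obtained above.

```python
def frames_to_seconds(frames, sr, hop_length):
    """Convert frame indices to timestamps in seconds."""
    return [frame * hop_length / sr for frame in frames]


def frames_to_sample_indices(frames, hop_length):
    """Convert frame indices to sample indices, using integer arithmetic only."""
    return [frame * hop_length for frame in frames]


toy_sr, toy_hop = 22050, 256  # illustrative values
toy_frame_bkps = [100, 500, 1000]  # hypothetical change points, in frames

toy_times = frames_to_seconds(toy_frame_bkps, toy_sr, toy_hop)
toy_samples = frames_to_sample_indices(toy_frame_bkps, toy_hop)

# Duration of each segment, from consecutive sample indices.
toy_durations = [
    (end - start) / toy_sr for start, end in zip([0] + toy_samples[:-1], toy_samples)
]
print([round(t, 2) for t in toy_times], toy_samples)
```

Going directly from frames to sample indices stays in integer arithmetic, which avoids any truncation effect when a floating-point timestamp is cast back to an integer.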

Conclusion#

This example shows how to apply ruptures to a music segmentation task. More precisely, we detected mean shifts in a well-suited representation (the tempogram) of a music signal. The number of changes was determined heuristically (with the "elbow" method) and the results agreed with visual and auditory intuition.

Such results can then be used to characterize the structure of music and songs, for music classification, recommendation, instrument recognition, etc. This procedure could also be enriched with other musically relevant representations (e.g. the chromagram) to detect other types of changes.

Authors#

This example notebook has been authored by Olivier Boulant and edited by Charles Truong.

References#

[Grosche2010] Grosche, P., Müller, M., & Kurth, F. (2010). Cyclic tempogram - a mid-level tempo representation for music signals. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5522–5525.

[McFee2015] McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., & Nieto, O. (2015). Librosa: audio and music signal analysis in Python. Proceedings of the Python in Science Conference, 8, 18–25.