The Problem with Music Datasets

terry feng
3 min read · Feb 5, 2023

Read about MusicLM: here | MusicLM Examples: here

Million Song Dataset (Dawen Liang, 2011)

Last week, Google published its latest research on generative music making: generating music from text. This state-of-the-art model, MusicLM, pushes the boundary on high-fidelity generated audio that obeys a text prompt. Advocating for further work in this area, Google also released MusicCaps, a dataset of music paired with professionally written captions and descriptions.

MusicLM is, in some ways, a translation model: it translates text, by way of semantic tokens (and acoustic ones in this case), into the sonic realm. This notion of targeting adherence to the text prompt is interesting but really restrictive; it is simply the standard for numerous machine learning tasks. Tasks like speech recognition, image captioning, and classification are quantified by accuracy because there is some level of undeniable truth: it is or it isn’t. There is a hot dog in this picture. Humans label the data; after that, the model runs and does the labeling in their stead.
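To make that framing concrete, here is a minimal, self-contained sketch of the text → semantic tokens → acoustic tokens → audio idea. Everything in it is a stand-in (toy vocabularies, random “models”, made-up helper names), not Google’s code; the actual MusicLM stages are far more involved.

```python
import random

SEMANTIC_VOCAB = 1024    # coarse tokens: long-term structure (melody, rhythm)
ACOUSTIC_VOCAB = 4096    # fine tokens: timbre and recording detail
TOKENS_PER_SECOND = 50   # made-up rate, just for the sketch


def embed_text(prompt: str) -> int:
    """Stand-in for a joint text/music embedding model."""
    return hash(prompt) % (2**32)


def semantic_stage(seed: int, seconds: int) -> list[int]:
    """Stand-in for an autoregressive model that predicts semantic tokens
    (the 'what happens when') conditioned on the text embedding."""
    rng = random.Random(seed)
    return [rng.randrange(SEMANTIC_VOCAB) for _ in range(seconds * TOKENS_PER_SECOND)]


def acoustic_stage(seed: int, semantic: list[int]) -> list[int]:
    """Stand-in for a second model that predicts acoustic tokens
    (the 'how it sounds') conditioned on the text and semantic tokens."""
    rng = random.Random(seed ^ len(semantic))
    return [rng.randrange(ACOUSTIC_VOCAB) for _ in semantic]


def generate(prompt: str, seconds: int = 10) -> list[int]:
    seed = embed_text(prompt)
    semantic = semantic_stage(seed, seconds)
    # A neural audio codec would decode these tokens back into a waveform.
    return acoustic_stage(seed, semantic)


print(len(generate("motivational music for sports")))  # 500 token IDs
```

The part worth noticing is that the “translation” happens over tokens conditioned on a caption, which is exactly why labeled text–audio pairs are the currency this line of work depends on.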

In the context of generative tasks, there really is no ground truth; that’s what makes them difficult. However, some domains are more forgiving. In image generation, you can circumvent this challenge because images are static and thus, to an extent, quantifiable. If a diffusion model like DALL·E 2 or Midjourney generates an accurate image, it contains a quantifiable subject. A large monkey riding a carousel unicorn. We all know what those objects are, and in the image, they’re either there or they’re not. Of course, colors, placement, rotation, and style are up to discretion. But at some level, there’s ground truth.

MusicLM is not the only work of its kind to come out in recent years, but it in some ways borrows the assumptions of the image domain and illogically ports them to the sonic one. It’s easy to define what’s in the music: a piano, a quartet, a C#7b13 chord. But what is the music? Music is an art, but music is an expression. Datasets like MusicCaps rely on professional musicians to label music: its sonic qualities, its harmonic content, even the emotions it evokes. But this is simply a caption. It should be used for a music captioning system. Why is it used for generation, or dare I say composition? You can label something as “Ode to Joy” now that we all know Beethoven’s work, but what did it take to come up with “Ode to Joy” for the very first time in 1824? Why are music generation datasets not built the other way around, as a corpus of text prompts that artists interpret however they see fit? Because there’s far too much variability in those answers; the training would never converge.
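For a sense of what such a label looks like, here is an illustrative MusicCaps-style entry. The field names and values are made up for illustration, not copied from the released dataset.

```python
# An illustrative, made-up example of a captioned music clip in a
# MusicCaps-style dataset (not the official schema).
example = {
    "clip": "10-second excerpt of a recording",
    "caption": (
        "A mellow solo piano piece with a slow tempo and a wistful, "
        "nostalgic mood, suitable for a quiet evening."
    ),
    "aspects": ["piano", "slow tempo", "mellow", "wistful", "instrumental"],
    "annotator": "professional musician",
}
```

Notice that nothing in the record describes how the music came to be; it only describes how it sounds.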

And therein lies the problem of generating music. What does it mean to generate music? Generative music can’t be novel. Training data must be repeatable. Music, when repeated, is a copyright strike. Is MusicLM sonifying a thought translated into text, or is it sonifying sound to match a dimension lacking a label? Music is an interpretation as a means of expression, not simply “motivational music for sports,” as Google likes to put it. If generating music is to be a task that dominates the audio machine learning sphere in popularity over the next few years, human expression needs to be at the forefront, not just 10 professional musicians labeling 5,500 songs by hand.
