AI Generated Music is a solution in search of a problem.

06 Apr 2023 - by: Darkwraith Covenant

How I use algorithmically generated music.

When I was studying music synthesis at Berklee, I became fascinated with the concept of algorithmic, chance-based music. Aleatoric music is actually quite old, dating back as far as the Renaissance era in Europe. It was later popularized by 20th-century composers like John Cage, who I was heavily into at the time. I was experimenting with Max/MSP to create free-jazz-inspired synthetic music using random number generators and MIDI note quantizers. Today, I use a number of generative tools to make music in Darkwraith Covenant. Euclidean sequencing with modular synthesizers is featured on a number of tracks on Demonstrational Document v2.1, for example.

I would someday like to build a bassline generator for my music. My plan is to run my basslines – and some well-known, popular bassline styles – through a machine learning algorithm, training it on the raw rhythm and note data. Each bassline would be converted into binary notation (where 1 is a note event and 0 is a rest), with MIDI note numbers determining pitch. The trained model would then generate new basslines via a sequencer, with parameters for sequence length, note density, style, and so on. Clearly, I am no stranger to music made using artificial intelligence, and I am not against the use of AI to create music. While this tool would technically be a type of “AI,” it differs in a number of ways from the kind of generative AI that has recently set the tech world ablaze.
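
Here is a rough sketch in Python of what that representation and parameter interface might look like. The generator below just rolls random numbers as a stand-in for a trained model, and the note values, scale, and defaults are placeholders, not anything final:

```python
import random

# One bar of 16 steps in binary notation (1 = note event, 0 = rest),
# plus a MIDI note number for each step that fires.
example_rhythm = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
example_pitches = [33, 33, 36, 33, 33, 31]  # one pitch per "1" above

def generate_bassline(length=16, density=0.4, scale=(33, 36, 38, 40, 43)):
    """Return a (rhythm, pitches) pair.

    length  -- number of steps in the sequence
    density -- probability that a step is a note event rather than a rest
    scale   -- pool of MIDI note numbers to draw pitches from
    """
    rhythm = [1 if random.random() < density else 0 for _ in range(length)]
    pitches = [random.choice(scale) for step in rhythm if step == 1]
    return rhythm, pitches

if __name__ == "__main__":
    print(generate_bassline())
```

A trained version would replace the uniform random choices with probabilities learned from the bassline corpus, and a style parameter would pick which learned distribution to sample from.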

How Google’s Music LM works

Google’s Music LM text-to-music model is one example of recent attempts to use machine learning to create entire pieces of music from a text prompt. This new type of AI doesn’t generate bits of musical performance data via MIDI or some other form of music notation. It creates the entire .WAV file at the sample level, constructing the file in its entirety from chunks of data. Think of it like a robot that has listened to several thousand hours of music and can spit out a .WAV file based on that body of musical data. This robot has a textbook of audio examples it learns from, descriptively labeled so it can learn, for example, what a cello sounds like, whether the audio is of low fidelity, whether there is a lot of reverb in the room, whether a man is singing, and so on. This is what’s in the MusicCaps dataset, which consists of a table of human-written captions describing what’s in a set of audio clips taken from 5,521 YouTube videos. The Google Colab notebook for exploring the dataset is here.
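
If you would rather poke at the captions without the notebook, a minimal sketch like this works on a local CSV export of the dataset. The file name and column names below follow the public release, but treat them as assumptions and check them against your copy:

```python
import pandas as pd

# Load a local export of the MusicCaps captions table.
caps = pd.read_csv("musiccaps-public.csv")

# Each row pairs a short YouTube clip with a human-written description.
row = caps.iloc[0]
print(f"https://youtube.com/watch?v={row['ytid']}")   # source video
print(f"clip: {row['start_s']}s to {row['end_s']}s")  # labeled segment
print(row["caption"])                                 # free-text description
```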

[Image: MusicCaps sample-text pair from the Colab notebook]

AI-generated music has inherent flaws.

Music LM doesn’t build a song piece by piece the way a human musician would: first a drum part, then a bassline that works with the drums, then a lead line, vocals, and so on, until the parts lock together into a cohesive groove. The “thought” and planning that normally goes into constructing a piece of organized sound (what we commonly refer to as music) is completely absent from this process, as is the use of microphones to pick up acoustic sound waves from a source that exists in meatspace.

I think this approach is inherently flawed. I don’t know whether any of the designers of Music LM are musicians, but I would expect computer-science-literate musicians to understand why a tool like this isn’t very useful for us and why this method misses the point of music creation. After all, the model doesn’t “know” that it’s creating music. The sound it produces is the machine learning algorithm’s best approximation of what it thinks the user who typed the prompt wants to hear.

Some of the results can be quite good and strikingly convincing as pieces of music that could have been created by humans, especially when generating electronic music of the kind typically made with drum computers, sequencers, synthesizers, and software. Other results are total garbage, producing what I call uncanny music.

In the example below – which Google has labeled “melodic techno” – you might not notice how strangely alien it sounds unless you were paying close attention or were familiar with how music works. The tones are otherworldly, eerie, and quite strange, only sort of approximating what one would consider to be melodic techno.

This next example often falls into an audible uncanny valley, similar to how AI-generated art from deep learning models like Stable Diffusion renders creepy, uncanny oddities like squid fingers on people. Many of the other examples are no better and can still only muster an ephemeral facsimile of human-created music.

Prompt: Epic soundtrack using orchestral instruments.
The piece builds tension, creates a sense of urgency.
An a cappella chorus sing in unison, it creates a sense of
power and strength.

I want to be fair to the designers of Music LM. This emergent field of computer science is cutting edge and difficult, it combines disciplines like neuroscience and data science, and the developers are undoubtedly all brilliant. I do think that with time, refined training, fine-tuning, and further development, these models will get better at producing complete pieces of music that sound convincing enough to musically untrained ears.

For people looking for stock music, hold music, background music, elevator music, and other forms of commodified, throwaway corporate art, it will be a blessing, especially for those who are looking to save money instead of hiring a human composer. That is problematic in and of itself, but this post is not about the economic impact of these tools.

I could see a better approach being a process that trains an AI to build a piece of music more the way a human would. Humans don’t put samples into a buffer one little blip at a time and render the result into a digital sound wave. For more musical results, the AI would ideally build a bassline, design a drum part around it, add leads that fit well with the bass, add percussion, and so on, to create a cohesive piece of music.

It would use a program like Ableton Live to build the music track by track, part by part, until it had satisfied the parameters given by the user. It would then compile those pieces into an arrangement, apply processing and effects, create a mastering chain, and render everything out into a .WAV file. Music LM skips all of this and synthesizes the whole performance without any regard for how music is actually composed.
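
Purely as an illustration of that workflow (and not of how Music LM actually works), a part-by-part pipeline might be structured something like the sketch below. Every generator here is a placeholder stub where a real system would put a trained model conditioned on the parts already written:

```python
# Illustration only: each generate_* function is a stand-in for a model.

def generate_bassline(params):
    return {"part": "bass", "notes": [33, 33, 36, 31]}             # placeholder

def generate_drums(params, against):
    return {"part": "drums", "pattern": [1, 0, 0, 1, 0, 0, 1, 0]}  # placeholder

def generate_lead(params, against):
    return {"part": "lead", "notes": [57, 60, 64, 62]}             # placeholder

def arrange(parts, params):
    return {"arrangement": parts, "bars": params.get("bars", 64)}

def compose(params):
    bass = generate_bassline(params)
    drums = generate_drums(params, against=[bass])        # built around the bass
    lead = generate_lead(params, against=[bass, drums])   # fits bass and drums
    song = arrange([bass, drums, lead], params)
    return song  # a real pipeline would then mix, master, and render a .WAV

print(compose({"bars": 32, "style": "melodic techno"}))
```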

But even if you fine-tuned the model to perfection, it’s still only as good as the data it was trained on. Garbage in, garbage out, as the comp-sci saying goes. It can only generate what it has been exposed to in the training corpus. It cannot spontaneously conceive of something that has never existed, at least not in the way a human brain can. It cannot jam or improvise in any meaningful way. It still has to rely on its training data for its output, whereas humans are great at pulling things out of their asses! You can try prompting it to mash together disparate elements to create something new, but the results are rarely very good.

Humans have an advantage over machines here, in that they can artfully meld disparate influences into something new, exciting, and interesting, because people have actual motivations and emotions behind their intention to create. Ben Goertzel, a pioneer of generative AI who coined the term “Artificial General Intelligence,” or AGI – a term that has recently trended on Twitter among technology and computing circles – spoke about this in a talk posted on YouTube. He explained how you could feed West African drumming, Western Classical music, Black spirituals, and church hymns into a neural network, but it would never spontaneously create Jazz. Given the exact same inputs – in this case the influences that early Jazz musicians drew on – the algorithm will never come up with Jazz, no matter how many times you re-roll the dice. Instead, it will likely generate a strange mashup of those genres, sounding nothing like Charlie Parker or John Coltrane. AI is too “literal” in its interpretations, which is why it is a poor substitute for human ingenuity and creativity at this point in its development.

In this example, the prompt is “accordion rock” which to me doesn’t sound very “rock” to begin with.

I think if you asked a musician to spontaneously come up with accordion rock, it would probably sound more like this:

AI-generated music, like other forms of AI art, won’t be replacing humans any time soon

  • Well-worn tools for stochastically generating new musical ideas already exist. The Mutable Instruments Grids module is a drum pattern generator that creates kick, snare, and hi-hat patterns from a machine-learning-trained X/Y grid of popular drum rhythms. Euclidean sequencers like vpme.de’s Euclidean Circles can already create patterns based on geometric equations that sound very musical and interesting (see the sketch after this list). These patterns have emerged organically in a number of musical genres, especially non-Western forms such as traditional African and Caribbean drumming rhythms. These tools already work very well for generating new musical ideas on the fly.

  • It’s a solution in search of a problem. We have no shortage of recorded music, musicians, and composers on the planet, with more arriving every day, and many of those musicians would rather work for free than be automated out of a job.

  • Humans will still need to guide these outputs to make something that’s objectively good and not completely soulless, uncanny garbage. I could see it being great for things like demos, basic riffs, or inspiring new ideas, but what makes music emotionally resonant are the subtle human touches that AI still can’t quite replicate. Not with methods like Music LM, anyway.

  • In real-world applications, there’s simply too much entropy in the universe for these tools to guarantee an objectively good piece of art every time they are run, just as you are not likely to roll a 7 every single time you roll two six-sided dice. It’s going to generate crap more often than not. This problem won’t be solved until you can automate “taste,” by which time AGI will have been achieved and we might have bigger issues to consider.

  • Humans need human-to-human interaction. That, so far, is what sets us apart from machines. Until we get actual cyborgs, this is going to be a problem. This is why people still go to see live music, even though they could simply listen to a band on a recording. There is a human connection that human beings are hardwired to crave. Put an audience in front of an automaton that can sing and dance versus a human being who can do the same, and the people will choose the human every time, with few exceptions. Even a virtual band like Gorillaz is the exception rather than the rule, and the characters themselves are performed live behind the scenes by real musicians. Daft Punk weren’t actual robots up in that giant pyramid, and Kraftwerk still perform their entire set live rather than simply pressing a button on a machine.
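
To make the Euclidean sequencing point from the first bullet concrete, here is a minimal sketch of generating a Euclidean rhythm in code. It uses the well-known even-distribution formula rather than any particular module’s implementation, and the pulse and step counts are just example values:

```python
def euclidean_rhythm(pulses, steps):
    """Spread `pulses` note events as evenly as possible over `steps` steps.

    Returns a list of 1s (note events) and 0s (rests). This even-distribution
    formula yields the same patterns as the Bjorklund algorithm used in
    hardware Euclidean sequencers, up to rotation.
    """
    return [1 if (i * pulses) % steps < pulses else 0 for i in range(steps)]

# E(3, 8): the "tresillo" pattern common in Afro-Caribbean rhythms.
print(euclidean_rhythm(3, 8))   # [1, 0, 0, 1, 0, 0, 1, 0]

# E(5, 16): a denser pattern often heard as a techno hi-hat or percussion line.
print(euclidean_rhythm(5, 16))
```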

Hominid evolution over a million years wired us to be social animals, and that’s not going away any time soon. It also wired us to pick up on the subtle cues that make things that aren’t quite human seem “off.” While I can certainly see potential in generative AI tools, and these models are only going to get better, they’re not going to be generating piano concertos to rival Beethoven anytime soon. Musicians probably don’t need to worry quite yet about being completely replaced by AI. We should probably be more worried about the fact that companies like Google are pushing for a world where businesses can save a buck and maximize profits at the expense of workers.

