How to teach a computer to sound human
(You can listen to a podcast about this audio here.)
It’s not often I hear audio that blows my mind.
Today I’ve received audio that will form the basis of a new piece of music.
It was made by neural synthesis.
By its nature, it’s unique and (to me at least) precious and extraordinary.
It was created using PRiSM SampleRNN, a computer-assisted compositional tool that generates new audio by ‘learning’ the characteristics of an existing corpus of sound or music.
I’m fortunate to be part of the Unsupervised project at the Royal Northern College of Music’s PRiSM research centre.
PRiSM leads interdisciplinary and reflexive research between the creative arts and the sciences, with a view to making a real contribution to society, developing new digital technology and creative practice, and addressing fundamental questions about what it means to be human and creative today.
Creating audio by neural synthesis is quite a long process.
It’s unpredictable. Consequently, it’s really exciting. You never know what you’re going to get at the end of it.
To begin the process, I made three hours’ worth of small wav files containing samples of human speech – tiny fragments of the kind of spontaneous verbal sounds humans make.
That’s a LOT of editing.
In this three-hour dataset of audio there are small clips of speaking, breathing, laughing, shouting, whispering – all in multiple pitches and tones.
Next, the audio files are fed into the computer. Once that’s complete, the machine learning can start; the audio is ‘learned’ by the algorithm.
This takes the computer (a supercomputer, no less) about two weeks. This part of the process was overseen by Dr Christopher Melen at the University of Manchester.
He managed to build a very large dataset from the material I supplied; in fact, it was one of the largest he’s ever worked on!
Datasets are created by chopping up the original audio into many smaller chunks (typically 5-8 seconds long, perhaps with an overlap between consecutive chunks), then randomly shuffling them.
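The chop-and-shuffle step above can be sketched in a few lines of Python. This is a minimal illustration, not PRiSM SampleRNN’s actual preprocessing code: it works on a flat list of samples rather than real wav files, and the chunk length, overlap and seed are all assumptions chosen for the example.

```python
import random

def make_chunks(samples, sample_rate, chunk_secs=8, overlap_secs=1, seed=42):
    """Chop a waveform into fixed-length overlapping chunks, then shuffle.

    `samples` is a flat sequence of audio samples; real tooling would read
    these from wav files. Chunk length, overlap and seed are illustrative.
    """
    chunk_len = chunk_secs * sample_rate
    hop = chunk_len - overlap_secs * sample_rate  # step between chunk starts
    chunks = [samples[i:i + chunk_len]
              for i in range(0, len(samples) - chunk_len + 1, hop)]
    random.Random(seed).shuffle(chunks)  # randomise the training order
    return chunks

# Toy example: 60 "seconds" of audio at a tiny 100 Hz sample rate
audio = list(range(60 * 100))
chunks = make_chunks(audio, sample_rate=100)
print(len(chunks), len(chunks[0]))  # 8 chunks of 800 samples each
```

Shuffling matters because consecutive chunks come from the same recording; mixing them up stops the network from simply memorising long stretches of the source audio in order.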
Audio was generated every 5 epochs, with 5 files generated at a time.
The files are suffixed with the epoch number and the temperature (t=0.95). The temperature is a parameter of the generation process which controls the amount of randomness in the output samples.
We’re going to experiment by changing the temperature to 0.99. We expect the generated sounds to be ‘busier’ but we don’t really know. It’s intriguing.
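For readers curious what the temperature actually does: in standard neural sampling, the network’s raw output scores (logits) are divided by the temperature before being turned into probabilities, so higher temperatures flatten the distribution and make unlikely samples more frequent. The sketch below shows that standard technique; the function name and numbers are illustrative, not PRiSM SampleRNN’s internals.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from raw network scores after temperature scaling.

    Dividing logits by the temperature before the softmax sharpens the
    distribution when t < 1 and flattens it when t > 1, so t = 0.99
    gives noticeably 'busier' output than t = 0.95.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index according to the resulting probabilities
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

# Illustrative scores for three candidate audio samples
print(sample_with_temperature([2.0, 1.0, 0.1], 0.95, random.Random(0)))
```

At the extremes, a temperature near zero almost always picks the single most likely sample (repetitive, static output), while a very high temperature approaches uniform noise.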
If the algorithm is tweaked even slightly, the computer-generated audio output can be significantly changed.
The sounds the computer has delivered contain riches and many surprises. There are files for saved epochs from the training run – one epoch represents a single pass through the complete training data.
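Putting the cadence together – generation every 5 epochs, 5 files at a time, each suffixed with epoch number and temperature – the set of output files from a run can be sketched like this. The naming scheme here is my illustration, not PRiSM SampleRNN’s actual file format.

```python
def checkpoint_filenames(num_epochs, every=5, files_per_save=5, temperature=0.95):
    """Filenames for audio generated during a training run.

    One epoch is a single pass through the complete training data; every
    `every` epochs, `files_per_save` clips are generated and tagged with
    the epoch number and the sampling temperature. The naming scheme is
    illustrative only.
    """
    names = []
    for epoch in range(1, num_epochs + 1):
        if epoch % every == 0:
            for i in range(1, files_per_save + 1):
                names.append(f"sample_e{epoch}_t{temperature}_{i}.wav")
    return names

print(checkpoint_filenames(10)[:3])
```

Saving audio at regular epoch intervals lets you hear the network improve: early epochs tend to produce noisy babble, while later ones capture more of the source material’s character.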
The machine-learned audio is strange, other-worldly, oddly human but also not human. To anyone fascinated by sound, it’s bloody awesome.
The network is not designed to generate comprehensible speech.
It can’t learn language, in the sense of the meaning of words or sentences.
It’s not designed for that. It’s purely about sound.
We’ve found speech to be one of the most interesting sounds on which to train the network.
When I was editing the original files for the dataset, I wanted to make it as challenging as possible for the algorithm to make assumptions about the audio; I wanted to make it difficult for the computer to ‘learn.’
Creating the initial dataset is an important part of the creative and compositional process: what you include changes the flavour of the dish, so to speak.
I edited my original audio across words and pitches; I included and excluded silences; I separated out audio centred on stable pitches and volumes, then mixed it all back together. I made folders of audio containing specific tones, pitches and durations so we could alter the algorithm accordingly to change the output.
Effectively, I wanted the computer to have to work hard to give me its best attempt at ‘learning human.’
The results have blown me away.
I’ve got so many ideas as to how I’m going to use it, layer it, deconstruct it, reassemble it.
I can’t wait to make a start.
The final yet-to-be-composed piece of music will incorporate live speech and instrumentation, but the core of the work will be the machine-learned, computer-generated audio.
The idea behind this piece has a philosophical concept at its heart.
More about that once I’ve written it…!
The music will be presented (perhaps performed) at an event in the summer.
I’m so grateful to Dr Sam Salem, Dr Christopher Melen, PRiSM, the University of Manchester, the University of Oxford and the RNCM for their help and support with this project.
Listen to me talking about this audio here.
Learn more about the work I do in my sound studio here.
To learn more about my music, click here.