Abram Hindle's Blog

Software is Hard

Deep Learning Bitmaps to PCM

Deep Learning Bitmaps to PCM, Audio fun with deep belief networks

Can we learn from video frames to produce audio? The training set is synchronized audio and video, from which we train a deep belief network to convert a bitmap of a video frame into PCM audio.

My former master’s student Gregory Burlet wrote a master’s thesis on guitar transcription using deep learning. I thought I’d join the fray and try an idea I had with deep learning. Prior authors had relied on relatively simple features or reduced representations of the data, such as a resized bitmap or down-sampled audio, using that raw data as features instead of more complicated summaries. Gregory used short-time Fourier transforms (STFTs) to describe the input audio. I decided not to use audio as input; I wanted to associate a video frame with audio.

Deep Learning Setup

Thus I set up a DBN like so:

Input: 64x64 gray scaled pixels -> 
             deep belief network -> 
               PCM audio (floating point samples)

The training data / validation data is whatever video I feel like, and different videos give different results. The output is the PCM audio of that frame. I thought, wow, if the DBN could produce PCM audio that would be pretty interesting; there is a lot of complicated structure in audio signals, and if a DBN can do it well that will be really impressive.

Input frames were scaled down to 64x64 grayscale bitmaps, with each pixel represented as a value within [0,1]. Audio was monaural and resampled to 22050 Hz PCM floats.
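
As a rough sketch of that preprocessing (not the actual pickler.py; OpenCV is an assumed dependency here and the function names are made up for illustration), the 735 samples per frame fall out of 22050 Hz / 30 fps:

    import cv2          # assumed dependency for frame handling
    import numpy as np

    SR = 22050                       # audio sample rate (Hz)
    FPS = 30                         # video frame rate
    SAMPLES_PER_FRAME = SR // FPS    # 735 PCM samples line up with one frame

    def frame_to_input(frame_bgr):
        """Scale a video frame down to a 64x64 grayscale vector in [0, 1]."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, (64, 64))
        return (small.astype(np.float32) / 255.0).reshape(-1)   # 4096 inputs

    def audio_to_target(mono_pcm, frame_index):
        """Grab the 735 float samples that correspond to one video frame."""
        start = frame_index * SAMPLES_PER_FRAME
        return mono_pcm[start:start + SAMPLES_PER_FRAME]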

Training took between 2000 and 7000 minutes per brain. Some brains were simple: 4096 inputs -> 1000 units -> 735 outputs. Some were more complicated, such as 4096 -> 1000 -> 1000 -> 1000 -> 735 or 4096 -> 2048 -> 1000 -> 1000 -> 735.
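
For a sense of scale, here is a minimal numpy sketch of the shallowest shape, 4096 -> 1000 -> 735. The real brains come out of theanet.py’s DBN training; the random weights, sigmoid hidden units, and tanh output below are illustrative assumptions only:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Illustrative random weights only; the real weights come out of training.
    rng = np.random.RandomState(0)
    W1 = rng.normal(scale=0.01, size=(4096, 1000)); b1 = np.zeros(1000)
    W2 = rng.normal(scale=0.01, size=(1000, 735));  b2 = np.zeros(735)

    def frame_to_pcm(frame_vec):
        """Map a 64x64 grayscale frame (4096 floats in [0,1]) to 735 samples."""
        h = sigmoid(frame_vec @ W1 + b1)   # hidden layer (assumed sigmoid units)
        return np.tanh(h @ W2 + b2)        # output squashed to roughly [-1, 1]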

Training Data

In this repository I have provided numerous video examples and brains that you are free to play with.

  • armstrong-basic – a brain trained off of a video of John Armstrong et al. playing rock music with a theremin. See armstrong-basic/armstrong-basic.avi.webm or youtube. Network: 4096 -> 1000 -> 735
  • lines-small – lines for clarinet by John Osborne, a local Edmonton animator. See vimeo. Network: 4096 -> 1000 -> 735
  • osborne-combined-big – trained off of a larger compilation of John Osborne videos.
  • seeing-a-sound-shallow – trained on Seeing a sound quickly.
  • KUNGFURY – trained on KUNG FURY the movie. See kungfury.com and youtube. Network: 4096 -> 2000 -> 1500 -> 1000 -> 735
  • lines – lines for clarinet by John Osborne. See vimeo. Network: 4096 -> 1000 -> 1000 -> 1000 -> 735
  • ramshackletyping – trained on a video I shot of the Olm(?) typing. Network: 4096 -> 1000 -> 735
  • seeing-a-sound-deeper – trained on Seeing a sound quickly (vimeo) by John Osborne. Network: 4096 -> 1000 -> 1000 -> 1000 -> 735

Observations

It produces audio! The audio isn’t great. The audio often responds to action on the screen, but it doesn’t respond to theme or content. There is no memory. There are often repeating, annoying noises.

It took between 2000 and 7000 minutes to train each brain on a CPU. Kung Fury wasn’t finished training by the time this was written.

The audio is awful; there are often 30 Hz harmonics throughout, caused by cutting each frame’s sound off abruptly with no windowing. Windowing can improve the situation but still induces 30 Hz noise.
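
To make the 30 Hz problem concrete: each frame contributes 735 samples (1/30 s), and butting those chunks end to end puts a discontinuity at every frame boundary, which reads as a 30 Hz buzz plus harmonics. A sketch of the windowed variant (my own illustration, not the render.sh code) shows why windowing removes the clicks but still leaves a 30 Hz modulation:

    import numpy as np

    SAMPLES_PER_FRAME = 735              # 22050 Hz / 30 fps

    def render_windowed(frame_chunks):
        """Taper each frame's 735 predicted samples with a Hann window before
        concatenating.  The hard clicks at frame boundaries go away, but the
        level still dips to zero every 1/30 s, so a 30 Hz modulation remains."""
        window = np.hanning(SAMPLES_PER_FRAME)
        return np.concatenate([window * np.asarray(c, dtype=float)
                               for c in frame_chunks])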

I used CSound to reinterpret the sound via granular synthesis; that worked better but lost its on-time edge. Granular synthesis smears events.
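
The CSound instrument isn’t reproduced here, but the granular idea can be sketched in Python: cut short Hann-windowed grains from the predicted PCM and layer them with overlap and a little timing jitter. The grain, hop, and jitter values below are arbitrary placeholders rather than what I actually used; the jitter is also why the result loses its on-time edge.

    import numpy as np

    def granulate(pcm, sr=22050, grain_ms=50, hop_ms=20, jitter_ms=10, seed=0):
        """Rebuild a signal from short overlapping Hann-windowed grains.
        The overlaps smooth out frame-boundary noise, but the random grain
        start times smear transients, so events lose their on-time edge."""
        pcm = np.asarray(pcm, dtype=float)
        rng = np.random.RandomState(seed)
        grain = int(sr * grain_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        jitter = int(sr * jitter_ms / 1000)
        window = np.hanning(grain)
        out = np.zeros(len(pcm))
        for pos in range(0, len(pcm) - grain, hop):
            src = int(np.clip(pos + rng.randint(-jitter, jitter + 1),
                              0, len(pcm) - grain))
            out[pos:pos + grain] += window * pcm[src:src + grain]
        return out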

See the Rendered Examples section at the end of this document for all rendered examples.

Armstrong-basic

Trained on armstrong-basic/armstrong-basic.avi.webm or youtube. This is a complicated scene filmed with a camera, without a lot of visual difference between frames. I think this leads to really blaring output for unseen animations.

For Alphabet Conspiracy the raw rendering sounds awful, but the granular synthesis seems to work with the talking X-ray.

Osborne’s Etudes come out very loud but interesting:

I like the on-time response seen in the hand animation Ode to Jimi:

Kung Fury

See kungfury.com and youtube.

A large dataset seems to produce more pleasant PCM output.

Some of the granular synthesis seems quite appropriate:

1392099724.mkv

The 20-second Borys video did not work so well: Raw

Human figures seem to have more effect on the sound.

Fire sounds pretty good.

Kung Fury seems like a better-sounding dataset / brain than the others. Perhaps more data and deeper networks are much better?

Lines and lines-small

Lines for clarinet by John Osborne

Both do quite well trained on themselves:

But the smaller network seems to produce more interesting sound with Osborne’s Seeing a sound quickly:

Perhaps I need to ensure that I’m properly training my network given the performance of the shallower network.

Osborne-combined-big

This dataset was a 15 minute long concatenation of some of the works of John Osborne. The results tend to sound a lot like the other networks.

Fire sounds pretty good.

For granular synthesis Etude 2 stands out:

ramshackletyping

This one illustrates what a lack of variation in training data can do. Just brutal noise.

Here are some of the better tracks (less noise, still bad):

Essentially, if you want really aggressive sounds, maybe train on less data and overfit to the input?

Here it is overfitting to itself:

Seeing a sound quickly

One problem with training on this video is that there isn’t a lot of variation. It is very binary: on or off.

There seems to be little differentiation between deep and shallow in this case.

The lines for clarinet video is similar to the seeing a sound quickly video and works quite well:

Discussion

Activity on a black background is a natural choice; scratched film seems like a good input.

A wider range of training inputs leads to more robust output, but a tighter, higher-accuracy brain seems to produce more sonically interesting results.

In general everything sounds pretty similar so I am not impressed by the results of this experiment.

The difference between shallow and deep networks is not really that sonically evident.

A common interpretation seems to be that white is loud and black is not. This could be a problem.

Suggestions

This experiment sounds interesting and horrible at the same time! What can be done to improve the sound?

  • Every training set should include 30 seconds or so of black screens and white screens with silent audio. That way the system would keep black screens quiet, as we expect them to be.

  • Use history; this is a very stateless approach. An RNN might be a great idea.

  • Is PCM the most efficient representation? If I want to produce sonically interesting output, perhaps I might do better in a frequency space (STFT) or a vocoder space (see the sketch after this list).

  • Color and past frames were not included. Furthermore, no analysis of the frames was used either. Perhaps an eigenfaces style of operation would work, whereby the bitmaps’ eigenvectors / PCA components are used.
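
As a rough sketch of the frequency-space suggestion (an assumption about how it could be set up, not something that was trained here), the training target for each video frame could be an STFT magnitude spectrum instead of 735 raw samples:

    import numpy as np

    SR, FPS = 22050, 30
    SAMPLES_PER_FRAME = SR // FPS        # 735

    def stft_target(mono_pcm, frame_index):
        """An alternative training target: the magnitude spectrum of the audio
        that lines up with one video frame, instead of its 735 raw samples.
        Getting sound back out would still need a phase estimate (e.g.
        Griffin-Lim), which is not shown here."""
        start = frame_index * SAMPLES_PER_FRAME
        chunk = mono_pcm[start:start + SAMPLES_PER_FRAME]
        spectrum = np.fft.rfft(np.hanning(len(chunk)) * chunk)
        return np.abs(spectrum)          # 368 magnitude bins for 735 samples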

Conclusions

Briefly, I’ll conclude: without context from prior frames or the sound that was already output, the quality of the audio output is pretty low. Either we need way more training data, which I don’t want to spend time on, or we need to add more context to the frame. There’s an inherent independence assumption: 1 frame of video induces 1 frame of audio. But consider that 1 guitar pluck induces an audible signal for a lot longer than the pluck itself, so there’s a slight problem.

Yet what this shows is that you can produce associations, even if they are slightly overfit, and they can have some musical value.

We do not recommend generating raw PCM data; intermediate representations might be more appropriate.

Attribution

John Osborne is a local animator who I have been working with. His animations are great, but I’m not sure he likes any of the sounds I put to them :(

These videos are © John Osborne – assume similar rules to CC-BY-NC-ND

Public domain images from Archive.org

  • 015-loud_barking_and_guitar.1397370485.10527-out.15-loud_barking_and_guitar.wav.audio.mkv
  • 114-tones.1397368837.20976-out.114-tones.wav.audio.mkv
  • 1408297309.27876-out.caffeine.wav.audio.mkv
  • 1408304868.8993-out.caffeine.wav.audio.mkv

Assume Public domain

Abram’s photos and images and video

  • 20secondBorys.mp4
  • belch-kitchen-sample.mp4
  • drone-sample.mp4 – video of the Olm
  • govid3-oldsketch.mkv
  • MVI_9117.mov
  • osborne-seeing-sound.mp4
  • spikey-mouth-loop.mkv
  • VID_20130404_003435.mp4.1384674117.corpus.mkv
  • VID_20130531_132327.mp4.1384676233.corpus.mkv

Assume CC-BY 4.0 Abram Hindle

Public domain from Archive.org

  • alphabet-conspiracy.mp4 – Alphabet Conspiracy
  • Bimbo’s_Initiation_1931.mp4 – Fleischer Studios’ Bimbo’s Initiation (1931)

I think these might have some images from Evelyn Berg in it:

  • 1392098818.mkv
  • 1392098671.mkv
  • 1392099724.mkv

Assume CC-BY-NC.

Many ideas and inspiration are from Gregory Burlet:

https://peerj.com/preprints/1193/

Burlet G, Hindle A. (2015) Isolated instrument transcription using a deep belief network. PeerJ PrePrints 3:e1455 https://dx.doi.org/10.7287/peerj.preprints.1193v1

How to use this stuff

This repository is for support files and examples of applying mostly deep multilayered perceptrons (deep belief networks) to the task of converting video frames to PCM.

Training is simple: run pickler.py on a video to generate video.pkl and audio.pkl, then run theanet.py to learn a brain between the two. This can take more than a week for 30 minutes of video. Once a theanet.py.net.pkl is produced, you can run render.sh to produce a rendered version of a video.

There are 2 render modes: raw and granular synthesis. Raw has issues with 30 Hz harmonics (30 fps), and granular synthesis isn’t always on time.

Current observations: the audio produced is high frequency, but the length of each output chunk is not enough to produce continuous low-frequency tones anyway. A lot of the output is noise.

Latest source code should be here:

Assume GPL3.0 license on all source code.

Assume GPL3.0 on all DBN pickles.

Rendered Examples