Abram Hindle's Blog

Software is Hard

Deep Learning Bitmaps to PCM

Deep Learning Bitmaps to PCM, Audio fun with deep belief networks

Can we learn from video frames to produce audio? Our training set can be synchronized audio and video, whereby we train a deep belief network to convert a bitmap of a video frame into PCM audio.

My former master’s student Gregory Burlet wrote a master’s thesis on guitar transcription using deep learning. I thought I’d join the fray and try an idea I had with deep learning. Prior authors had relied on relatively simple features or reduced representations of data, such as resizing a bitmap or down-sampling audio, and used that raw data as features instead of more complicated summaries. Gregory used short-time Fourier transforms (STFTs) to describe the input audio. I decided not to use audio as input; instead I wanted to associate a video frame with audio.

Deep Learning Setup

Thus I set up a DBN like so:

Input: 64x64 gray scaled pixels -> 
             deep belief network -> 
               PCM audio (floating point samples)

The training and validation data is whatever video I feel like. Different videos give different results. The output is the PCM audio of that frame. I thought, wow, gee, if the DBN could produce PCM audio that would be pretty interesting: a lot of complicated things go on in audio signals, and if a DBN can do it well that’ll be really impressive.

Input frames were scaled down to 64x64 grayscale bitmaps with each pixel represented as a value within [0,1]. Audio was monaural and resampled to 22050 Hz PCM floats.
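To make the input format concrete, here is a minimal sketch of turning one video frame into the 4096-value input vector. It is not the code from pickler.py: the function name `frame_to_input`, the channel-mean grayscale, and the nearest-neighbour resize are all assumptions chosen to keep the example dependency-free.

```python
import numpy as np

def frame_to_input(frame_rgb, size=64):
    """Turn an HxWx3 uint8 frame into a flat vector of size*size floats in [0, 1].

    Hypothetical helper: grayscale is a plain channel mean and the resize is
    nearest-neighbour index selection, both stand-ins for a real image library.
    """
    gray = frame_rgb.astype(np.float64).mean(axis=2) / 255.0
    height, width = gray.shape
    rows = np.arange(size) * height // size   # source row for each output row
    cols = np.arange(size) * width // size    # source column for each output column
    small = gray[np.ix_(rows, cols)]
    return small.reshape(-1)
```

A 640x480 frame comes out as a vector of 64 * 64 = 4096 values, which matches the input layer size used below.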

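Training took between 2000 and 7000 minutes per brain. Some brains were simple: 4096 inputs –> 1000 units –> 735 outputs. Some were more complicated, such as 4096 –> 1000 –> 1000 –> 1000 –> 735 or 4096 –> 2048 –> 1000 –> 1000 –> 735. The 4096 inputs are the 64x64 pixels, and the 735 outputs are one video frame’s worth of audio: 22050 samples per second divided by 30 frames per second.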
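The layer sizes can be checked with a small numpy sketch. This is only an illustration of the shapes, not theanet.py: the sigmoid hidden units, the linear output layer, the random initial weights, and the helper names are assumptions.

```python
import numpy as np

SAMPLE_RATE = 22050                      # Hz, from the audio preprocessing
FPS = 30                                 # video frames per second
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS   # audio samples per video frame

def init_layers(sizes, rng):
    """Create (weights, bias) pairs for consecutive layer sizes (untrained)."""
    return [(rng.normal(0.0, 0.01, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Map one flattened frame to one block of PCM samples."""
    for i, (weights, bias) in enumerate(layers):
        x = x @ weights + bias
        if i < len(layers) - 1:           # assumed: sigmoid on hidden layers only
            x = 1.0 / (1.0 + np.exp(-x))
    return x

def parameter_count(sizes):
    """Number of weights plus biases in a fully connected stack."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

rng = np.random.default_rng(0)
shallow = init_layers([4096, 1000, SAMPLES_PER_FRAME], rng)
pcm_block = forward(shallow, np.zeros(4096))
```

Even the shallow brain has several million parameters, which goes some way to explaining the long CPU training times.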

Training Data

In this repository I have provided numerous video examples and brains that you are free to play with.

  • armstrong-basic – This is a brain trained off of a video of John Armstrong et al. playing rock music with a theremin. See armstrong-basic/armstrong-basic.avi.webm or youtube. Network 4096 –> 1000 –> 735
  • lines-small – lines for clarinet by John Osborne a local Edmonton animator. Network 4096 –> 1000 –> 735. See vimeo
  • osborne-combined-big – trained off of a larger compilation of John Osborne videos:
  • seeing-a-sound-shallow – trained from Seeing a sound quickly
  • KUNGFURY — Trained on KUNG FURY the movie. 4096 –> 2000 –> 1500 –> 1000 –> 735 . See kungfury.com and youtube
  • lines – lines for clarinet by John Osborne. See vimeo . Network 4096 –> 1000 –> 1000 –> 1000 –> 735
  • ramshackletyping. Trained on a video I shot of the Olm? Typing. Network 4096 –> 1000 –> 735
  • seeing-a-sound-deeper from Seeing a sound quickly vimeo by John Osborne. Network 4096 –> 1000 –> 1000 –> 1000 –> 735


It produces audio! The audio isn’t great. The audio often responds to action on the screen, but it doesn’t respond to theme or content. There is no memory. There are often annoying repeating noises.

It took between 2000 and 7000 minutes to train each brain on a CPU. Kung Fury wasn’t finished training by the time this was written.

The audio is awful. There are often 30 Hz harmonics throughout the audio because each frame’s sound is cut off abruptly and no windowing is applied. Windowing can improve the situation but still induces 30 Hz noise.
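A rough sketch of why this happens, assuming the raw render simply concatenates one 735-sample block per frame (the function name `stitch_frames` is hypothetical). Without a window, every block boundary can be a discontinuity, and boundaries occur 30 times per second. With a Hann window each block fades to zero at its edges, which removes the clicks but imposes a 30 Hz amplitude envelope instead.

```python
import numpy as np

def stitch_frames(blocks, window=True):
    """Concatenate per-frame audio blocks into one signal.

    blocks: array of shape (n_frames, samples_per_frame).
    window: if True, taper each block with a Hann window before joining.
    """
    blocks = np.asarray(blocks, dtype=np.float64)
    if window:
        blocks = blocks * np.hanning(blocks.shape[1])  # broadcast over frames
    return blocks.reshape(-1)

blocks = np.ones((3, 735))                  # three constant blocks as a toy input
raw = stitch_frames(blocks, window=False)
tapered = stitch_frames(blocks, window=True)
```

In the tapered signal the constant input turns into three humps, one per frame, which is the 30 Hz modulation described above.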

I used CSound to reinterpret the sound as granular synthesis. That worked better but lost its on-time edge, because granular synthesis smears events.
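The smearing can be illustrated with a toy overlap-add granulator in numpy. This is not the CSound instrument used for the renders; the function name and the grain length, hop, and jitter values are assumptions picked for illustration. Each output grain is read from a randomly shifted position in the source, so an event in the source ends up spread over neighbouring moments in the output.

```python
import numpy as np

def granulate(source, grain_len=735, hop=368, jitter=200, seed=0):
    """Rebuild a signal from overlapping, Hann-windowed grains.

    Each grain is written at `start` but read from a position shifted by up
    to `jitter` samples, which is what blurs the timing of events.
    """
    rng = np.random.default_rng(seed)
    source = np.asarray(source, dtype=np.float64)
    out = np.zeros(len(source) + grain_len)
    win = np.hanning(grain_len)
    last_start = len(source) - grain_len
    for start in range(0, last_start + 1, hop):
        shift = int(rng.integers(-jitter, jitter + 1))
        src = min(max(start + shift, 0), last_start)   # keep the read in bounds
        out[start:start + grain_len] += source[src:src + grain_len] * win
    return out[:len(source)]

impulse = np.zeros(22050)
impulse[11025] = 1.0          # a single click half way through one second
smeared = granulate(impulse)
```

Overlapping windowed grains avoid the hard 30 Hz block edges of the raw render, at the cost of the timing precision that the raw render has.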

See the rendered examples section at the end of this document for all rendered examples.


Trained on armstrong-basic/armstrong-basic.avi.webm or youtube. This is a complicated scene filmed from a camera, without a lot of visual difference between frames. I think this leads to really blaring output for unseen animations.

For Alphabet Conspiracy the raw output sounds awful, but the granular synthesis seems to work with the talking x-ray.

Osborne’s Etudes come out very loud but interesting:

I like the on-time response seen in the hand animation Ode to Jimi:

Kung Fury

See kungfury.com and youtube .

A large dataset seems to produce more pleasant PCM output.

Some of the granular synthesis seems quite appropriate:


20 second borys did not work so well: Raw

Human figures seem to have more effect on the sound

Fire sounds pretty good.

Kung Fury seems like a better sounding dataset / brain than others. Perhaps more data and deeper networks are much better?

Lines and lines-small

Lines for clarinet by John Osborne

Both do quite well trained on themselves:

But the smaller network seems to produce more interesting sound with Osborne’s seeing sound:

Perhaps I need to ensure that I’m properly training my network given the performance of the shallower network.


This dataset was a 15 minute long concatenation of some of the works of John Osborne. The results tend to sound a lot like the other networks.

Fire sounds pretty good.

For granular synthesis Etude 2 stands out:


This one illustrates what a lack of variation in training data can do. Just brutal noise.

Here’s some of the better tracks (less noise, still bad):

Essentially if you want really aggressive sounds, maybe train on less and overfit to the input?

Here’s it overfitting to itself:

Seeing a sound quickly

One problem with training on this video is there isn’t a lot of variation. It is very binary, on or off.

There seems to be little differentiation between deep and shallow in this case.

The lines for clarinet video is similar to the seeing a sound quickly video and works quite well:


Activity of black is a natural choice, scratched film seems like a good input.

A wider range of training inputs leads to a more robust output, but a tighter higher accuracy brain seems to produce sonically interesting results.

In general everything sounds pretty similar so I am not impressed by the results of this experiment.

The difference between shallow and deep networks is not really that sonically evident.

A common interpretation seems to be that white is loud and black is not. This could be a problem.


This experiment sounds interesting and horrible at the same time! What can be done to improve the sound?

  • Every training set should include 30 seconds or so of black and white screens with silent audio. That way the system would keep black screens quiet, as we expect them to be.

  • Use history, this is a very stateless approach. An RNN might be a great idea.

  • Is PCM the most efficient representation? If I want to produce sonically interesting output, perhaps I might do better in frequency space (STFT) or a vocoder space.

  • Color and past frames were not included. Furthermore, no analysis of the frames was used either. Perhaps an eigenfaces style of operation would work, whereby the bitmap’s eigenvectors / PCA components are used.
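The first suggestion in the list above is easy to sketch. Assuming the training data is a pair of arrays, one row per frame (4096 pixel values) and one row per audio block (735 samples), the hypothetical helper below appends black and white frames paired with silence. The array layout and the function name are assumptions, not the format that pickler.py writes.

```python
import numpy as np

def add_silent_calibration(frames, audio, seconds=30, fps=30):
    """Append black and white frames, each paired with silent audio.

    frames: (n, n_pixels) array with values in [0, 1].
    audio:  (n, samples_per_frame) array of PCM floats.
    Half of the added frames are black (0.0) and half are white (1.0).
    """
    n_extra = seconds * fps
    n_black = n_extra // 2
    black = np.zeros((n_black, frames.shape[1]))
    white = np.ones((n_extra - n_black, frames.shape[1]))
    silence = np.zeros((n_extra, audio.shape[1]))
    return np.vstack([frames, black, white]), np.vstack([audio, silence])
```

Pairing white frames with silence as well is meant to counter the tendency, noted earlier, for the brains to treat white as loud.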


Briefly, I’ll conclude: without the context of prior frames or of the sound that was already output, the quality of the audio output is pretty low. Either we need way more data for training, which I don’t want to spend time on, or we need to add more context to the frame. There’s an inherent independence assumption: 1 frame of video induces 1 frame of audio. But consider that 1 guitar pluck induces an audible signal that lasts a lot longer than the pluck itself, so there’s a slight problem.

Yet what this shows is that you can produce associations even if it is slightly overfit and they can have some musical value.

We do not recommend generating raw PCM data; intermediate representations might be more appropriate.


John Osborne is a local animator who I have been working with. His animation is great, but I’m not sure he likes any of the sounds I put to them :(

These videos are © John Osborne — assume similar rules to CC-BY-NC-ND

Public domain images from Archive.org

  • 015-loud_barking_and_guitar.1397370485.10527-out.15-loud_barking_and_guitar.wav.audio.mkv
  • 114-tones.1397368837.20976-out.114-tones.wav.audio.mkv
  • 1408297309.27876-out.caffeine.wav.audio.mkv
  • 1408304868.8993-out.caffeine.wav.audio.mkv

Assume Public domain

Abram’s photos and images and video

  • 20secondBorys.mp4
  • belch-kitchen-sample.mp4
  • drone-sample.mp4 — video of the Olm
  • govid3-oldsketch.mkv
  • MVI_9117.mov
  • osborne-seeing-sound.mp4
  • spikey-mouth-loop.mkv
  • VID_20130404_003435.mp4.1384674117.corpus.mkv
  • VID_20130531_132327.mp4.1384676233.corpus.mkv

Assume CC-BY 4.0 Abram Hindle

Public domain from Archive.org

  • alphabet-conspiracy.mp4 — Alphabet Conspiracy
  • Bimbo’s_Initiation_1931.mp4 — Max Fleischer Bimbo’s Initiation 1931

I think these might have some images from Evelyn Berg in it:

  • 1392098818.mkv
  • 1392098671.mkv
  • 1392099724.mkv

Assume CC-BY-NC.

Many ideas and inspiration are from Gregory Burlet:


Burlet G, Hindle A. (2015) Isolated instrument transcription using a deep belief network. PeerJ PrePrints 3:e1455 https://dx.doi.org/10.7287/peerj.preprints.1193v1

How to use this stuff

This repository is for support files and examples of applying mostly deep multilayered perceptrons (deep belief networks) to the task of converting video frames to PCM.

Training is simple: run pickler.py on a video to generate video.pkl and audio.pkl. Then run theanet.py to learn a brain between the two. This can take more than a week for 30 minutes of video. Once a theanet.py.net.pkl is produced you can run render.sh to produce a rendered version of a video.

There are 2 render modes, raw and granular synthesis. Raw has issues with 30 Hz harmonics (30 fps) and granular synthesis isn’t always on time.

Current observations: the audio produced is high frequency, but the length of the output is not enough to produce continuous low frequency tones anyways. A lot of the output is noise.
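The arithmetic behind that observation, as a small sketch (the helper name is hypothetical): a block of 735 samples at 22050 Hz lasts 1/30 of a second, so any tone below 30 Hz cannot complete even one cycle within a single block, and low tones in general get only a handful of cycles.

```python
SAMPLE_RATE = 22050        # Hz
SAMPLES_PER_FRAME = 735    # network output size

def cycles_per_frame(freq_hz, samples=SAMPLES_PER_FRAME, rate=SAMPLE_RATE):
    """How many full periods of a tone fit inside one output block."""
    return freq_hz * samples / rate

block_seconds = SAMPLES_PER_FRAME / SAMPLE_RATE           # duration of one block
lowest_full_cycle_hz = SAMPLE_RATE / SAMPLES_PER_FRAME    # one period per block
nyquist_hz = SAMPLE_RATE / 2                               # highest representable tone
```

Because each block is predicted independently, nothing keeps the phase of a low tone consistent from one block to the next, which fits the noisy, high-frequency character of the output.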

Latest source code should be here:

Assume GPL3.0 license on all source code.

Assume GPL3.0 on all DBN pickles.

Rendered Examples

Youtube and Content ID

I’ve been having a lot of issues dealing with erroneous and egregious copyright claims against my own videos that I upload to youtube!

Public Domain

From Archive.org I got a public domain copy of Battleship Potemkin. It is a terrible rip of the movie.

Bronenosets Potyomkin (Battleship Potemkin) (1925) https://archive.org/details/PhantasmagoriaTheater-BattleshipPotemkin1925396

Regardless, MOSFILMS and the creators of Potemkin did not renew the movie’s copyright in the US. This means that even though it was released in 1925, it is in the public domain due to inaction on the part of MOSFILMS et al. Furthermore, relations between Russia and the United States at the time were questionable.

In Canada this movie is far past its PUBLIC DOMAIN due date and it is now in the public domain. Sergei Mikhailovich Eisenstein died in 1948 so 1998 was 50 years after his death. Meaning even by the strictest standard of public domain in Canada (50 years after death) the work is public domain. If it’s a performance it was performed in 1925 so 50 years after performances is even earlier.

See: http://en.wikipedia.org/wiki/Copyright_law_of_Canada#Public_domain

But what has happened? Many content holders on Youtube have laid claim to my posting of a soundtracked version of Battleship Potemkin.

The soundtrack was automatically generated by software, but the video was provided by http://archive.org. It is a public domain copy of the public domain movie. It isn’t a copyrighted copy like those produced by Kino films et al. who remastered the images effectively making a new work.

Nonetheless, Youtube’s content ID has no taste for subtlety and various organizations have made claims against these videos. This affects my youtube account because it puts me into the proverbial dog house, where I cannot upload longer videos and my account is limited in other ways. This stain on my account also threatens the content under claim.

VTR claims they own the Odessa Steps sequence in the movie, they don’t:


I am having a hard time figuring out who VTR is, but I think they have a music video in their collection that has the Odessa Steps scene in it. Otherwise they would’ve claimed the whole movie and not just that sequence. For that reason alone it is quite apparent that VTR has no claim to my version of the movie.

So, just to emphasize the large number of claims I have been dealing with, here is an HTML “screenshot” of 2 of my Potemkin videos:

Battleship Potemkin Soundtracked Video Texture 2: Strings

Your video may include content that is owned by a third party.

To watch the matched content please play the video on the right. The video will play from the point where the matched content was identified.

Your video is available and playable.

Here are the details:

  • Visual content at 3:00, administered by: Mosfilm. Claim released.

  • Visual content at 1:04, administered by: egeda. Claim released.

  • “Час истины-Час истины – Первая русская революция”, visual content at 48:50, administered by: Mediagates TV. Claim released.

  • “Zoom- Start: Encouraçado Potemkin”, visual content at 54:28, administered by: Fundação Padre Anchieta (TV Cultura). Claim released.

  • Visual content at 54:52, administered by: VTR Broadcast. Your dispute is awaiting response by 5/14/14.

To learn more about how claims impact your videos click here.


Battleship Potemkin Metropolis Automatically Soundtracked

Here are the details:

  • Visual content at 3:00, administered by: Mosfilm. Claim released.

  • “Час истины-Час истины – Первая русская революция”, visual content at 48:50, administered by: Mediagates TV. Claim released.

  • Visual content at 54:52, administered by: VTR Broadcast. Your appeal is awaiting response by 5/14/14.

To learn more about how claims impact your videos click here.

Another Fight Over Public Domain Music

http://www.opengoldbergvariations.org/ The Open Goldberg Variations are a very nicely produced and 100% public domain rendition of Bach’s Goldberg Variations. It is beautiful and I enjoy using this music as source material, I could never play anything like it so I am grateful for its availability.

https://www.youtube.com/watch?v=Ua-PcbC5xMI This exceptionally sorry demo video used a sample of the music at https://www.youtube.com/watch?v=Ua-PcbC5xMI&t=3m00s . I also use the music as input to a granular synthesis engine. I had numerous groups claiming they owned my music. Why? Because Bach is played a lot, and a lot of copyright holders will have a rendition of the Goldberg Variations in their catalogue.

What was even stranger was that granular synthesis parts were being picked up by copyright holders such as CD-Baby the popular indie netlabel that lets you sell your music on CD online. I guess my granular synthesis output of my instrument sounded like someone else’s granular synthesis (very likely). Regardless they did not own my work.

This potentially jeopardized my acceptance into NIME 2014 (yeah!) as Youtube could’ve taken down my demo video.

Public Domain Summary

Thus you can see that many different rights holders are claiming this film, but often not the whole film or just segments that matched their catalogue. I wish Youtube’s content ID could understand that just because we use the same public domain work does not mean that either party has ownership over the content.

This is not a new problem

The gaming community has a beef with youtube ContentID but I think their claims are different than mine. My claims are MY content or PUBLIC DOMAIN content, and my ability to freely publish MY content and PUBLIC DOMAIN materials. The gamers are in a different world where they do not own the game assets of the games they are recording:

Sony took down the Blender Community’s Sintel from youtube http://boingboing.net/2014/04/06/sony-issues-fraudulent-takedow.html#more-296460

http://Waxy.org ’s Andy Baio or Nicole Wilke discusses experiences people have had with illegitimate claims: http://waxy.org/2012/03/youtube_bypasses_the_dmca/

So it is concerning and it definitely harms my youtube experience. I’m especially disturbed by the claims against my own music that aren’t sampling from copyrighted sources.

You’re a Computer Scientist What Do You Think?

I think what this kind of interaction highlights is that machine learning and media fingerprinting are not enough. We need Youtube and future crowd content providers to develop systems on top of these systems.

We need to recognize that the law is not the same.

We need to recognize that the public domain does not exist everywhere, but that should not penalize those who have the rights to the commons. If per-country censorship is necessary to protect me from litigation as a user, then so be it. Perhaps some of the claims against my work are legitimate in some venues. I’d rather my work be censored in those venues than face litigation. This is an ethical software issue: the user is not against you, and you should help protect the user. Do not assume malice in the face of international venues and laws.

We need systems to learn cases where contentID matches but the uses are


In the case of public domain works both derivative makers have a right to the work. The provenance of the work really matters, and thus our content ID systems need to be aware of these scenarios. In the case of youtube it could be as simple as indexing all of the public domain movies in the USA to avoid false claims on public domain sources. Rules could be learnt depending on the provenance and context.
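As a toy sketch of what such an index could look like, the snippet below suppresses a fingerprint match when the matched work is listed as public domain in the viewer’s country. Everything here is hypothetical: the `Match` type, the index contents, the work identifier, and the function name are made up for illustration and do not describe how Youtube’s content ID works.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    claimant: str          # who is asserting ownership
    matched_work: str      # identifier of the underlying work that was matched

# Hypothetical index: work identifier -> countries where it is public domain.
PUBLIC_DOMAIN_INDEX = {
    "battleship-potemkin-1925": {"US", "CA"},
}

def should_raise_claim(match, viewer_country, index=PUBLIC_DOMAIN_INDEX):
    """Return False when the matched work is public domain for this viewer."""
    return viewer_country not in index.get(match.matched_work, set())

claim = Match(claimant="Some Broadcaster", matched_work="battleship-potemkin-1925")
```

Keying the decision on the country also covers the per-country case discussed above: the same match can be suppressed in one venue and still acted on in another.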

We need provenance aware contentID.

The world is a messy place and the legal rules are messy as well, contentID needs to know about the history of content, we cannot just trust a single rightsholder. They have been shown wrong many times in the past. Furthermore with public domain, a large site like Youtube should be able to determine the flow or the provenance of some of the material and aid in determining what is fair reuse and what is not.

Media needs more metadata about its provenance

Sadly our media files tend to lose metadata like the Heartbleed bug makes SSL services lose secrets. We need to enable better tracking and encoding of the provenance of entities for all those involved. Imagine if you make a remix and a system can tell you all the sources you used and manage that for you? That’s especially important in the Open Source world where the main currency is attribution.


I could go on, but just because you own a music video that uses the Odessa Steps doesn’t mean you own my content. Furthermore, Youtube should be far more careful with content ID and directly address the Public Domain. We should be free to use the commons as we please, unmolested by content ID claims. Youtube contentID should know better, and Google has enough engineers to make it better.

The future of IP and provenance in media is interesting and I think there are a lot of avenues for researchers willing to try the legal side, but alas we also need the conferences to recognize the importance of IP and licensing.

Updates – 2013 in Review

How are you doing? Long time no see. I have been as busy as a bee.

What have I been up to?

An Attempt at a Video Lecture

Next week I’ll be at ICSM presenting a neat paper I worked on with Christian Bird, Thomas Zimmermann, and Nachiappan Nagappan: Relating Requirements to Implementation via Topic Analysis, in Proceedings of the 2012 International Conference on Software Maintenance (ICSM 2012), IEEE, 25 September 2012.

But that means I’ll be away! I’m currently teaching CMPUT301: Intro to Software Engineering. So what shall I do? I have guest speakers coming in but neither has significant experience with programming for Android, so I felt I had to prep the students before the assignment was due. The assignment was an Android app.

I decided to record a few lectures to cover the missing material that I could not get to yet.

I used:

To record audio and video I used a desktop recorder, this means whatever is on my screen the audience will see. This means close your email and clean up your desktop!

What I found was that recordMyDesktop was very finicky. There were only a few settings I could use to ensure proper synchronization of frames. I had to use 22050 Hz audio, I had to not take a screenshot per frame, and I had to set it to encode on the fly. If I didn’t use these settings I tended to produce videos that had wildly out of sync audio and video.

Once that was solved, I hunkered down, opened my class notes and started talking. Ubuntu has an accessibility option to highlight your cursor if you press control, I enabled that and found it useful to draw attention to the cursor.

Sound quality is a big issue. If you want to do this seriously you want a good headset in a quiet room and preferably something to filter the noise. One might consider manually noise filtering the signal later and compressing it (eq + amplification). I filmed these videos outside so you can hear cars and trucks and airplanes.

Finding a space was surprisingly difficult on the go. You need a spot where you can use your projection voice, because you’re presenting and you need to project. Furthermore, the audience can’t see you, so if you do need to make gestures I recommend a webcam program like cheese, which allows you to just see yourself; the desktop recorder can take care of recording your webcam program.

In the end I produced some videos about Android and Sequence Diagrams:

Whether or not this is the future, I’m not sure, but because of the knowledge that lectures can be replaced by video I’ve been trying to make CMPUT301 more interactive by adding quizzes, in-class group work, discussions and other exercises. My wife suggests that learning isn’t just passive and sometimes we have to get up and do something.


I recently gave 2 performances of an interactive music instrument: SWARMED.

This instrument allows the audience to play it with their own cellphones. They just need to sign up to my wifi network and they get redirected to a web instrument that plays out on the PA!

I will be improving the instrument greatly in preparation for my big noise set at the Victoria Noise Festival on August 23-24, 2012!

The Works Performance

Example instrument architecture