Deepfaking Elon Musk
Here is a short demonstration of how easy it is to manipulate video: I generated this clip in less than five minutes from just a single photo. Yes, I have been experimenting with deepfakes and other synthetic media.
I am not a developer, so I started out just to see how far I could get. I went through many tutorials and found a Google Colaboratory notebook based on the First Order Motion Model for Image Animation on GitHub (Lathuilière, Ricci, Sebe, Siarohin & Tulyakov, 2019). This made it possible for me to animate a still image of Elon Musk with a driving video of myself.
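For anyone who wants to try the same thing outside the Colab notebook, the repository also ships a command-line demo script. Below is a sketch of how I understand that invocation; the file names (`musk.png`, `me_talking.mp4`, `vox-cpk.pth.tar`) are my own examples, and the flags come from the repository's README, so check them against the version you clone.

```python
# Sketch: build the demo.py command used by the first-order-model repo
# (AliaksandrSiarohin/first-order-model) to animate a still image with
# the motion of a driving video. File names are illustrative.
import shlex

def fomm_command(source_image, driving_video, checkpoint,
                 config="config/vox-256.yaml"):
    """Assemble the demo.py invocation for the First Order Motion Model."""
    args = [
        "python", "demo.py",
        "--config", config,
        "--source_image", source_image,    # the still photo to animate
        "--driving_video", driving_video,  # e.g. a webcam clip of yourself
        "--checkpoint", checkpoint,        # pretrained vox weights
        "--relative",                      # use relative keypoint motion
        "--adapt_scale",                   # adapt motion scale to the source
    ]
    return shlex.join(args)

print(fomm_command("musk.png", "me_talking.mp4", "vox-cpk.pth.tar"))
```

The `--relative` flag matters in practice: it transfers the *change* in your facial pose rather than your absolute pose, which keeps the result closer to the source photo.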
However, the problem was that this approach only animates the image, not the audio, and to make his ‘fake presentation’ more believable I would need more than just a face close-up.
Analysing source videos and audio
I analysed many presentation videos of Elon Musk before starting a more significant deepfake project. I transcribed some of his talks to master his way of speaking, and I noticed he uses many ‘uhm’s and doubled or stuck words, especially when he is talking about a complicated topic like Neuralink. I wrote a new speech that includes all of these struggles, but this time about a speculative topic. I also paid attention to his posture, his clothing, the background and the slides he uses in his presentations.
Generating new audio from voice input
By the time I found a way to generate new audio from just 5 seconds of Elon Musk’s voice, I had already solved the audio problem a different way. A friend of mine, whose voice I think is closest to Elon’s, provided me with more than 30 minutes of training data for the program Descript. With this sample, I generated new lines of text dubbed in my friend’s voice. I couldn’t do this with Elon Musk’s own voice, because Descript requires the speaker to read a specific training script to give consent for the use of their voice.
The generated voice sounds very similar to my friend’s, but it still sounds robotic. And the more I work with my friend’s voice and watch Elon Musk’s videos, the more different the two start to sound from each other. Next time I want to try Real-Time-Voice-Cloning (Chen et al., 2019), because then I could use Elon Musk’s real voice, even though the results might not be optimal either.
I also looked into “live” deepfaking during video calls. For this, I used Avatarify, again via a Google Colab notebook and GitHub page (not to be confused with the app I mentioned before), which uses the First Order Motion Model as well. I mainly want to use this in my presentation to grab the audience’s attention, because it’s pretty entertaining. I managed to take photos of my teachers and animate them right there in the call, with my webcam input as the driving source!
I ran into another problem when trying to move beyond a close-up and drive an actual video (instead of a still image) with another video. I found a great way to do this using DeepFaceLab 2.0, but it is mainly designed to put one person’s face on someone else’s body, and that body becomes the driving video, including for the facial expressions. In my case, this couldn’t be a video of Elon Musk, because he never says the words in the script I made up. So I had to ask someone to stand in for Elon Musk, dress up like him and dub the lines of generated speech. It was harder than expected: the lines of text weren’t always easy to repeat at the same pace as the generated voice. We ended up with many small video pieces that we had to match manually with the audio before feeding it to the DeepFaceLab generator. I do like how the green-screen video turned out and how well the stand-in matches Elon Musk’s actual posture.
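The manual matching step could partly be scripted. Here is a hedged sketch of the idea, not the exact files or commands from my project: the numbering scheme (`line_01.mp4`, `line_01.wav`, …) is my own illustration. Each re-shot video snippet is paired with its generated audio line by a shared stem, and an ffmpeg command then swaps in the cloned voice.

```python
# Sketch: pair re-shot video snippets with their generated audio lines
# and build ffmpeg commands that replace each snippet's audio track.
# File names and numbering are illustrative assumptions.

def match_pairs(video_names, audio_names):
    """Pair video and audio files that share a stem,
    e.g. line_03.mp4 with line_03.wav."""
    videos = {n.rsplit(".", 1)[0]: n for n in video_names}
    audios = {n.rsplit(".", 1)[0]: n for n in audio_names}
    return [(videos[s], audios[s])
            for s in sorted(videos.keys() & audios.keys())]

def mux_command(video, audio):
    """ffmpeg command that keeps the snippet's video stream but
    replaces its audio with the generated voice line."""
    out = video.rsplit(".", 1)[0] + "_dubbed.mp4"
    # -map 0:v:0 takes video from the snippet, -map 1:a:0 takes the
    # generated voice; -shortest trims to the shorter of the two
    return (f"ffmpeg -i {video} -i {audio} "
            f"-map 0:v:0 -map 1:a:0 -c:v copy -shortest {out}")

for video, audio in match_pairs(["line_01.mp4", "line_02.mp4"],
                                ["line_01.wav", "line_02.wav"]):
    print(mux_command(video, audio))
```

Of course, this only automates the muxing; deciding where each snippet starts and ends, and whether the pacing matches the generated voice, still had to be done by hand.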
To train the final model, I used a version of DeepFaceLab from late 2019 (Chervoniy et al., 2019). In the last year, a lot has changed in both the quality and the workflow. I tried a 2021 version, but it is impossible to run on my laptop’s CPU, and I ran into errors while trying to use it via Google Colaboratory. I wasn’t able to figure those out in time, so I stuck with the earlier version.
Training the model
In the figures above (Figure 11 & Figure 12), you can see how the synthesised images in columns 2, 4 and 5 become more and more accurate over time. In total, I let the training run for about 30 hours on a 2017 MacBook Pro with a dual-core Intel Core i5 processor. The first figure is a screenshot taken about 15 minutes after the start, the second after about 13 hours of training.
References
Chen, Z., Jia, Y., Moreno, I.L., Nguyen, P., Pang, R., Ren, J.S.F., Wang, Q., Weiss, R.J., Wu, Y., & Zhang, Y. (2019). Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. Retrieved from: https://arxiv.org/pdf/1806.04558.pdf
Chervoniy, N., Dpfks, Mr., Facenheim, C.S., Gao, D., Jiang, J., Liu, K., Marangonda, S., Perov, I., RP, L., Umé, C., Wu, P., Zhang, S., Zhang, W., & Zhou, B. (2019, December 28). DeepFaceLab: A simple, flexible and extensible face swapping framework. Retrieved from: https://github.com/iperov/DeepFaceLab/tree/28549dc153948f3dd91bddfdc57ea9cc7cb87d66
Lathuilière, S., Ricci, E., Sebe, N., Siarohin, A. & Tulyakov, S. (2019). First Order Motion Model for Image Animation. Retrieved from: https://aliaksandrsiarohin.github.io/first-order-model-website/