VSVideo - A little project using Remotion
What is Remotion?
Recently, a friend of mine introduced me to this JavaScript framework called Remotion. It's a framework that allows you to describe a video as a React component and then render it to a video file. You can sorta think of it like manim, but with web design instead of math.
Since you're using React, you're describing the video as a real webpage that can change over time. You have access to all sorts of features like hooks and state to make your video dynamic, so you can basically format any sort of data into a video. They present it as being useful for mass-producing videos or quickly creating template-based videos for social media posts or data visualizations. However, I had a different plan in mind.
What the heck is a SynthV?
SynthV, or Synthesizer V, is a commercial voice synthesizer meant for singing. If you know Vocaloid, it's basically a significantly more modern version of that tool. It uses special voice banks to generate the audio and even uses some cooler techniques like AI denoising and AI-based tuning, which make it sound very realistic if you have a good voice bank.
A friend of mine, Masuna, often produces xer music using voice synthesis software like SynthV, and it has been xer program of choice recently. If you look at xer music videos, you'll notice that they are very simple: some lyrics timed with the music, an audio visualizer, and some cool album art.
If you're a programmer, you can probably tell where this is going. If the video is mostly static, with the only two animated elements being generated either directly from the audio waveform or from a voice that another program generated, what's stopping me from generating the whole video programmatically?
VSVideo
And that's where VSVideo comes in. VSVideo is my project for a program that can programmatically generate music videos for songs that use voice synthesis. The idea is that the voice project file already stores all of the lyrics, meaning they can be extracted and timed against the actual audio, allowing you to display them properly with little to no human intervention. Additional aesthetic elements can also be implemented, such as the audio visualizer or album art.
And so I was set on my plan to automatically generate a music video or die trying.
Figuring out Remotion
My first instinct was to just create a Remotion project and try to learn how it works. I went ahead and ran the npm init command for Remotion:
npm init video
Then, I decided to select the Hello World template, hoping it would contain enough information for me to figure out how to use the system. I was wrong: the template was way more advanced than I thought a "Hello World" template would be. It had a complex SVG animation, different text that would appear at different points in time with keyframes, and a whole lot of stuff I was struggling to understand. So I scrapped it, deleted everything in the HelloWorld component, and wrote my own Hello World.
export const HelloWorld: React.FC = () => {
  // Nothing fancy yet, just a plain paragraph
  return (
    <p>Hello World!</p>
  );
};
Alright, it doesn't look like much and it takes some squinting to see it in the render, but it does exist! The video can be rendered out to an mp4 with npm run build, but it's a lot easier to use the preview command, npm run start, to quickly see changes and experiment. The preview boots up really fast and updates in real time, as well as providing you with a few controls to pause, play, and skip around the video.
Designing VSVideo
Now that I knew more or less how this was gonna go, I started thinking about how I wanted VSVideo to work. I ended up settling on four basic layers, a background image, a foreground image, an audio visualizer and the lyrics. The user could place all of these files in one folder and then configure how exactly they want to lay them out, like their position, scale, etc.
While looking around in the documentation, I ended up settling on the staticFile API in Remotion. It practically allows you to fetch a local file from the filesystem as if it were a remote file. This API is really simple to use: you just pass the name of a file to the staticFile function and it returns a URL you can load the file from.
const image = staticFile("image.png");
console.log(image); // "/image.png", for instance — a URL pointing into the public folder
One of its limitations, though, is that the file needs to be in a specific subfolder of the project. It is named public by default, but you can change it if required. While not ideal, this works well enough for my case and it's very simple to use.
Next, I needed to figure out how this would actually be displayed. Ultimately, your video needs to be described as a "Composition", which is really just a React component. I created a composition called AudiogramComposition (shamelessly copied from the audio visualizer example) and added a couple of components to it.
export const AudiogramComposition = (props: {
  config: UserConfig;
  style?: React.CSSProperties;
}) => {
  const ref = useRef<HTMLDivElement>(null);
  console.log(props.config); // handy for checking the parsed config
  const offset = 0;
  return (
    <div ref={ref} style={props.style}>
      <Background config={props.config} />
      <Foreground config={props.config} />
      <Lyrics config={props.config} />
      <AbsoluteFill>
        <Sequence from={-offset}>
          <Audio src={staticFile(props.config.audio)} />
          <div className="container">
            <div>
              <AudioViz config={props.config} />
            </div>
          </div>
        </Sequence>
      </AbsoluteFill>
    </div>
  );
};
The AbsoluteFill, Sequence, and Audio components all come from Remotion: AbsoluteFill fills the entire video with a div, Sequence lets you specify a time range during which its children are displayed, and Audio plays an audio file in the background. Beyond these three, I created my own components for the background, foreground, and lyrics. Notice how they each take a config prop; that's going to be very important later.
The offset is also something that comes from the audio visualizer template. It's not really used in this case, but it could be useful to implement later.
The 4 important components
The first two components, Background and Foreground, are super simple. Background is just a div with a background image, and Foreground is just an image whose position and scale are set from the config passed to it.
const Background = (props: {config: UserConfig}) => {
  return (
    <div
      style={{
        position: 'absolute',
        top: 0,
        left: 0,
        width: '100%',
        height: '100%',
        backgroundImage: `url(${staticFile(props.config.background_image)})`,
        backgroundSize: 'cover',
        backgroundPosition: 'center',
        backgroundRepeat: 'no-repeat',
      }}
    />
  );
};
const Foreground = (props: {config: UserConfig}) => {
  // Take position from props
  return (
    <Img src={staticFile(props.config.foreground_image)} style={{
      position: 'absolute',
      bottom: `${props.config.foreground_image_offset.from_bottom}`,
      left: `${props.config.foreground_image_offset.from_left}`,
      transform: `scale(${props.config.foreground_image_offset.scale})`,
      backgroundSize: 'cover',
      backgroundPosition: 'center',
      backgroundRepeat: 'no-repeat',
    }} />
  );
};
The AudioViz was very strongly based on the one in the Audio Visualizer sample; not that it's a lot of code either way, as Remotion directly provides you with the tools to perform audio visualization.
const AudioViz = (props: {config: UserConfig}) => {
  const frame = useCurrentFrame();
  const { fps } = useVideoConfig();
  const audioData = useAudioData(staticFile(props.config.audio));
  if (!audioData) {
    return null;
  }
  const allVisualizationValues = visualizeAudio({
    fps,
    frame,
    audioData,
    numberOfSamples: 128, // The higher the number the smoother the curve
  });
  const visualization = allVisualizationValues.slice(8, 64); // These values just look cool, can be changed
  const mirrored = [...visualization.slice(1).reverse(), ...visualization];
  return (
    <div className="audio-viz">
      {mirrored.map((v, i) => {
        return (
          <div
            key={i}
            className="bar"
            style={{
              height: `${props.config.audio_viz_config.height_multiplier_percentage * v}%`,
            }}
          />
        );
      })}
    </div>
  );
};
This is a fair bit more advanced, but as you can see, all you need is the visualizeAudio function provided by @remotion/media-utils, and then you can play around with the values it gives you. The useAudioData hook is also provided by Remotion, and it allows you to fetch the audio data from a file. As you can see, we select the audio file based on a key in the config file.
The last component you see is the Lyrics component. Oh boy, the Lyrics component.
That was by far the single most complex component, so much so that it actually has its own whole file. I'm not gonna go into too much detail about it, but I'll try to explain the basics.
The lyrics component
The first step of being able to actually display the lyrics is to, you know, know what the lyrics are. Now, this was helped a lot by the fact that SynthV's files actually do store the "word" that each note is based on. However, that doesn't work perfectly. Sometimes, the tuner needs to use an arbitrary phoneme in addition to the lyric in order to make the system pronounce a word correctly. This means we need to somehow figure out how many notes actually match our lyric.
The solution I came up with is very simple and, quite frankly, pretty bad. I take the lyric and use a very simple algorithm to calculate the number of syllables it has. I then skip that many notes. The syllable counting function is:
// Rough heuristic: count runs of consecutive vowels as syllables
export const numberOfSyllables = (text: string) => {
  const vowels = ['a', 'e', 'i', 'o', 'u', 'y'];
  let count = 0;
  let lastWasVowel = false;
  for (let i = 0; i < text.length; i++) {
    const isVowel = vowels.indexOf(text[i].toLowerCase()) !== -1;
    if (isVowel && !lastWasVowel) count++;
    lastWasVowel = isVowel;
  }
  return count;
};
Just from looking at it you can tell how many issues it has, but it's just good enough to pass for now.
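To make that concrete, here's roughly what the note-matching pass looks like. The Note type and its field names are simplified stand-ins for the sake of the sketch, not SynthV's actual format:
// Simplified note shape, just enough to show the idea
type Note = { lyric: string; onset: number; duration: number };

// Walk through the notes, taking one "word" per lyric and skipping as many
// notes as the lyric has syllables.
const groupNotesByWord = (notes: Note[]) => {
  const words: { text: string; notes: Note[] }[] = [];
  let i = 0;
  while (i < notes.length) {
    const text = notes[i].lyric;
    const span = Math.max(1, numberOfSyllables(text));
    words.push({ text, notes: notes.slice(i, i + span) });
    i += span;
  }
  return words;
};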
The next step is to place the lyrics in time. Each note object has an onset property as well as a duration property. For our maths it's actually a lot easier to calculate the start and end of each note, so we can just do onset + duration to get the end of the note. We can then use this to calculate the start and end of each word.
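With groups of notes per word, the timing step is just a small map. This is still a sketch using the simplified shapes from above, and the values are still in SynthV's own time units at this point:
// Start of a word = onset of its first note; end = onset + duration of its last
const timeWords = (words: { text: string; notes: Note[] }[]) =>
  words.map((word) => {
    const first = word.notes[0];
    const last = word.notes[word.notes.length - 1];
    return { text: word.text, start: first.onset, end: last.onset + last.duration };
  });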
After this, we need to identify chunks of lyrics that are on the same line. I decided to attempt to detect "sentences" by identifying unusually long pauses in the lyrics. This is done by calculating the average distance between the end of a note and the start of the next note. Any distance greater than the average is considered to be the end of a line.
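As a rough sketch, using the timed words from before, that line-splitting pass looks something like this:
type TimedWord = { text: string; start: number; end: number };

// Start a new line whenever the pause before the next word is longer than
// the average pause between words.
const splitIntoSentences = (words: TimedWord[]) => {
  const gaps = words.slice(1).map((word, i) => word.start - words[i].end);
  const averageGap = gaps.reduce((a, b) => a + b, 0) / Math.max(gaps.length, 1);
  const sentences: TimedWord[][] = [[]];
  words.forEach((word, i) => {
    sentences[sentences.length - 1].push(word);
    const isLast = i === words.length - 1;
    if (!isLast && words[i + 1].start - word.end > averageGap) {
      sentences.push([]);
    }
  });
  return sentences;
};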
Thanks to this, we now have a list of sentences, each with a list of words, each with a start and end time. We can therefore just take the current moment in time to find the matching sentence and display it.
That would work, if only we knew what the actual time units in SynthV files are! Turns out, they're not seconds, milliseconds, beats, or any sort of reasonable unit you might assume. It's a weird-ass number that's 1 / 705600000 of a beat. Yes, that's right, 705600000. I have no idea why it's that number, but it is. So we need to convert the time units to seconds: divide the time by 705600000 to get beats, then divide by the BPM to get minutes, and multiply by 60 to get seconds.
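In code, the conversion is tiny. This sketch assumes a single, constant tempo for the whole song:
// SynthV stores note times in units of 1/705600000 of a beat
const UNITS_PER_BEAT = 705600000;

const synthvTimeToSeconds = (time: number, bpm: number) => {
  const beats = time / UNITS_PER_BEAT;
  return (beats / bpm) * 60; // beats / (beats per minute) = minutes, times 60 = seconds
};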
Now that we know where each sentence starts and ends, we can display them in sync with the audio! This is done by using the useCurrentFrame hook to get the current frame and the useVideoConfig hook to get the FPS. We can then calculate the current time in seconds by dividing the current frame by the FPS, and use that to find the sentence that is currently being sung and display it.
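Stripped of everything else, that lookup boils down to something like the sketch below. sentencesFor is a stand-in for the SynthV parsing described above, assumed to return sentences with start and end times already converted to seconds:
// Not the real Lyrics component, just the core of the time lookup
const CurrentLine = (props: { config: UserConfig }) => {
  const frame = useCurrentFrame();
  const { fps } = useVideoConfig();
  const seconds = frame / fps; // current position in the video

  const sentence = sentencesFor(props.config).find(
    (s) => seconds >= s.start && seconds < s.end
  );
  return <p className="lyrics">{sentence ? sentence.text : ''}</p>;
};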
Finishing touches
To make sure everything works, we have to give the user an easy way to configure the video. I decided to use a JSON file that is accessed through the staticFile API. Actually designing the JSON file was a pretty tedious process, as I had the great idea of creating a JSON schema to go along with it. Long story short, I was eventually able to parse some config into a nice config object that I could use in the video.
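For reference, the keys used by the components above imply a config object along these lines. This is a simplified sketch for illustration rather than the exact schema, and the offset fields could just as well be numbers:
type UserConfig = {
  audio: string;                 // file names are relative to the public folder
  background_image: string;
  foreground_image: string;
  foreground_image_offset: {
    from_bottom: string;         // CSS lengths, e.g. "10%"
    from_left: string;
    scale: number;
  };
  audio_viz_config: {
    height_multiplier_percentage: number;
  };
  // ...plus whatever the Lyrics component needs, e.g. the SynthV project file
};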
However, I had a pretty big issue ahead of me. In order to read a file from staticFile, it needs to be fetched. fetch is an asynchronous function, and I can't use await in a React component. I first considered doing the "correct" thing, which would have been using a React hook, but that didn't work: the config was required by other hooks, and since it wouldn't be fully loaded until a couple of frames in, not all hooks could always be called, which makes React very unhappy. After trying out a bunch of messy solutions, I decided to do the single most stupid thing I could think of.
const synchronousFetch = (url: string) => {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url, false); // false = synchronous; blocks until the response arrives
  xhr.send();
  return xhr.responseText;
};
Yup, synchronous fetch. I'm not proud of it, but it works. I then used this to fetch the config file and parse it into a config object while forcing the thread to wait until it was done. The root component then passes this config object to all the other components as a prop.
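The config loading then boils down to something like this (the file name here is just an example):
// Load and parse the user config up front, before any components render
const config: UserConfig = JSON.parse(
  synchronousFetch(staticFile("config.json"))
);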
What's next?
I'm not planning on stopping here, but it will be a while before I start working on real new features. A couple of ideas I have are:
- Adding support for a MIDI file to be taken as an orchestrator of sorts for dynamic visual effects and animations.
- Adding docker support so that people can easily generate videos on their own machine.
- Adding support for other voice synthesizers, such as Vocaloid or UTAU.
- Creating a GUI for changing the configuration and previewing the video to make customizing the video easier.
Conclusion
Remotion is pretty cool! Fully programmatic video creation is definitely a cool concept, and I'm excited to see what other ways it can be used. Remotion still feels like it needs to iron out some kinks, but it's more than usable in its current state, and the experience of using it was really fun! Less positive, though, was my experience with React: while it's definitely a lot better than raw HTML and has some cool features, it's a little awkward to take advantage of, and I'm not sure I'll be using it for actual websites anytime soon. In the context of Remotion, though, it works pretty well, both conceptually and in practice.
Overall, I think this was a pretty fun project. The source code will be uploaded once I get done with some licensing stuff. I'll add a link to it and to an example video once that's done.
Thanks for reading!