A look into Veo 3


Talking about AI video generation and where we go from here

A look into next-generation AI video models

Over the weekend I took a look at Veo 3, a new video generation model from Google. My first thought when seeing it in action was: damn, we are fucked as a society. The ease of video creation, the quality of the videos, and the audio generation are some of the most mind-blowing things I have ever seen. To think that just about two years ago we had Will Smith eating spaghetti, which made it seem like usable video generation was years away.

Then a year ago OpenAI's Sora launched, which blew our minds away. The pace of progress makes me think that within the next two years, creators could make full sitcoms or movies entirely with AI.

New in Veo 3

What separates Veo 3 from other AI video generation models (Sora, Stable Video Diffusion, earlier Veo models) is the addition of native sound generation. This is what takes hobby creation into full-fledged, production-grade quality. I experimented with the model and created some amazing videos with surround-sound-quality audio. It's one thing to generate sound in stereo; adding multi-channel support on a first attempt makes me both really, really excited and nervous.

Veo 3 Features

Veo 3 allows users to create videos in a couple of ways (a minimal API sketch follows the list):

  • Text to Video
    • This is pretty simple: the same way we write prompts for images, we can write prompts for videos. The more detailed the prompt, the better the resulting video.
  • Frames to Video
    • Allows users to create videos from images. I gave it a simple image of me and the prompt "Make me run for office," and it generated a pretty high-quality video.
  • Ingredients to Video (Google AI Ultra subscribers only)
    • Modular control: you generate individual elements, called ingredients, and then combine them into a scene.
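
For the curious, here is roughly what Text to Video looks like through Google's `google-genai` Python SDK. This is a minimal sketch, not official sample code: the model id `veo-3.0-generate-preview` and the exact response fields are assumptions based on the preview, so check the current docs before running it.

```python
# pip install google-genai
import time

from google import genai

# The client reads the GOOGLE_API_KEY environment variable by default.
client = genai.Client()

# Kick off a long-running text-to-video job. The model id here is an
# assumption based on the preview naming; it may change.
operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",
    prompt=(
        "A street musician playing saxophone in the rain, neon reflections "
        "on wet pavement, with the sax and the rainfall audible"
    ),
)

# Generation takes a while, so poll the operation until it completes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download and save the first generated clip.
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("veo3_clip.mp4")
```

As I understand it, Frames to Video works the same way: `generate_videos` also accepts an `image=` argument alongside the prompt, so you can start the clip from a still.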

What's Next and Final Thoughts

We have come a long way from where we were just a couple of years ago. Still, I can already see where progress will come in the next few months and years:

  • Currently, Veo 3 generates clips of around 8 seconds, and Sora around 15 seconds. The obvious next step is to extend this to 1 minute, then 10 minutes, and so on.
  • Next, today's video models have a hard time continuing scenes: extending a clip while keeping the same characters, settings, and style. A system that understands what needs to stay consistent and what needs to change will quickly follow once clip lengths extend.
  • Lastly, open source. The open-source community is growing fast, and the quality of open-source models is approaching the production quality of the big companies' offerings. I believe that by 2050 we will have better models coming from open-source contributions than from companies such as Amazon, Google, etc.