This post is more for the techies here, but check out this paper and associated model on Hugging Face. It's not quite packaged up as a user-facing product yet, but the core technical requirements of generating minute-plus videos while maintaining context, consistency and quality appear to have been solved! The implications should be kind of obvious!
As I always say, a few years from now we will think it was "cute" to make one-minute vids like this as A.I. takes us into the age of 30-minute (and longer) vids.
The impetus for this is surely commercial film and media. You can imagine how much advertising agencies, for example, would love to be able to generate a commercial entirely with AI.
Gloopsuit said: The impetus for this is surely commercial film and media. You can imagine how much advertising agencies, for example, would love to be able to generate a commercial entirely with AI.
They already are. Coca-Cola released an advert early in the year, and most advertising shops now use AI to mock up and storyboard. You also have to consider that current public models aren't necessarily the same models being shown and used by professional studios; public models have been quantised and shrunk down. All that said, the biggest limitation is censorship on public models. It doesn't matter how good they are when they're overly censored. Open-source models are great and all, but they're severely limited by how much memory is available on GPUs. That is going to take a lot longer to catch up and become cheaper.
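To make the memory point concrete, here's a minimal sketch of the usual trick for squeezing an open-weights video model onto a consumer card: load the weights at reduced precision and offload idle modules to system RAM. The model ID and the call arguments are placeholders (the exact kwargs depend on the pipeline class), so treat it as an illustration rather than a recipe.

```python
# Sketch: fitting an open-weights video model into consumer VRAM.
# Model ID and call arguments are illustrative placeholders.
import torch
from diffusers import DiffusionPipeline

MODEL_ID = "some-org/open-video-model"  # hypothetical checkpoint name

pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half the memory of fp32 weights
)

# Keep only the sub-module currently in use on the GPU; everything else
# waits in system RAM. Slower per step, but often the difference between
# "runs on 24GB" and "doesn't run at all".
pipe.enable_model_cpu_offload()

# Video pipelines typically return generated frames on the output object;
# the exact arguments (frame count, steps, resolution) vary by model.
result = pipe(prompt="waves rolling onto a beach at sunset")
```

The heavier options (4-bit/8-bit quantised checkpoints, sequential offload, tiled decoding) trade even more speed for memory, which is exactly the squeeze being described.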
Yeah, this is the main point: the big foundation models, served by public APIs, are fine for most advertising tasks.
I disagree with your timeline a bit though. Distillation has been persistently underestimated, and right now the best foundation video models are only marginally better than the best open-weights models. We're also not limited by people's home GPUs when cloud services like Runpod are considered.
dionysus said: Yeah, this is the main point: the big foundation models, served by public APIs, are fine for most advertising tasks.
I disagree with your timeline a bit though. Distillation has been persistently underestimated, and right now the best foundation video models are only marginally better than the best open-weights models. We're also not limited by people's home GPUs when cloud services like Runpod are considered.
Google is flying ahead due to their investment in TPUs; it's the reason they can serve Veo so cheaply and start adding speech and audio. Wan2.2 is phenomenal for what it is, but I'd argue it's at least a generation behind Veo 3 and severely bottlenecked by resources. You're right though, you could throw H100s at it, but who's going to do that? I'm lucky enough to have an RTX Pro 6000 with 96GB, and it's still time-consuming to train Qwen and Wan2.2 at any decent precision. Nvidia could easily have released consumer cards with 48GB+ but chose not to rather than cannibalize their pro cards. Without competition, that's really going to slow open-weight models down. The number of LoRAs and finetunes of the larger Qwen, HiDream and WAN models is significantly lower than for SDXL and even Flux, because people can't afford to train them.
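Some rough arithmetic on why the VRAM ceiling bites so hard for training. The parameter count, optimizer cost and LoRA trainable fraction below are all assumptions for illustration, and activations, latents and the text encoder come on top of these numbers.

```python
# Back-of-envelope VRAM estimate: full fine-tune vs LoRA on a ~14B model.
# All figures are rough assumptions, not measurements.

params_billion = 14        # assumed parameter count for a Wan2.2-class transformer
bytes_weight_bf16 = 2      # bf16 weights
bytes_grad_bf16 = 2        # bf16 gradients
bytes_adam_state = 8       # two fp32 moment estimates per trained parameter

# Full fine-tune: every parameter needs a gradient and optimizer state.
full_ft_gb = params_billion * (bytes_weight_bf16 + bytes_grad_bf16 + bytes_adam_state)
print(f"Full fine-tune (weights + grads + optimizer): ~{full_ft_gb} GB")   # ~168 GB

# LoRA: base weights are frozen and only small low-rank adapters are trained,
# so gradient/optimizer memory shrinks to a fraction of the above.
lora_fraction = 0.01       # assume ~1% of parameters are trainable adapters
lora_gb = params_billion * bytes_weight_bf16 \
        + params_billion * lora_fraction * (bytes_grad_bf16 + bytes_adam_state)
print(f"LoRA fine-tune (same terms): ~{lora_gb:.0f} GB")                   # ~29 GB
```

Even the LoRA case leaves little headroom on a 24GB card once activations are counted, which is why a 96GB card, or the 48GB consumer cards that never shipped, makes such a difference to who can train these at all.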