Seeing that these things can illustrate, can write, and can understand what you mean, it'll probably only be a few years before these AIs are spitting out entire custom movies. Well-written and well-shot ones. 3D ones.
Interactive ones, even. We'll be able to immerse ourselves in a virtual WAM scene that we direct in real time just by having a conversation. Insane amounts of gunge. Crazy or impossible scenarios. Your perfect model or companion doing anything you want. It's gonna be wild.
As yet, the various AI imaging tools aren't at the level of understanding concepts. Ask Stable Diffusion to draw mechanical things, like railways, and while it's clear it knows what a railway looks like, it doesn't understand "how it works", so you get tracks that you could never actually run a rail vehicle over, wheels not on rails, that kind of thing.
The story generators appear to, though I'm not convinced it's full understanding yet, so much as knowing how word structures go together without a deeper understanding of what's actually happening within the plot.
But yes, I suspect within ten years that deeper level of understanding will be there, and all of us WAM producers will be redundant, because people will be able to go to an online system, pay a few dollars, and download a Hollywood-quality movie of their ultimate WAM scenario featuring the models, outfits, messes, reactions, etc., exactly to their own personal taste. And it won't matter whether it's extreme full hardcore or a TV family gameshow, other than that the XXX-capable systems will probably cost more and need more detailed ID verification to access.
We are all running on borrowed time. The only question is how soon it actually happens.
The story generators appear to, though I'm not convinced it's full understanding yet, so much as knowing how word structures go together without a deeper understanding of what's actually happening within the plot.
The "story generators", by which I guess you mean transformer models like GPT, also don't understand anything. They can give a good impression of doing so, but they're just making statistical approximations about the most likely next word, in a way that's not so dissimilar to what diffusion models are doing. Often it's very convincing, but they are as convincing when they are telling you something real as when they are just confabulating.
IMO we're a way off the kind of "deep" understanding that you're suggesting here, like where the model can not only parse an image for the objects and events taking place in it, but also intuit the physics involved and sort of "run the tape" of the image forward.
What's probably a lot easier to do is a pipeline like this:
1. Input a prompt into a text model, which creates a "skeleton" of a story. Let's imagine something simple here for now: woman opens door, gets hit by pie.
2. Use the text model to generate a series of keyframe prompts, by training it to write appropriate prompts based on the story skeleton. Here the inherent "understanding" of the concepts is at the text level, "understood" by the transformer model. It can be trained to come up with good keyframe prompts: "woman standing in doorway, looking shocked, she is wearing etc..."
3. The keyframe prompts are rendered into images, which are checked either by a human operator or by an image-description network (basically checking that the images match the prompts - see the sketch after this list).
4. A time-based diffusion model fills in the gaps between the keyframes (think of the current Facebook/Meta model that makes short videos from images).
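The checking in step 3 is the part that's already easy to sketch today. Assuming the Hugging Face transformers library and the openly available CLIP checkpoint (both just stand-ins for whatever image-description network actually gets used, and "keyframe_01.png" is a placeholder filename), scoring how well a rendered keyframe matches its prompt looks roughly like this:

```python
# Rough sketch of the keyframe check in step 3, assuming CLIP via the
# Hugging Face "transformers" library. "keyframe_01.png" stands in for an
# image rendered from one of the generated keyframe prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("keyframe_01.png")
prompts = [
    "woman standing in doorway, looking shocked",
    "woman hit in the face with a cream pie",
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image[0][i] = how strongly the image matches prompts[i];
# a low score for the intended prompt would mean "regenerate this keyframe".
scores = outputs.logits_per_image.softmax(dim=-1)[0]
for prompt, score in zip(prompts, scores):
    print(f"{float(score):.2f}  {prompt}")
```

That only covers the checking step, of course; the keyframe prompt generation and the in-between video diffusion are where the real work is.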
What's quite cool is that this could happen fairly rapidly - an upcoming version of SD is claiming 50Hz image-generation capability.