I trained a custom lora for Wan2.1! It's just stupid how good this is already. For a rookie effort I mean. This is for the text-to-video model, and it's trained on **images**, not even video. I just pulled in the dataset I had from my last Flux lora. Feels wonky in some ways, and there are limitations, but mostly, wowzers.
Watch this thread! I will put more gifs on it as I explore different stuff.
Here: gifs from some of the first text-to-video generations I ran, fiddling with different settings. These aren't the most creative scenarios. But, you see what it can do. From just a text prompt! And the hit rate was really high for these.
So if I'm understanding correctly it's using an image based lora for prompt reference, but all the motion is based on the model's existing dataset?
Looking at using Runpod myself - noting that you pay a very affordable rate for GPU access, what would you say the cost per clip or cost per second of video is?
It's a little more complex than that. The image set also trains the visuals along with the caption understanding. With Wan you can also train on videos, which will capture the specific motions as well.
I think it's basically telling Wan "for this prompt, create video frames that look like this image"? ...
You can train Wan on short videos too, but when I tried adding a few of them, it immediately caused video card memory problems that I can't solve. (File size? Frame rate?) I bet this gets easier over time though, with more tutorials or training frameworks.
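One thing I haven't tried yet: shrinking the clips before training, in case it really is resolution or frame rate. A rough sketch of what I mean (untested on my end; the folder names and the 512 px / 16 fps / 3 second limits are just guesses, and it assumes ffmpeg is installed):

```python
# Untested sketch: shrink training clips on the guess that resolution,
# frame rate, or length is what blows up VRAM. Needs ffmpeg on the PATH.
# Folder names and the 512 px / 16 fps / 3 s limits are arbitrary guesses.
import subprocess
from pathlib import Path

SRC = Path("train_videos")        # hypothetical folder of original clips
DST = Path("train_videos_small")  # hypothetical output folder
DST.mkdir(exist_ok=True)

for clip in SRC.glob("*.mp4"):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", str(clip),
        "-t", "3",              # keep at most 3 seconds
        "-vf", "scale=-2:512",  # 512 px tall, width rounded to an even number
        "-r", "16",             # drop to 16 fps
        "-an",                  # strip audio
        str(DST / clip.name),
    ], check=True)
```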
Runpod cost: I actually use Google Colab mostly (I might switch), but broadly I think they're comparable, and it's something like 5 to 10 cents an image? It's waaay cheaper than Kling anyway. I would recommend it. It takes some time to learn ComfyUI too -- I started back in the Stable Diffusion days -- but that has been a worthwhile investment; they've added support for new models as they roll out. I trained my Flux loras in a Comfy template, for example.
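If you want to sanity-check that against whatever Runpod charges, the math is just the hourly GPU rate times how long one clip takes to generate. The numbers below are placeholders, not real prices or timings:

```python
# Back-of-envelope cost per clip: hourly GPU rate x generation time.
# All three numbers are placeholders; plug in your own.
gpu_rate_per_hour = 0.60  # hypothetical $/hour for a rented GPU
minutes_per_clip = 6      # hypothetical generation time for one clip
clip_seconds = 5          # hypothetical length of the finished clip

cost_per_clip = gpu_rate_per_hour * minutes_per_clip / 60
cost_per_video_second = cost_per_clip / clip_seconds

print(f"~${cost_per_clip:.2f} per clip")            # ~$0.06 per clip
print(f"~${cost_per_video_second:.3f} per second")  # ~$0.012 per second of video
```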
It's a little more complex than that. The image set also trains the visuals along with the caption understanding. With Wan you can also train on videos, which will capture the specific motions as well.
That's kind of what I meant, but you put it better - I was thinking of the image data and caption data as acting like a 'super prompt image': the equivalent of an image prompt in image-to-video, but with much more data for the model to work with (I know this is just thinking by analogy and probably not what's actually happening). Wan2.1 is looking very, very promising for what I make, but I'm wary of it being a massive time-sink...
wammypinupart said: Broadly I think they're comparable, and it's something like 5 to 10 cents an image? It's waaay cheaper than Kling anyway. I would recommend it.
That is cheaper, cheers. Certainly worth exploring
Wan2.1 is looking very, very promising for what I make, but I'm wary of it being a massive time-sink...
Unfortunately, you can't avoid it being a time-sink. On the plus side, you're investing in learning that will help with many things in the future; what we learn from image training translates to video, and so on. Just start small and simple and build on it. It's vastly more rewarding than relying on commercial models, where you're constrained by the whims of the company.
Gif share number two. Buckle up!
Wan has two different models for text-to-video (t2v) and image-to-video (i2v). My lora is for the t2v version. But I read that you can use a t2v wan lora on i2v generations too, so of course I had to try that out.
Because it would be cool if I could start with a single clean image, and turn that into a slime video. Like kind of the dream AI generator. But there's no way that this will actually work, right? With this jenky first-try lora, and this weird type of niche video I want to make? No way. Right?
... Right?
wammypinupart said: Gif share number two. Buckle up!
Wan has two different models for text-to-video (t2v) and image-to-video (i2v). My lora is for the t2v version. But I read that you can use a t2v wan lora on i2v generations too, so of course I had to try that out.
Because it would be cool if I could start with a single clean image, and turn that into a slime video. Like kind of the dream AI generator. But there's no way that this will actually work, right? With this jenky first-try lora, and this weird type of niche video I want to make? No way. Right?
... Right?
Really impressive stuff! Seems like the animation is slightly inferior to the T2V, but with the obvious bonus of the "holy grail" workflow of generating clean images and then just applying the mess in the vid gen. Any tips about the curation of the image dataset to include in the LoRA? Does it need examples of different stages of a sliming, and does that need to be explicitly captioned? Or is it just a variety of high-quality slime images? Managed to get it going on Runpod, but haven't actually trained anything yet, because now I need to go back and actually collect the training data.
5-ht said: Any tips about the curation of the image dataset to include in the LoRA? Does it need examples of different stages of a sliming, and does that need to be explicitly captioned?
Your guess is as good as mine; that's what I'd try if I were doing another one. I didn't do anything that fancy this time around. Maybe there are some tips out there on the internet, but it's all pretty new, and the more typical lora use case is a style or a character.
Looking forward to seeing what you come up with! Good luck!
5-ht said: Any tips about the curation of the image dataset to include in the LoRA? Does it need examples of different stages of a sliming, and does that need to be explicitly captioned? Or is it just a variety of high-quality slime images?
The same principles apply as for training an image model: large, diverse image sets are best. High-quality images, a variety of subjects, backgrounds, and mess. Captions should be detailed descriptions of the images. Keywords do work, but they won't create the most flexible of models.
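If it helps, here's a quick way to sanity-check the captions before training, assuming the usual layout of a .txt caption file next to each image (the folder name and the 20-word cutoff are arbitrary):

```python
# Rough dataset check, assuming the common "image + same-name .txt caption"
# layout. The folder name and the 20-word threshold are arbitrary choices.
from pathlib import Path

DATASET = Path("dataset")  # hypothetical training folder
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

for img in sorted(DATASET.iterdir()):
    if img.suffix.lower() not in IMAGE_EXTS:
        continue
    caption = img.with_suffix(".txt")
    if not caption.exists():
        print(f"{img.name}: missing caption")
        continue
    words = caption.read_text(encoding="utf-8").split()
    if len(words) < 20:
        print(f"{img.name}: caption looks keyword-style ({len(words)} words)")
```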
Another image-to-video batch. Simple scenes again, just stretching it out a tiny bit: can it do different colors? (I only have green in the training set.)
One last gif share for the thread! I'm off and running after this. I'm still not entirely sure what to do with this format! ... but there will be more.
Trying out some slightly more elaborate scenes to see what happens, back to text-to-video except for a couple I extended (i2v on the last frame) and combined. It continues to be surprisingly coherent. The success rate for getting usable results is lower the more complicated the prompt is, but it's still pretty high; I'm happy with about a third to a half of what comes back. I have to learn to use a video editor now, lol.
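Though for the two mechanical bits, grabbing the last frame to feed i2v and gluing the clips together, plain ffmpeg might be enough. Untested sketch; the filenames are placeholders, and the concat step assumes both clips share the same codec and resolution:

```python
# Sketch of the two fiddly steps: grab the last frame of a clip to feed i2v,
# then join the original clip and its i2v continuation. Needs ffmpeg on PATH.
# Filenames are placeholders; "-c copy" only works if the clips match in
# codec and resolution, otherwise re-encode instead.
import subprocess

# 1. Last frame of clip_a.mp4 -> last_frame.png (seek ~0.1 s before the end).
subprocess.run([
    "ffmpeg", "-y", "-sseof", "-0.1", "-i", "clip_a.mp4",
    "-frames:v", "1", "last_frame.png",
], check=True)

# 2. Concatenate clip_a.mp4 + clip_b.mp4 (the i2v continuation) losslessly.
with open("clips.txt", "w") as f:
    f.write("file 'clip_a.mp4'\nfile 'clip_b.mp4'\n")
subprocess.run([
    "ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "clips.txt",
    "-c", "copy", "combined.mp4",
], check=True)
```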
wammypinupart said: Another image-to-video batch. Simple scenes again, just stretching it out a tiny bit: can it do different colors? (I only have green in the training set.)
You can't go wrong with black, or blue, but... old school green is so much fun.