After some more trial and error with SDXL Turbo LoRA training, I've got a new one ready to go and it's looking pretty good. Here are some initial example images, straight out of Stable Diffusion with no additional ControlNets or other LoRAs involved.
Let me know how you think it's looking. I added a lot of images, as the Turbo model churns them out in a few seconds each and I just liked so many of them.
**Info and disclaimer regarding my LoRA files and images to stay ethical and compliant with UMD rules**
1. All training for my messy LoRA files is done using photos which I produced and hold the copyright and usage rights for. No material belonging to anyone else was used in the training process.
2. The LoRAs have been trained for style and concept keywords only, not to replicate any person's image or likeness. However, it may still be possible for someone to instruct the AI model in such a way as to reproduce a likeness of the original subject. For this reason I do not share my LoRA files. Any images I upload do not resemble the original subjects used for training.
3. Regularization images for the class prompt are all AI generated; they do not resemble real people and are not copyrighted images.
4. I do not use any additional third-party LoRA files that produce likenesses of real people when creating AI images that are shared online.
5. All generated subjects are adults and I do not create images that depict any non-consensual or illegal acts.
That is looking great, massive improvement over the previous version.
I'm working on an update for my SDXL LoRA. I'm curious how you've approached your captioning. Have you gone down the route of WD14/Booru-style captions with different trigger words assigned via the folder structure? I've had some good success with that method but have also been testing more descriptive captioning for SDXL. I've been using GPT Vision as a first pass over my image set before fine-tuning the captions manually. I've heard CogVLM and LLaVA-13B are also very good at long-form captions and I was going to give them a test this week. GPT is good but obviously runs into errors with NSFW content.
The reason I was looking at the descriptive approach is that a lot of work in finetuned models seems to be moving towards it, and the rumour is that SD will move to this approach as well, since the LAION data set is so poorly captioned.
My first-pass Vision captions are around 70-100 words, similar to:
Medium-shot of a woman with custard running over her face, dripping from her dark, wet-looking hair styled in a messy way, partial strands adhering to her face. The custard is gooey and thick, coating her shoulders and chest, with noticeable dripping patterns in a yellowish-beige hue showing textures resembling lumps and streaks. She is wearing a white lace dress with parts obscured by the gooey custard. Her pose displays a slight tilt of her head, with her eyes closed and one hand visible, dipped in the custard. Background includes a plain blue fabric with a soft wrinkling pattern.
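For reference, the first-pass loop itself is nothing fancy; roughly something like the sketch below, where BLIP just stands in for whichever captioner you actually use (GPT Vision, CogVLM, LLaVA and so on all slot into the same kind of loop), and the paths are placeholders, not my real setup:

```python
# Sketch of a first-pass auto-captioning loop over a training folder.
# BLIP is used only because it runs locally via the standard transformers
# pipeline; the captioner you actually use would be swapped in here.
from pathlib import Path
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

image_dir = Path("dataset/images")  # placeholder path
for image_path in sorted(image_dir.glob("*.png")):
    result = captioner(str(image_path))
    caption = result[0]["generated_text"]
    # write the caption to a .txt file next to the image (kohya-style pairing)
    image_path.with_suffix(".txt").write_text(caption, encoding="utf-8")
    print(f"{image_path.name}: {caption}")
```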
For the captioning I don't go anywhere near that level of detail. My understanding with LoRA training was that the caption should describe and identify everything that is *not* what you are trying to train, so the training focuses on the one thing in the caption the model doesn't already know. For example, with these I was training the words 'gunge' and 'gunged', neither of which the SDXL models know very well. So a caption might look like:
"Gunged woman wearing a blue dress, black hair, sat on chair, one arm raised above head, covered in green gunge"
I did use a Moondream-based auto-captioner to write the basics, then ran a mass-replacement tool over the text files to swap out the words it chose itself, like 'liquid', 'substance' or 'goo'. After that it's a little human review to check they are all good.
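The replacement pass itself is trivial, something along these lines (again, the path and word list are only an example of the idea, not my exact tool):

```python
# Sketch of the mass find/replace pass described above: swap the generic words
# the auto-captioner picks for the trigger word across every caption file.
import re
from pathlib import Path

caption_dir = Path("dataset/images")  # placeholder path
replacements = {"liquid": "gunge", "substance": "gunge", "goo": "gunge",
                "slime": "gunge"}

pattern = re.compile(r"\b(" + "|".join(replacements) + r")\b", re.IGNORECASE)

for txt in sorted(caption_dir.glob("*.txt")):
    original = txt.read_text(encoding="utf-8")
    updated = pattern.sub(lambda m: replacements[m.group(1).lower()], original)
    if updated != original:
        txt.write_text(updated, encoding="utf-8")
        print(f"updated {txt.name}")
```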
I'm also rather fortunate in having WAM images that I shot myself, all with just the model sat on a stool against a plain white background, so there is no distraction in the data and nothing to caption from the background. This training set was 90 images with no repeats but lots of epochs; the best epochs turned out to be between 82 and 92.
Whether this is the best approach or not, I'm not sure. My take from previous attempts at complex captions was that they added confusion to the LoRA and it would end up doing multiple undesirable things when used. I guess with more detail you could be more descriptive in your prompts, perhaps, but I'm quite happy that this LoRA seems pretty flexible and applies gunge to most situations I prompt it with.
No, I still think it's a good approach. SDXL was built with the intention of being able to use more verbose/descriptive prompting, but as the base data set was still so poorly captioned, the prompt interpreter still doesn't work as well as it should with those types of prompts. It's better than SD1.5 with descriptive prompting, but still lacking. I took a similar approach with tagging initially and it just works, as you say. I used a slightly larger dataset with no repeats and about 30 epochs.
Juggernaut XL, PonyXL and NightVision finetunes all use descriptive prompting as part of their finetune, and that's before SD actually release their own synthetically captioned model, mind.
Fooocus (https://github.com/lllyasviel/Fooocus) uses GPT-2 to enhance basic prompts, similar to the way DALL-E 3 does, and it's quite interesting to see the difference.
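If anyone wants to poke at the idea outside of Fooocus, a rough sketch of the mechanism looks like this. Fooocus ships its own finetuned GPT-2 expansion checkpoint; vanilla `gpt2` from transformers is used below purely to show how a short prompt gets extended, so don't expect the same output quality:

```python
# Rough illustration of GPT-2-based prompt expansion in the spirit of what
# Fooocus does: feed the short user prompt in and let the language model
# append extra descriptive tokens.
from transformers import pipeline, set_seed

expander = pipeline("text-generation", model="gpt2")
set_seed(42)

prompt = "photo of a woman covered in green gunge, studio lighting"
expanded = expander(prompt, max_new_tokens=40, num_return_sequences=1)
print(expanded[0]["generated_text"])
```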
Overall, I'm quite impressed with the progress that's happening with SD. I think with better prompt interpretation and model captioning it'll close the gap to DALL-E 3 quite quickly. We were spoilt when DALL-E 3 was released, initially mostly unfiltered, but those days are gone and I can't see them easing up on their filtering again.
I'm using the same tool but have only tried it with Moondream so far and haven't tried CogVLM yet. I'll give it a go when I get some time.
I've been away a bit and probably missed the DALL-E 3 stuff when it was posted. Are there any good examples to have a look at? I've avoided the big online tools for WAM stuff, as I've never liked the results people shared from them; they always look a bit odd, unspecific and very "AI" (Bing included). I definitely think open source is the answer to getting what we actually want, rather than playing around until we make something a bit pseudo-WAM.
For a short while DALL-E 3 was nothing short of amazing, and then they slowly increased the content filtering. If I had known how to prompt-engineer when it first released, I would have been able to generate far better results. The ones attached are a sample of some of the sets I ran before the big filter change at the end of January. I doubt we'll see anything like this again from DALL-E 3 or future OpenAI products, which is a shame, as it quite clearly has the capacity to produce some outstanding images. I can only imagine what was blocked by the image filter.
Ah, I see. Definitely not my sorta thing, but I can kinda see it in some of them. I could see how prompting this sort of image would be quite different from the stuff I like and might require a different approach to the LoRA training.