So whilst Bing has been a bit crap recently at making the images I want, I've started working on an SDXL LoRA. Captioning and labelling is taking its time, but I've had some good success with smaller datasets. I'm using SDXL for a few reasons, the main ones being higher resolution before upscaling and a better understanding of captions and prompts.
I'm training a general concept dataset rather than one for specific substances/clothing. The reasoning is that I can supplement with additional LoRAs for clothing/backgrounds if the base model isn't enough, and I can also easily change the substance in the prompt rather than training specifically for each one.
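To give an idea of what I mean, something like this in diffusers. File names, adapter names and weights are just placeholders, not my actual setup:

```python
# Sketch: stack the general concept LoRA with an optional outfit LoRA,
# rather than baking everything into one model. Paths/weights are placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load each LoRA as a named adapter
pipe.load_lora_weights("loras/concept_lora.safetensors", adapter_name="concept")
pipe.load_lora_weights("loras/outfit_lora.safetensors", adapter_name="outfit")

# Blend them; the weights usually take trial and error so one doesn't overpower the other
pipe.set_adapters(["concept", "outfit"], adapter_weights=[0.8, 0.5])

image = pipe("photo of a woman in a red dress, custard poured over her, outdoors",
             num_inference_steps=30).images[0]
image.save("stacked_loras_test.png")
```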
Each of the images uses the same seed and the same prompt, with the substance being the only thing I changed. I haven't trained the model specifically on the substances.
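I generated them in a UI, but the equivalent in diffusers would look roughly like this, with a fixed seed and only the substance word swapped (prompt wording is just an example):

```python
# Sketch: same seed, same prompt, only the substance changes between images.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("loras/concept_lora.safetensors")  # placeholder path

template = "photo of a woman in a white blouse, {substance} poured over her head"
for substance in ["custard", "chocolate sauce", "green slime"]:
    generator = torch.Generator("cuda").manual_seed(1234)  # identical seed each run
    image = pipe(template.format(substance=substance),
                 num_inference_steps=20, generator=generator).images[0]
    image.save(f"compare_{substance.replace(' ', '_')}.png")
```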
These are low-sample-count images for testing; with higher sampling and upscaling I think the model will perform well. The current model is based on 30 images, but I'll look to retrain with nearer 100.
Nice work, looks promising! Love your work generally and glad you've started experimenting with SD. The upside is that the various custom checkpoints out there often give more aesthetically pleasing images than DALL-E 3, and obviously with all the plugins and LoRAs and ControlNets etc. you have a lot more control. The main downside is definitely contextual awareness, which is way weaker than DALL-E. For example, none of the models I've used can cope with the concept of mess pouring down from above onto someone. The only way around it is photo-bashing and ControlNets.
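To show what I mean by the photo-bash + ControlNet route, a very rough sketch: paste a crude "pour from above" composite together in an editor, then let img2img + ControlNet keep the composition while SD cleans it up. Model IDs below are the standard public ones; the strengths are just starting points:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

bash = load_image("photobash.png")          # the rough composite
depth = load_image("photobash_depth.png")   # depth map made from it (e.g. with MiDaS)

image = pipe(
    "custard pouring down from above onto a woman, photo",
    image=bash, control_image=depth,
    strength=0.6, controlnet_conditioning_scale=0.8,
    num_inference_steps=30,
).images[0]
image.save("pour_from_above.png")
```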
I'm not really using SDXL much at the moment because even with a cloud machine running comfyUI on an RTX5000 it's a bit clunky compared to 1.5, and I personally don't notice enough upside.
One thing to remember, and it's not clear either way from the pics, is that the training set needs as much variety as possible in whatever you're going to want to alter in future. For example, if you want to vary the outfits and just keep the mess, the input images need lots of different outfits and you'll need to describe them in the captions.
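As a made-up example of what that caption variety might look like on disk for a kohya-style dataset (one .txt per image; folder name, file names and wording are all hypothetical):

```python
# Each image gets a caption that describes the things you want to stay promptable
# (outfit, setting) while the concept itself stays constant across the set.
import os

os.makedirs("dataset/10_concept", exist_ok=True)
captions = {
    "img_001.txt": "photo of a woman in a blue summer dress and sandals, custard poured over her, kitchen",
    "img_002.txt": "photo of a woman in a black office suit, custard poured over her, outdoors",
    "img_003.txt": "photo of a woman in gym leggings and a vest top, chocolate sauce poured over her, studio",
}
for name, text in captions.items():
    with open(f"dataset/10_concept/{name}", "w") as f:
        f.write(text)
```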
Sadly not all LoRAs play well together so it's gonna be a bit of trial and error!
BTW which checkpoint of SDXL did you use for generating these images with your LoRA?
Thanks for the feedback. I had started playing around with SD last year, but the release of Bing/Designer sort of made that redundant until they upped their filters. The filter is just too strong for the types of images I want to work with. Fortunately, I've got a massive collection of images from Bing/Designer I'm using for my initial dataset. I'm trying to keep outfits/backgrounds etc. as varied as possible to prevent overbaking any individual element.

SDXL allows a bit more flexibility around the prompting, and I'm hoping to stay away from specific trigger words for my main LoRA. If I want anything specific I'll look to train a separate one that I can merge in. That said, it's a lot harder to train a concept LoRA vs a single keyword. The captioning is what takes the time. I'm hoping to train an 80-image model tonight so I'll be able to reduce the number of repeats right down. I'm using GPT-4 Vision to do the initial captioning then tweaking manually, which is saving loads of time.

It's going to be a long process to get where I want, though. Context awareness just isn't there yet compared to Bing, and multi-person is iffy too. Stable Cascade just released, so it'll be interesting to see how that improves things. Ideally, longer term, a prompt interpreter running locally on an LLM like Mixtral would be awesome.
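For anyone curious, the captioning pass is roughly this shape, assuming the standard OpenAI Python client; the model name, prompt wording and paths are just what I'd reach for, not a fixed recipe:

```python
# Sketch: run GPT-4 Vision over the raw images and write a draft caption .txt
# next to each one, then tweak the captions by hand afterwards.
import base64, glob, os
from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY in the environment

PROMPT = ("Describe this image in one line for Stable Diffusion training: "
          "outfit, setting, pose, and what substance is covering the subject.")

for path in glob.glob("dataset/raw/*.png"):
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=120,
    )
    caption = resp.choices[0].message.content.strip()
    with open(os.path.splitext(path)[0] + ".txt", "w") as f:
        f.write(caption)  # draft only; still tweaked manually
```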
Ohh, I'm using SDXL base for the LoRA but JuggernautXL for the image gen.
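In diffusers terms it's basically this: train against SDXL base, then load the LoRA on top of a JuggernautXL checkpoint for generation (file names below are placeholders):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the JuggernautXL checkpoint from a local .safetensors file
pipe = StableDiffusionXLPipeline.from_single_file(
    "models/juggernautXL.safetensors", torch_dtype=torch.float16
).to("cuda")

# Apply the LoRA that was trained against SDXL base
pipe.load_lora_weights("output/concept_lora.safetensors")

image = pipe("photo of a woman in a yellow raincoat, custard poured over her head",
             num_inference_steps=30).images[0]
image.save("juggernaut_test.png")
```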