Rendering a Whipping Scene Using AI
Part 2 - Rendering The Base Image
Last time, we prepared the base image using Daz3D and Blender. Now we will run it through AI to make it photorealistic.
(Disclaimer: Although I use ComfyUI for the flexibility it provides, you can use other popular frontends like A1111 to achieve similar results. Also, note that what I share here isn't the only way to generate decent results, and it may not even be the best method. As such, I recommend using this series to pick up general ideas rather than following the suggested steps to the letter.)
To convert a non-photo source into a photo-like result, it is best to start from a blank image (i.e. txt2img instead of img2img, if you use A1111). The reason is that the AI tries to keep the overall image consistent, which means it will mimic the source's style if the source is fed in directly. We still want to preserve the content of the source image without letting its style bleed into the photographic result, and that's where ControlNet comes in.
There are many different types of ControlNet models, and I chose depth and sketch for my render. I included depth because the original composition exploited the difference in each character's distance to the camera. I wanted to give the feeling that you're looking at the scene from the victim's perspective, so I used a depth-of-field effect to emphasise the space between the slave and the audience behind. The depth model is great at capturing that kind of detail from the source image.
I also used sketch, which is good at preserving finer details like the contours of the characters. Canny is another option that preserves details even better. Personally, I prefer sketch as long as it doesn't miss too many details, since it's easier to edit by hand and has a less detrimental impact on quality.
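If you prefer to prepare the control images outside your frontend, the preprocessing step looks roughly like this. Treat it as a minimal sketch: the file name and the model choices (MiDaS for depth, the PidiNet soft-edge detector from controlnet_aux for the sketch) are common community options used here for illustration, not exact settings from my workflow.

```python
# Minimal sketch: building depth and sketch control images from the source render.
# "source_render.png" and the model choices below are placeholders for illustration.
from PIL import Image
from transformers import pipeline
from controlnet_aux import PidiNetDetector

source = Image.open("source_render.png").convert("RGB")

# Depth map; you could also export a true depth pass directly from Blender instead.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
depth_image = depth_estimator(source)["depth"]

# Soft-edge "sketch" map, which is easy to touch up by hand before use.
sketch_detector = PidiNetDetector.from_pretrained("lllyasviel/Annotators")
sketch_image = sketch_detector(source)

depth_image.save("control_depth.png")
sketch_image.save("control_sketch.png")
```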
Note that any kind of conditioning, such as prompting or using ControlNets, has a potentially adverse effect on image quality. The more tightly you restrict the AI's options, the more it struggles to find a good-quality image that satisfies your conditions. As such, I try to use the fewest and weakest conditions that still give me the outcome I want, which is why I didn't add more ControlNets or set the ones I used to maximum strength. The overall workflow of this stage looks like the following image:
It may look complicated, but it's just what more user-friendly clients like A1111 do under the hood when you run txt2img with ControlNets. The only peculiar part is how I separated out the "style prompt" so that I can reuse it in later stages.
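If you're more comfortable reading code than node graphs, the same txt2img-with-ControlNets setup can be sketched with the diffusers library along these lines. The model IDs, prompts, and strengths below are illustrative placeholders rather than exact settings:

```python
# Rough code equivalent of the txt2img + two-ControlNet workflow (illustrative only).
import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

controlnets = [
    ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("xinsir/controlnet-scribble-sdxl-1.0", torch_dtype=torch.float16),
]
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

# The "style prompt" is kept separate so it can be reused in later stages.
style_prompt = "photograph, natural lighting, shallow depth of field"
scene_prompt = "a city square with a crowd of onlookers"  # placeholder scene description

control_images = [load_image("control_depth.png"), load_image("control_sketch.png")]

image = pipe(
    prompt=f"{scene_prompt}, {style_prompt}",
    negative_prompt="painting, 3d render, cartoon",
    image=control_images,
    controlnet_conditioning_scale=[0.6, 0.5],  # deliberately below full strength
    guidance_scale=5.0,
    num_inference_steps=30,
).images[0]
image.save("base_render.png")
```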
You may have noticed that I didn't use many words to describe the scene. The reason is that detailed scene descriptions almost never work with the current state of Stable Diffusion. One of its biggest limitations is that it's extremely difficult to depict a complex scene in a single generation pass. There are several ways to mitigate the problem, but none of those I've tried so far work reliably with SDXL. As such, we will just render the overall scene as closely as possible now and fix the inevitable problems in the subsequent stages.
The most important thing at this stage is to get the overall atmosphere right. That may sound counterintuitive, since adjusting the lighting or tone is something you'd do last when working with a traditional 3D tool. But with AI, it's much harder to change anything that affects the scene globally without also changing the details you've carefully constructed for the main subjects. So it's better to make the scene look as good as possible first, then fix the details later.
AI also struggles to depict details in low-resolution areas. In other words, it's unlikely to get every detail right in one go unless you render at a gigantic resolution, which has problems of its own. Unless the scene is simple, it's usually best to treat this as an iterative process that incrementally improves the outcome, rather than hunting for the one combination of prompts and settings that yields a perfect image in a single pass.
That is also why I didn't use any LoRA to enhance details, such as one that helps depict better chains or nipples, for instance. Like prompts and ControlNets, LoRAs can have detrimental or unexpected effects on the overall image when used without discretion. But if you want to use a LoRA that affects the global style rather than a specific subject or detail, this is the stage to include it.
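For completeness, loading a global style LoRA is a one-liner in most frontends; in diffusers it would look something like the following, reusing the pipeline object from the earlier sketch (the file path and scale are placeholders):

```python
# Optional: a global style LoRA applied at moderate strength (placeholder path).
pipe.load_lora_weights("path/to/photo_style_lora.safetensors")
pipe.fuse_lora(lora_scale=0.7)  # keep it weak enough not to fight the ControlNets
```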
To determine the best atmosphere and lighting for the scene, you need to find out which combination of models, prompts, and sampler settings gives you that. As this involves many variables, it's best to take a systematic approach rather than changing them at random.
What I usually do is use a fast sampler like Euler or UniPC with a low step count and a fixed seed, then experiment with the prompts and CFG first. Don't blindly copy and paste "best prompts" you find on the internet. You may be surprised how many of the words in them have little or even a negative impact on your render. Instead, try the words one by one to see what effect each has on the outcome. A short prompt is better anyway, since the longer it gets, the less significance each token carries.
CFG is also important for getting a photorealistic image; 5 is usually a good place to start. If the image looks dull or ignores some of the words in your prompt, try raising CFG. If you see unrealistic colours, use a lower value.
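As a concrete example of that loop, a quick CFG sweep with a fixed seed and a cheap sampler might look like this, reusing the pipeline, prompts, and control images from the earlier sketch (the values are arbitrary):

```python
# Prompt/CFG exploration: fixed seed, low step count, fast scheduler.
import torch
from diffusers import UniPCMultistepScheduler

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

for cfg in (3.0, 5.0, 7.0):
    image = pipe(
        prompt=f"{scene_prompt}, {style_prompt}",
        image=control_images,
        controlnet_conditioning_scale=[0.6, 0.5],
        guidance_scale=cfg,
        num_inference_steps=15,  # low step count keeps iteration fast
        generator=torch.Generator("cuda").manual_seed(42),  # fixed seed
    ).images[0]
    image.save(f"cfg_{cfg}.png")
```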
Once you are happy with the output, you can switch to a better sampler like SDE and go "seed hunting". While keeping the number of sampling steps low, generate batches of images, each with a different seed. Keep doing this until you find the seed that gives you the best result, then increase the step count once you're done.
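A seed hunt can then be as simple as looping over seeds with an SDE sampler, again reusing the objects from the earlier sketches (the seed range and step counts are arbitrary):

```python
# Seed hunting with an SDE sampler; raise num_inference_steps for the final render.
import torch
from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="sde-dpmsolver++", use_karras_sigmas=True
)

for seed in range(100, 108):
    image = pipe(
        prompt=f"{scene_prompt}, {style_prompt}",
        image=control_images,
        controlnet_conditioning_scale=[0.6, 0.5],
        guidance_scale=5.0,
        num_inference_steps=15,  # keep it low while hunting
        generator=torch.Generator("cuda").manual_seed(seed),
    ).images[0]
    image.save(f"seed_{seed}.png")
```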
You can also use the method explained above to determine how much leeway you want to give the AI. In my current render, for example, I decided to keep the result when the AI ignored the trees in the source image and replaced them with buildings. I thought a city square would make a better background for such a scene, so I changed my prompt accordingly. The AI also generated a night scene and a rainy one for some seeds, but I didn't like those as much, so I ended up with the following image instead:
Now that we have a suitable base render, we'll fix defects and add more details in the next stages.