The making of a book cover - A heresy story


#1 of SD_MakingOfCover

This is a glance into the workings of SD and how a book cover was made.


The making of a book cover

People repeatedly ask, so I figured that I'd blabber a bit about the-big-scary that's currently on everyone's mind. It's simpler than doing it one by one in PMs, so let's go on a journey through how I made the recent book cover:

Image of walls reboot book cover

Before we start, I'll begin by setting a bit of context and clarifying some topics.

The giants...

Stable Diffusion (SD) is a deep learning text-to-image model. It was first released by Stability AI in August 2022. You can run the model locally on a PC, pay an online service provider, or use some shitty bot that wanders social media.

Out of the services one can pay for, there are three big ones: Midjourney, Novel AI, and DreamStudio (which is owned by Stability).

Smaller but relevant ones are Dall-E 2, which pairs CLIP embeddings with a diffusion decoder, and Nvidia eDiffi, which uses an ensemble of expert denoisers.

The paid services generally provide a (much) simplified web page, heavily alter any prompts you enter, and feed them through custom-tailored varieties of SD.

Running it locally is more complicated (some Linux and Python experience is pretty much mandatory if you need to troubleshoot things), but in return you get access to a massive suite of controls, scripts, and so on that enables a lot of things that are otherwise not possible.

While I have (very briefly) used Midjourney when it first started, I found that it was more akin to throwing dice at random, and that did not suit my brain. Since then, I've been using my own locally run Frankenstein version of SD.

Modes of usage...

This is not meant to be a technical guide, so we shall ignore a great many things (the complexities within each form of usage, and so on), but reduced to their simplest possible interpretation, these are the major uses...

Txt2Img: You describe a scene in words, and SD attempts to generate a matching image.

Img2Img: You give SD a description and a starting image, and it attempts to generate a new image from both.

Inpainting: You give SD an image, mark a particular area within it, and provide a description; SD tries to regenerate that area.

Outpainting: You tell SD to expand an image up, down, left, or right. Based on various clues, it'll do its best.
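For the tinkerers: here's a rough sketch of what those modes look like in code, using the Hugging Face diffusers library. This is not my Frankenstein setup; model ids, file names, and values are placeholders, parameter names have shifted a bit between diffusers versions, and outpainting is left out since it's usually just inpainting on an enlarged canvas.

```python
# Rough sketch of Txt2Img, Img2Img, and Inpainting via Hugging Face diffusers.
# Not my setup; model ids and file names are placeholder examples.
import torch
from PIL import Image
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
)

device = "cuda"
model_id = "runwayml/stable-diffusion-v1-5"  # example SD 1.5 checkpoint

# Txt2Img: words in, image out.
txt2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
image = txt2img("a moody evening, sunscorched cliffs, a blood red sky").images[0]

# Img2Img: words plus a starting image; `strength` decides how far SD may wander from it.
# (Older diffusers versions called this parameter `init_image`.)
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
init = Image.open("dog.png").convert("RGB").resize((512, 512))
image = img2img(prompt="anthro wolf, dramatic lighting", image=init, strength=0.6).images[0]

# Inpainting: an image, a mask marking the area to redo (white = redo), and a description.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to(device)
mask = Image.open("mask.png").convert("RGB").resize((512, 512))
image = inpaint(prompt="a sharp obelisk against the moon", image=init, mask_image=mask).images[0]
```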

This is _also_ not meant to be a moral or ethical analysis of SD, thus we will skirt that subject for now. Do respect this, as neither I nor the mods want to see the comments section turn into yet another mudslinging pit. So take a deep breath, and hold it in for another time.

Humble beginnings...

We're somewhere in September 2022. Stable Diffusion 1.4 has been released, but the tools for using it locally on a PC are either broken or extremely primitive. This is the age where no one has any real clue how to use it, artists are referenced like crazy, and the infamous 'trending on artstation' is rife.

By this point in time, Stable Diffusion is still suffering from something called the CLIP Token Limit. Simplified, this means that you can feed the model at most about 75 tokens, which loosely translates to 75 words. Put another way, every word matters.
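If you're curious how close a prompt sits to that ceiling, the tokenizer SD 1.x uses is publicly available, so you can count for yourself. A minimal sketch, assuming the transformers library (the hard limit is 77 tokens, two of which are reserved start/end markers):

```python
# Minimal sketch: count how many CLIP tokens a prompt consumes.
# Assumes the transformers library and the CLIP text encoder used by SD 1.x.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a moody evening, walking along sunscorched cliffs, a blood red sky, lightning in the background"
token_ids = tokenizer(prompt).input_ids  # includes the start/end markers
print(f"{len(token_ids)} of 77 tokens used")
```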

Rather than trying to describe a scene as 'a moody evening, walking along sunscorched cliffs, a character looks up to the sky, seeing a blood red sky, lightning looms in the background' (24 words), you can instead use an artist as a shortcut (2 words).

This (understandably) not only pokes a lot of people in the eye, but it is also a _very_ crude way of doing things, because you will drag a lot of baggage into the scene that you don't want.

This limitation was later solved (to a degree at least), and several workarounds now exist.

Sidenote: SD 2.0 has pretty much made artist references meaningless, and every major model I know of is following suit.

Back to the story...

Not only is SD 1.4 rather crude and its accompanying tools primitive, but the sheer lack of knowledge means that most stuff comes out as nightmares.

Making furries is pretty much impossible, ken-dolls are rife, and plenty of portals to the underworld are torn open. But, even at this stage it's becoming apparent that you can use various tricks to help the process along.

One of my first desires was to recreate Vilkas (a character from one of my stories), specifically to dress him up for a certain scene. The first few attempts are... awkward, though I now regret not saving them for shit and giggles.

So, what do I do? I take a picture of my dog.

Image of a cartoon dog staring at you

After a bit of preprocessing (removing the background, adding a simplified one, and restructuring said dog to create a more furry-esque visage), Img2Img is the next step. As one turns the knobs and dials in various attempts to help guide the process, the result is horrifying:

Image of an extremely distorted wolf

What follows is a kind of iterative workstyle where you try various descriptions and settings, take the outputs that inch a little closer to what you want, and feed them back into the model. I don't like Adobe, so I use Affinity Photo instead, which helps a lot as well, since you can sketch in changes between each step.
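The loop itself is nothing magical. A toy version of that feedback cycle could look like this (placeholder model id, file names, and strengths; the hand-editing between passes is the part no script does for you):

```python
# Toy version of the iterative Img2Img cycle: feed each output back in,
# easing off `strength` so later passes change less and less.
# File names, prompt, and strengths are placeholders, not my actual values.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

current = Image.open("dog_prepped.png").convert("RGB").resize((512, 512))
prompt = "anthro wolf, grey fur, dramatic lighting, detailed"

for strength in (0.7, 0.5, 0.35):
    current = img2img(prompt=prompt, image=current, strength=strength).images[0]
    current.save(f"pass_{strength}.png")  # retouch/sketch on this before the next round
```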

Eventually, you end up with something that kind of looks like this:

Image of a relatively good looking wolf, though its color is warped

It's pretty good compared to what you started with, but for some reason it has become washed out and increasingly pushed toward magenta. (As long as color diverges in a uniform manner, it's relatively easy to bring it back into scope. If the gamut diverges in several directions, it's immediately a lot more complicated.)

Sidenote: We now know that this is a weakness of Ancestral Euler and its lack of color clamping.
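If you want to test that theory yourself, swapping the sampler is a one-liner in diffusers; whether the drift really follows the sampler in your setup is something to check, not take on faith:

```python
# Sketch: swap Euler Ancestral for a non-ancestral scheduler and compare.
# Assumes diffusers; the checkpoint id is a placeholder.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Reuse the existing scheduler config, but with a non-ancestral sampler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```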

But, that's nothing a bit of sketching, retouching, and color correction can't fix, so into Affinity Photo it goes once more:

Image of a relatively good looking wolf

What comes out is pretty nice, at least I think so.

This picture was archived to be used later in the book, and time moves on. We're now in November, about 3 months since the original release of SD. The tools have improved, knowledge of how to use it has grown, and the pace of development isn't slowing; if anything... people have trouble keeping up.

At this point in time, the release of my new book is looming, and I give in to the urge to experiment with a book cover. While there are many approaches to wielding SD, I repeatedly find myself falling into the category where you start simple, then iterate as you construct a scene piece by piece.

A sky. A mixture of flowers and leaves rising from the ground. A looming darkness closing in. The moon as a target. A segment of a wall, topped with a sharp obelisk pointing at the moon to tie into thematic parts of the story. The cherry on top is the distorted text, and the duality of meaning woven into it. It is pretty much at the end that I realize I've actually created a cover that is far more fitting for book 1 than for book 3.

Image of book 1's cover

A fun exercise, but after seeking counsel, it's time to go in another direction.

Which... loops us back to Vilkas, transmuted from a picture of my dog.

Sidenote: SD has a tendency to level out; in other words, the more you work a piece, the more detail is lost, eventually resulting in a blank canvas. But with the right approach, there's no such limitation. By adding noise, tweaking the denoiser, and with a fair bit of help from additional sketching, you can keep working it like clay.
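The 'adding noise' half of that trick doesn't even need SD. A crude way to re-inject some grain before the next Img2Img pass (amounts are arbitrary; assumes numpy and Pillow):

```python
# Crude sketch: add a bit of Gaussian noise so the next pass has texture to
# latch onto instead of flattening out further. Sigma is arbitrary.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("work_in_progress.png").convert("RGB"), dtype=np.float32)
noise = np.random.normal(loc=0.0, scale=12.0, size=img.shape)  # sigma ~12 on a 0-255 scale
noisy = np.clip(img + noise, 0, 255).astype(np.uint8)
Image.fromarray(noisy).save("work_in_progress_noisy.png")
```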

We can't just plaster Vilkas onto a book cover, so we need to do something with him. Stylize him...

Treating him like putty, we start with wild experiments in order to deconstruct him.

Image of 3 wolves in various degrees of algorithmic deconstruction

Then reconstruct...

Image of 3 wolves being reconstructed, this time looking more feral and colorful

One stands out; it feels right. We jump into Affinity again and intensify everything.

Image of a fourth variety, this time more refined

Wolf! While he looks fabulous, we also veered off the path somewhere... Is this a book cover? Maybe... but not really. It felt like we were on the right path, but...

We ponder a bit, but decide to seek help elsewhere. So, down into the basement we go...

We make our way to the humanely kept fox, then rattle his cage!

"What is the secret to a good book cover, fox!?" you snap, establishing dominance.

"Simplicity!" the fox answers.

You hum to yourself, and return to the stairs.

"Noooo... Don't leave me so soon..." the fox pleads in the distance.

Once back in your lair, you ponder the simple-minded but at times wise words of the creature.

The issue with the book cover might be that it didn't go far enough. Simple, yet strong, a bit of color isn't wrong, but it needs to be framed right. Inspiration strikes, and you let out a wild cackle as lightning strikes in the background.

Harsher! Harder! Reconstruct it down to its base elements!

Doing it in Affinity is going to be a pain, but Inkscape makes it trivial. Start it, import the picture, Path > Trace Bitmap, fiddle with a dial, and you've got the raw components that you need. Put on some rave music, start piecing everything together, and slap some paint on along with a pair of googly eyes. A heaping of noise helps.

Image of an ink-splatter wolf with a bunch of color and a set of yellow eyes

(Observe the unparalleled beauty of my creation!)

The important thing is that we have a gestalt of what we want, with enough spacing to guide SD along the lines we desire.

Next it's back and forth between Affinity and SD for experimentation. Maybe some electronics, splatter, and neurons?

Image of a more refined wolf, this time with hints of red

There's a distinct improvement to how he blends with the background, and we have that burning intensity in the eyes. Alas, it still doesn't feel right.

We dump the red, and ponder the mantra: 'Most secrets hide in the open.' The pattern on the forehead already reminds us of something, and through happenstance we've got another case of magenta going on.

Let's be like water in the river and flow around the rocks...

Image of an even more refined wolf, with purple and yellow eyes.

More purple, cleanup, and a big insect dropped on his forehead! It chitters and it crawls, some might see it, but most won't.

Clarity settles. This is nice. Now we just need to frame it. The fox is let out of his cage and is allowed to assist.

Image of walls reboot book cover

One book cover made.

FAQ

1: What prompt did you use?

For which part? When, and at what stage? The ear? Eye? Nose? Electronics? To help inkify? Specifics, people! If you want any sort of control, you'll need to be specific in this environment.

2: Fine, what prompt did you use for... the insect?

Seed: 2407802834

Positive: lines, linework, insect, tesseract, the end of all things

Negative: cartoon smurf village, child's drawing of a space marine, crayon drawing of brutalist architecture, vintage war photography, youtube thumbnail of a german marching band, webcam footage of geoff keighley, screaming in a sewer, pixelated ape enclosure, hideous ogre monster, commodore 64 screenshot, minimalist depiction of george w. bush, dunce cap clipart, pencil sketch of willem dafoe vomiting, my esteemed rabbi bill clinton

3: Are you insane?

We're all mad down here...

(Look up Dali's paranoiac-critical method for more information)

4: But seriously, what's up with that text?

SD looks for patterns in all things, and these patterns are stored in what is referred to as Latent Space. Some patterns are useful, others not so much. Like an expedition setting sail, you need to provide supplies, a map, and various instructions to help them on their journey.

The same is true for SD. You are the one controlling the vectors and their journey through the thundering sea. (Online services that provide SD do most of the heavy lifting that helps guide the vectors. Since we're doing things locally, we have to do this on our own, with the benefit of added control as a result.)

One of the controls managing this wild sea is the CFG scale (Classifier-Free Guidance). Low values generally mean that SD is free to hallucinate, while high values rein it in. The 'correct' value depends on what you want, with the extremes being a 40-legged creature from the nether, and a cube in black and white.

Due to SD's inherent hallucinations, it's going to end up dipping its paws into various corners of latent space we might not want it to touch. The negative prompts help manage that.
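In plain code terms, both of those dials are just arguments on the call. A sketch with diffusers (values are purely illustrative, not the ones behind the cover):

```python
# Sketch: CFG scale and negative prompt as plain arguments. Placeholder values.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

loose = pipe(
    "lines, linework, insect, tesseract",
    guidance_scale=4,  # low CFG: SD is free to hallucinate
).images[0]

strict = pipe(
    "lines, linework, insect, tesseract",
    negative_prompt="cartoon, child's drawing, pixelated",  # corners of latent space to steer away from
    guidance_scale=13,  # high CFG: reined in, closer to the literal prompt
).images[0]
```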

Worth noting: The texts are more esoteric than one might expect, and should not necessarily be viewed through a human lens (willem dafoe vomiting is not exactly a common sight), but rather as CLIP-guided vector manipulation...

Sidenote: DAAM is an excellent tool for interrogating vectors.

5: I entered your prompt and I did not get a cool insect, what gives?

What works and what doesn't depends on a huge range of factors. Without my sketches as a starting point, and the exact setup of my Frankenstein-SD version, you're not going to get a result which makes much sense.

As it is, there are so many software-specific variations that it is pretty much impossible to recreate an image. If you use an online service (and they haven't messed around with any of the noise-generating parameters, or bothered with any optimizations), then you might be able to recreate a piece, provided you have all the settings.

In other words, listing the settings for a single image (with locally run versions of SD) would be a major hassle. Doing it for every step along the way would require a small army of administrative clerks.
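For context, 'all the settings' for even a single image means at least the batch below, and that's before software-level differences in noise generation and optimizations enter the picture. A sketch with diffusers, placeholder values only:

```python
# Sketch of the minimum you'd have to pin down to even attempt a recreation:
# model weights, sampler, seed, steps, CFG, resolution... Placeholder values.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)  # the sampler is a setting too

image = pipe(
    prompt="lines, linework, insect",
    num_inference_steps=30,
    guidance_scale=7.5,
    width=512,
    height=512,
    generator=torch.Generator(device="cuda").manual_seed(1234),  # the seed
).images[0]
```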

To make another point: The pattern on Vilkas' head is already 'insect-like', which is half the work. If you look up at the clouds and see a giraffe, then SD will most likely see one as well. This has a negative side to it as well, as unintended patterns can mean you end up with... odd things where you don't expect them to be.

6: If that's true, then what purpose does sharing a prompt have?

Sharing prompts can be a useful way to help someone find the same 'vibe' as you (given use of the same model, or one of its relatives), and that is increasingly what people use shared prompts for. Not to clone, but to share various techniques on how to reach interesting places within latent space.

7: I looked at Lexica, and their prompts do not look like yours

SD 1.4 and 1.5 prompt according to the old ways. Artist names work, positive prompts have a huge impact, and the way you talk with the model is special, especially with the Token Limit in place.

SD 2.0 and 2.1 prompt very differently, and using an old style prompt tends to not work. Negative prompts carry more weight, and verbose descriptions can work remarkably well.

SD 3.0 is probably going to shake things up even more.

8: How much time does this take?

A simple piece (given that you know your way around the tool) might take an evening. More complicated pieces take 2-3 evenings, which is like 1.5 days of full work. It's difficult to pin down though, since most pieces tend to travel around my friends for critique, cue a list of changes. Not to mention that a bit of distance means that you end up doing yet another revision in order to appease that nagging detail-gremlin in your ear.

Final words

SD is like a puzzle with a variety of solutions, and my approach isn't universal, though it interacts with other tools pretty well, like MiDaS for depth-imaging, or Blender for quickly sketching a scene in 3D.
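As a taste of the depth side, a depth map is only a few lines with an off-the-shelf model. The sketch below leans on the transformers depth-estimation pipeline (a DPT model from the MiDaS family) rather than my exact setup; the file names are placeholders:

```python
# Sketch: produce a depth map from a sketch or photo, which a depth-aware SD
# workflow can then use. File names and model choice are placeholders.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
result = depth_estimator(Image.open("scene_sketch.png").convert("RGB"))
result["depth"].save("scene_depth.png")  # grayscale depth map
```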

Also, our example so far has been pretty (thematically) dark and gloomy, but if one takes the dog picture we started with and goes in another direction, cute and sweet (like my dog)... he ends up looking like this (with a lot of touchups):

Image of a cute painted dog