As mentioned on the prior page, a series of jobs is run top-to-bottom, but on the first pass we’ll “Run all” to automate this. Some of the initial jobs are non-interactive, so scroll down a bit and we’ll start fiddling mid-form before setting it all in action…
(“Selection of models to download”)
This is where one selects one or more pre-trained models for the VQGAN. These models were assembled by various research groups and trained from different sources; some are for broad use, others are tuned to specific purposes such as faces.
Only one model is active at a time, but you can download more than one if trying some A/B comparisons through multiple runs. Some of these models are truly massive or are hosted on bandwidth-constrained systems, so choose one or two carefully; don’t just download the lot.
By default, imagenet_16384 is selected — it’s a good general-purpose starting point, trained from a large number of images prioritized by the most common nouns.
You can Google around for explanations on most of these, but for example…
ade20k is tuned to scenes, places and environments. This might be best for indoor scenes, cityscapes or landscapes.
ffhq is trained from a set of high-resolution face images from Flickr. You may have seen this used to make faces of “nonexistent people.”
celebahq is similar, though specifically built from celebrity faces.
Whatever model(s) you select here, you’ll specifically need to activate one of them in a later step…
The VQGAN model does all the “thinking,” but this is where you steer the output. If doing multiple runs, you’ll be returning to this section, editing one or more values, and clicking the “run” button to validate the inputs (but not yet generate any graphics).
The fields in this form include:
textos (texts): use this field to describe what you’d like to see in plain English. The “CLIP” part of VQGAN+CLIP scores how well each candidate image matches your text, steering the image-generating “VQGAN” part.
More detailed is generally better. “Carl Sagan” could go anywhere, but “Carl Sagan on a beach at sunset” provides a lot more context to work against.
With the generalized models, a popular addition is “in the style of,” where you can transmogrify a subject into a facsimile of some artist’s work. “Robot pterodactyl in style of Dali.” “Heckin' chonker cat in style of Leyendecker.” “Monster Squad movie in style of Lisa Frank.” Amazing how many styles it’s able to emulate. Not just individuals, try “Unreal engine,” “stained glass,” “Don Bluth,” “Pixar” and others. The more absurd the combination, the better.
You can also separate prompts with a vertical bar. “McRib | low poly” or “San Diego Supercomputer Center | HR Giger”
Anything with “Adafruit” eventually sprouts LEDs and pink hair.
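Multi-part prompts like these are simply split on the vertical bar into separate prompts. A minimal sketch of that behavior (the function name here is hypothetical, not the notebook’s actual code):

```python
def split_prompts(textos):
    """Split a prompt string on '|' into individual trimmed prompts."""
    return [p.strip() for p in textos.split("|") if p.strip()]

print(split_prompts("McRib | low poly"))
# ['McRib', 'low poly']
```

Each resulting prompt contributes its own CLIP guidance, which is why “subject | style” combinations work so well.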
ancho, alto (width, height): dimensions of the resulting image or video, in pixels. The default values are both 480, for a modest-sized square image. Larger images take geometrically more time and resources to process, so consider staying close to this pixel count. For example, I use 640x360 a lot … it’s the same number of pixels, but a 16:9 aspect ratio, great for video or for Twitter’s single-image crop.
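The 640x360 suggestion works because it holds the pixel budget constant while changing the aspect ratio; quick arithmetic confirms it:

```python
# Both sizes contain the same number of pixels, so processing cost is similar
square = 480 * 480   # default dimensions
wide   = 640 * 360   # 16:9 alternative
print(square, wide)  # 230400 230400
```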
modelo (model): which VQGAN model to use for image reconstruction. Different models are tuned to different purposes. The corresponding dataset must have been previously downloaded (see “Selection of models” above).
intervalo_imagenes (image_range): how many VQGAN iterations between preview images displayed in the browser, letting you see progress as the subject comes into focus. Every iteration is actually stored for producing a video later; this just sets how often we see a frame in progress (you can also right-click and save any of these individually, if you just want a few stills instead of video). Default is 50.
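The effect of intervalo_imagenes can be sketched with a toy loop (not the notebook’s actual code; the real step is a VQGAN+CLIP optimization pass):

```python
intervalo_imagenes = 50  # default: show a preview every 50 iterations
previews = 0
for i in range(1, 501):  # pretend we run 500 iterations
    # ... one optimization step would run here; every frame is saved ...
    if i % intervalo_imagenes == 0:
        previews += 1    # only these frames are displayed in the browser
print(previews)  # 10
```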
imagen_inicial (initial_image): normally this field is blank, the process starts with a little random noise, and the VQGAN interprets this like you or me finding faces in clouds. Optionally, rather than noise, you can provide an image as a starting point and have a little say over composition. This is also sometimes used to feed the output of one VQGAN model as the input into another. As explained on the prior page, images can be uploaded in the “Files” section (left margin), and referenced in this field by filename.
imagenes_objetivo (target_image): also normally blank, here you can optionally provide an image to steer the VQGAN toward, rather than from.
You can use either or both of these fields, or even the same image for both … sometimes the process quickly strays from the original composition, but this is a way to help keep it on track, as seen here:
seed: this provides a starting point for the random number generator. The default of -1 tells it to use a random seed — you’ll get different results each time, even with all other values the same. Supplying a number allows prior results to be reproduced. If you started random, but like the results and want to reproduce them at a different size or make a longer video, you can see the randomly-chosen seed when running the “execute” job (explained on next page), and copy-and-paste that into the parameters for the next run.
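The -1 convention can be sketched as follows (using Python’s random module purely for illustration; the notebook actually seeds the ML framework):

```python
import random

seed = -1  # the default: pick a seed at random
if seed == -1:
    seed = random.randrange(2**32)
print("seed:", seed)   # note this value to reproduce the run later
random.seed(seed)      # seeding makes subsequent "random" choices repeatable
```

Re-running with the same printed seed value yields the same sequence of random choices, which is what makes a run reproducible.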
max_iteraciones (max_iterations): How many VQGAN iterations to run before stopping. The default of -1 has it run indefinitely (you can stop the process manually when satisfied with the output). I usually set an upper limit here, maybe 800 or 1000 (most images “solidify” well before that), so I can “Run all” and do other things while it works, without the job monopolizing resources indefinitely.
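The stop condition amounts to something like this (a toy loop, not the notebook’s actual code; -1 means no automatic stop):

```python
max_iteraciones = 800  # -1 would mean run until manually interrupted
i = 0
while True:
    i += 1
    # ... one VQGAN iteration would run here ...
    if max_iteraciones != -1 and i >= max_iteraciones:
        break
print("stopped after", i, "iterations")
```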
If you selected “Run all,” and if you set a max_iterations value (or interrupt the process manually), the project will continue to assemble and download a video of all frames up to that point. In subsequent runs, you can do those steps manually.