This guide is marked as DISCONTINUED. It remains published here as the explainer and terminology might still be educational, but the project itself — based in Google Colab and written by a third party — no longer runs as currently written. We’ll revisit periodically to check if that has been repaired, but in the interim, feedback will not be reviewed.
Reading social media and science articles as of late, you’ve probably had the misfortune of encountering surreal, sometimes nightmarish images with the description “VQGAN+CLIP” attached. Familiar glimpses of reality, but broken somehow.
My layperson understanding struggles to define what VQGAN+CLIP even means (an acronym salad of Vector Quantized Generative Adversarial Network and Contrastive Language–Image Pre-training), but Phil Torrone deftly describes it as “a bunch of Python that can take words and make pictures based on trained data sets.” If you recall the Google DeepDream images from a few years back, where everything was turned into dog faces, it’s an evolution of similar concepts.
GANs (Generative Adversarial Networks) are systems where two neural networks are pitted against one another: a generator which synthesizes images or data, and a discriminator which scores how plausible the results are. The system feeds back on itself to incrementally improve its score.
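To make that feedback loop concrete, here is a deliberately tiny sketch in plain Python. It is not how VQGAN works internally; it assumes a toy setup where the "real data" are numbers clustered around 4, the generator can only learn a single offset `b`, and the discriminator is a one-input logistic scorer. The function name `train_toy_gan` and all parameter values are made up for illustration.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_toy_gan(steps=500, batch=32, lr=0.05, real_mean=4.0, seed=0):
    """Toy 1-D GAN (illustrative assumption, not the VQGAN architecture).

    Generator:      fake = noise + b         (learns the offset b)
    Discriminator:  D(x) = sigmoid(w*x + c)  (scores plausibility, 0..1)
    Returns the value of b after every training step.
    """
    rng = random.Random(seed)
    b = 0.0          # generator parameter
    w, c = 0.0, 0.0  # discriminator parameters
    history = []
    for _ in range(steps):
        real = [rng.gauss(real_mean, 0.5) for _ in range(batch)]
        fake = [rng.gauss(0.0, 0.5) + b for _ in range(batch)]

        # Discriminator step: gradient ascent on
        # log D(real) + log(1 - D(fake)).
        grad_w = (sum((1 - sigmoid(w * x + c)) * x for x in real)
                  - sum(sigmoid(w * x + c) * x for x in fake)) / batch
        grad_c = (sum(1 - sigmoid(w * x + c) for x in real)
                  - sum(sigmoid(w * x + c) for x in fake)) / batch
        w += lr * grad_w
        c += lr * grad_c

        # Generator step: gradient ascent on log D(fake), i.e. shift b
        # so the discriminator rates the fakes as more plausible.
        grad_b = sum((1 - sigmoid(w * x + c)) * w for x in fake) / batch
        b += lr * grad_b
        history.append(b)
    return history


if __name__ == "__main__":
    history = train_toy_gan()
    closest = min(history, key=lambda value: abs(value - 4.0))
    print("closest generator offset to the real mean:", closest)
```

The generator never sees the real data directly; it only follows the discriminator's score. In this sketch the offset drifts from 0 toward the real mean, and, as with full-size GANs, it may overshoot and wobble rather than settle neatly.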
Much of the coverage has focused on the unsettling and dystopian applications of GANs (deepfake videos, nonexistent but believable faces, models trained on poorly curated datasets that inadvertently encode racism), but they also have benign uses: upscaling low-resolution imagery, stylizing photographs, and repairing damaged artworks (even speculating on entire lost sections in masterpieces).
CLIP (Contrastive Language–Image Pre-training) is a companion third neural network that scores how well an image matches a natural-language description. That description is the text prompt you start with, and CLIP’s score is what steers the VQGAN toward it.
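One way to picture CLIP's matching is as a similarity score between two lists of numbers (embeddings), one for the text and one for each image. The sketch below uses cosine similarity, which is the measure CLIP compares embeddings with, but the three-number embeddings, the file names, and the `rank_images` helper are all invented for illustration; real CLIP embeddings have hundreds of dimensions and come from the trained network.

```python
import math


def cosine_similarity(a, b):
    # 1.0 means "pointing the same way", 0.0 means "unrelated".
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    # Hypothetical helper: best-matching image names first.
    scored = [
        (cosine_similarity(text_embedding, vector), name)
        for name, vector in image_embeddings.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Made-up embeddings standing in for what a trained model would produce.
text_embedding = [0.9, 0.1, 0.0]  # imagine the prompt "a photo of a dog"
image_embeddings = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "cat.jpg": [0.3, 0.9, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

ranking = rank_images(text_embedding, image_embeddings)
print(ranking)  # ['dog.jpg', 'cat.jpg', 'car.jpg']
```

In VQGAN+CLIP the same kind of score is computed for each image the generator produces, and the generator is nudged in whichever direction raises it.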
It’s heady, technical stuff, but good work has been done in making this accessible to the masses, that we might better understand the implications: sometimes disquieting, but the future need not be all torches and pitchforks.
There’s no software to install — you can experiment with VQGAN+CLIP in your web browser with forms hosted on Google Colaboratory (“Colab” for short), which allows anyone to write, share and run Python code from the browser. You do need a free Google account, but that’s it.