The big breakthrough behind the new models is in the way images are created. The first version of DALL-E used an extension of the technology behind OpenAI’s language model, GPT-3, producing images by predicting the next pixel in an image as if pixels were words in a sentence. This worked, but not very well. “It wasn’t a magical experience,” says Altman. “It’s amazing that it worked at all.”
Instead, DALL-E 2 uses something called a diffusion model. Diffusion models are neural networks trained to clean images up by removing the noise that the training process adds. That process takes images and degrades them over many steps, replacing a few pixels at a time with random noise, until the original images are erased and you’re left with nothing but random pixels. “If you do this a thousand times, it looks like you’ve finally pulled the antenna cable out of your TV set; it’s just snow,” says Björn Ommer, who works on generative AI at the University of Munich in Germany and helped create the diffusion model that now powers Stable Diffusion.
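The noising process Ommer describes can be sketched in a few lines of Python. This is a toy illustration only, not real training code: the step count, noise schedule, and image are all made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, num_steps=1000, noise_scale=0.05):
    """Toy forward diffusion: blend a little Gaussian noise into the
    image at every step until the original is drowned out ('TV snow')."""
    x = image.copy()
    for _ in range(num_steps):
        noise = rng.normal(size=x.shape)
        x = np.sqrt(1 - noise_scale) * x + np.sqrt(noise_scale) * noise
    return x

image = rng.random((64, 64))   # stand-in for a real photo
snow = add_noise(image)

# After a thousand steps the surviving fraction of the original signal
# is vanishingly small, so the result is essentially uncorrelated noise.
print(abs(np.corrcoef(image.ravel(), snow.ravel())[0, 1]))
```

The square-root weights keep the overall variance roughly constant while the original signal decays exponentially, which is why, after enough steps, only “snow” remains.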
The neural network is then trained to reverse this process and predict what a less noisy version of a given image would look like. As a result, if you give a diffusion model a mess of pixels, it will try to produce something a little cleaner. Feed the cleaned-up image back in, and the model will produce something cleaner still. Do this enough times and the model can take you from TV snow to a high-definition picture.
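The feed-it-back-in loop can be sketched like this. In place of a trained network (an assumption of this toy demo), a simple box blur stands in for the model’s “predict a slightly cleaner image” step, and we start from pure noise, just as the real sampler does.

```python
import numpy as np

def denoise_step(x):
    """Stand-in 'denoiser': a 3x3 box blur plays the role of the trained
    network that predicts a slightly less noisy image."""
    padded = np.pad(x, 1, mode="edge")
    out = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy:1 + dy + x.shape[0],
                          1 + dx:1 + dx + x.shape[1]]
    return out / 9.0

rng = np.random.default_rng(1)
x = rng.normal(size=(32, 32))  # start from pure 'TV snow'
for _ in range(50):
    x = denoise_step(x)        # feed the cleaned-up image back in

# Each pass removes more noise, so the pixel values settle toward a
# smooth image: the standard deviation shrinks far below the initial 1.0.
print(x.std())
```

A real diffusion model differs in that the network is learned and each pass is conditioned on the current noise level, but the shape of the loop is the same: repeatedly applying a small cleaning step.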
AI art generators never work exactly the way you want them to. They often produce ugly results that at best look like distorted stock art. In my experience, the only way to really make the work look good is to add a descriptor at the end that points to an aesthetically pleasing style.
The trick to text-to-image models is that this process is driven by the language model, which tries to match a prompt with the images produced by the diffusion model. This pushes the diffusion model towards images that the language model considers a good match.
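That push-toward-a-match idea can be sketched as follows. Here a fixed pixel “target” and a mean-squared-error score stand in for the prompt and the language model (assumptions of this demo; real systems score the match with a learned text-image model such as CLIP), and a simple shrink step stands in for denoising.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the prompt: a fixed target image. Real systems compare
# images to text with a CLIP-style model, not a pixel target.
target = rng.random((8, 8))

def match_score(image):
    """Higher is better: negative mean squared distance to the target."""
    return -np.mean((image - target) ** 2)

def guided_denoise(x, steps=200, guidance=0.1):
    """Each step mixes a (toy) denoising move with a nudge toward
    images the 'language model' scores as a better match."""
    for _ in range(steps):
        x = 0.99 * x                     # stand-in denoising move
        x = x + guidance * (target - x)  # guidance: follow the score uphill
    return x

x = rng.normal(size=(8, 8))   # start from noise
out = guided_denoise(x)
print(match_score(x), match_score(out))
```

The guidance term is the gradient direction of `match_score`, so over many steps the sample drifts from random noise toward something the scoring model rates as a good match for the prompt.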
But models don’t make connections between text and images out of thin air. Most text-to-image models today are trained on a large dataset called LAION, which contains billions of text and image pairs scraped from the internet. This means that the images you get from a text-to-image model are a distillation of the world as represented online, distorted by prejudice (and pornography).
One last thing: there is a small but very important difference between the two most popular models, DALL-E 2 and Stable Diffusion. DALL-E 2’s diffusion model works on full-size images. Stable Diffusion, on the other hand, uses a technique called latent diffusion, invented by Ommer and his colleagues. It works on compressed versions of images encoded by a neural network, in what’s known as a latent space, where only the essential features of an image are kept.
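The difference is easy to see in a sketch. Here simple average-pooling and upsampling stand in for the learned autoencoder (an assumption of this demo; latent-diffusion models train a real encoder and decoder), but the arithmetic shows why working in the latent space is cheaper: the diffusion loop touches far fewer values per step.

```python
import numpy as np

rng = np.random.default_rng(3)

def encode(image, factor=8):
    """Toy 'encoder': average-pool into a compressed latent."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def decode(latent, factor=8):
    """Toy 'decoder': upsample the latent back to image size."""
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

image = rng.random((512, 512))
latent = encode(image)

# Diffusion now runs on the latent, not the full image:
print(image.size, latent.size)  # 262144 vs 4096: 64x fewer values per step
reconstructed = decode(latent)
```

Every denoising pass over the latent processes 64 times fewer numbers than a pass over the full image, which is a large part of why Stable Diffusion fits on consumer hardware.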
This means Stable Diffusion requires less computing power to work. Unlike DALL-E 2, which runs on OpenAI’s powerful servers, Stable Diffusion can run on (good) personal computers. Much of the explosion of creativity and the rapid development of new applications is due to the fact that Stable Diffusion is both open source (programmers are free to modify it, build on it, and make money from it) and light enough for people to run at home.
For some, these models are a step towards artificial general intelligence, or AGI, a much-hyped term referring to a future AI with general-purpose or even human-like abilities. OpenAI has been clear about its goal of achieving AGI. That’s why Altman doesn’t mind that DALL-E 2 now competes with a number of similar tools, some of them free. “We’re here to do AGI, not image generators,” he says. “It will fit into a broader product roadmap. It’s a tiny part of what an AGI will do.”