The program, DALL-E, revealed earlier this month, can whip up images of all sorts of weird things that don’t exist, like avocado armchairs, robot giraffes, or radishes wearing tutus. OpenAI generated several images, including the Spaghetti Knight, at the request of WIRED.
DALL-E is a version of GPT-3, an AI model trained on text scraped from the web that is capable of producing surprisingly coherent text. DALL-E was fed images and accompanying descriptions; in response to a prompt, it can generate a decent mashup image.
Pranksters were quick to see the funny side of DALL-E, noting, for example, that it could imagine new types of British cuisine. But DALL-E is based on an important advance in the field of AI computer vision, which could have serious and practical applications.
Called CLIP, it consists of a large neural network, an algorithm inspired by how the brain learns, that was fed hundreds of millions of images and associated text captions from the web and trained to predict the right labels for an image.
OpenAI researchers discovered that CLIP could recognize objects as accurately as algorithms trained in the usual way, using curated data sets in which the images match the labels precisely.
As a result, CLIP can recognize more things, and it can understand what some things look like without needing many examples. CLIP helped DALL-E produce its illustrations, automatically selecting the best images from those it generated. OpenAI has published a paper describing how CLIP works as well as a small version of the resulting program. It has yet to release any paper or code for DALL-E.
DALL-E and CLIP are “super awesome,” says Karthik Narasimhan, an assistant professor at Princeton specializing in computer vision. He says CLIP builds on previous work that sought to train large AI models using images and text simultaneously, but does so on an unprecedented scale. “CLIP is a large-scale demonstration of the ability to use more natural forms of supervision – the way we talk about things,” he says.
He says that CLIP could be commercially useful in many ways, from improving the image recognition used in web search and video analytics to making robots or autonomous vehicles more intelligent. CLIP could be used as the starting point of an algorithm for robots to learn from images and text, such as instruction manuals, he says. Or it might help an autonomous car recognize pedestrians or trees in unfamiliar surroundings.