For all of GPT-3’s flair, its output can feel untethered from reality, as if it doesn’t know what it’s talking about. That’s because it doesn’t. By grounding text in images, researchers at OpenAI and elsewhere are trying to give language models a better grasp of the everyday concepts that humans use to make sense of things.
DALL·E and CLIP approach this problem from different directions. At first glance, CLIP (Contrastive Language-Image Pre-training) is yet another image recognition system. The difference is that it learned to recognize images not from labeled examples in curated data sets, as most existing models do, but from images and their captions scraped from the internet. It learns what’s in an image from a description rather than a one-word label like “cat” or “banana”.
CLIP is trained by having it predict which caption, out of a random selection of 32,768, is the correct one for a given image. To do this, CLIP learns to link a wide variety of objects with their names and with the words that describe them. This then lets it identify objects in images outside its training set. Most image recognition systems are trained to identify certain types of object, such as faces in surveillance footage or buildings in satellite imagery. Like GPT-3, CLIP can generalize across tasks without additional training. It is also less likely than other state-of-the-art image recognition models to be fooled by adversarial examples, images that have been subtly altered in ways that typically confuse algorithms even though humans notice no difference.
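The contrastive objective can be sketched in a few lines of NumPy. This is a toy illustration of the idea, not OpenAI’s implementation: the embeddings below are random stand-ins for the outputs of trained image and text encoders, and the batch is 4 pairs rather than 32,768.

```python
import numpy as np

def contrastive_logits(img_emb, txt_emb, temperature=0.07):
    # Normalize to unit length, then take all pairwise dot products:
    # entry (i, j) scores how well caption j matches image i.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T / temperature

def caption_probabilities(logits):
    # Softmax over captions: row i is the model's belief about which
    # caption in the batch describes image i. Training pushes the
    # diagonal (the true pairings) toward probability 1.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy batch: 4 images and their 4 captions as 32-dim embeddings.
# Caption i is a noisy copy of image i, standing in for encoders
# that have learned to map matching pairs close together.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 32))
texts = images + 0.05 * rng.normal(size=(4, 32))

probs = caption_probabilities(contrastive_logits(images, texts))
print(probs.argmax(axis=1))  # each image picks out its own caption
```

Because the model only has to pick the right caption from a batch, it never needs a fixed list of class labels, which is what lets it generalize to objects it was never explicitly taught.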
Instead of recognizing images, DALL·E (which I’m guessing is a WALL·E/Dalí pun) draws them. This model is a smaller version of GPT-3 that was also trained on text-image pairs pulled from the internet. Given a short natural-language caption, like “a painting of a capybara sitting in a field at sunrise” or “a cross-section view of a walnut”, DALL·E generates a batch of images to match it: dozens of capybaras of all shapes and sizes in front of orange and yellow backgrounds; row after row of walnuts (though not all of them in cross-section).
The results are striking, though still a mixed bag. The caption “a stained glass window with an image of a blue strawberry” produces many correct results, but also some with blue windows and red strawberries. Others contain nothing that looks like a window or a strawberry at all. The results showcased by the OpenAI team in a blog post were not cherry-picked by hand but ranked by CLIP, which selected the 32 DALL·E images for each caption that it judged best matched the description.
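That reranking step amounts to scoring each candidate image against the caption and keeping the top k. A minimal sketch of the idea, with random vectors standing in for CLIP embeddings (the `rerank` function, the dimensions, and the candidate count of 512 are all invented for illustration):

```python
import numpy as np

def rerank(caption_emb, image_embs, k=32):
    # Cosine similarity between the caption and every candidate image,
    # then the indices and scores of the k best matches, best first.
    c = caption_emb / np.linalg.norm(caption_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ c
    order = np.argsort(-scores)
    return order[:k], scores[order[:k]]

# 512 candidate "generations" for one caption, as random 16-dim vectors.
rng = np.random.default_rng(1)
caption = rng.normal(size=16)
candidates = rng.normal(size=(512, 16))

top_idx, top_scores = rerank(caption, candidates, k=32)
```

Using one model to filter another’s output this way means the showcased images reflect CLIP’s judgment of caption-image fit, not a human curator’s.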
“Text-to-image is a research challenge that’s been around for quite some time,” says Mark Riedl, who works on NLP and computational creativity at the Georgia Institute of Technology in Atlanta. “But it’s an impressive set of examples.”
To test DALL·E’s ability to work with new concepts, the researchers gave it captions describing objects they thought it would not have seen before, such as “an avocado armchair” and “an illustration of a baby daikon radish in a tutu walking a dog.” In both cases, the AI generated images that plausibly combined the concepts.
The avocado armchairs in particular all look like both chairs and avocados. “What surprised me most is that the model can take two independent concepts and put them together so that you get something functional,” says Aditya Ramesh, who worked on DALL·E. This is probably because an avocado cut in half looks a bit like a high-backed armchair, with the pit as a cushion. For other captions, such as “a snail made of harp,” the results are poorer, with images that combine snails and harps in odd ways.
DALL·E is the kind of system Riedl imagined submitting to the Lovelace 2.0 test, a thought experiment he proposed in 2014. The test is meant to replace the Turing test as a benchmark for measuring artificial intelligence, and it assumes that one mark of intelligence is the ability to blend concepts in creative ways. Riedl suggests that asking a computer to draw a picture of a man holding a penguin is a better test of smarts than asking a chatbot to dupe a human in conversation, because it is more open-ended and harder to cheat.
“The real test is seeing how far AI can be pushed out of its comfort zone,” says Riedl.
“The ability of the model to generate synthetic images from rather whimsical text strikes me as very interesting,” says Ani Kembhavi of the Allen Institute for Artificial Intelligence (AI2), who has also developed a system that generates images from text. “The results seem to obey the desired semantics, which I find quite impressive.” Jaemin Cho, a colleague of Kembhavi’s, is also impressed: “Existing text-to-image generators have not shown this level of control in drawing multiple objects, or the spatial reasoning abilities of DALL·E,” he says.
Yet DALL·E already shows signs of strain. Including too many objects in a caption stretches its ability to keep track of what to draw. And rephrasing a caption with words that mean the same thing sometimes yields different results. There are also signs that DALL·E is mimicking images it has encountered online rather than generating novel ones.
“I’m a little suspicious of the daikon example, which stylistically suggests it may have memorized artwork from the internet,” says Riedl. He notes that a quick search turns up many cartoon images of anthropomorphized daikons. “GPT-3, on which DALL·E is based, is known for memorizing,” he says.
Still, most AI researchers agree that grounding language in visual understanding is a good way to make AIs smarter.
“The future will be made up of systems like this,” says Ilya Sutskever, OpenAI’s chief scientist. “And both of these models are a step toward that system.”