Researchers from Auburn University in Alabama and Adobe Research discovered the flaw when they tried to make an NLP system generate explanations for its behavior, such as why it claimed different sentences meant the same thing. When they tested their approach, they realized that shuffling the words in a sentence made no difference to the explanations. "This is a general problem for all NLP models," says Anh Nguyen of Auburn University, who led the work.
The team examined several cutting-edge NLP systems based on BERT (a language model developed by Google that underpins many of the latest systems, including GPT-3). All of these systems perform better than humans on GLUE (General Language Understanding Evaluation), a standard set of tasks designed to test language comprehension, such as spotting paraphrases, judging whether a sentence expresses positive or negative feelings, and verbal reasoning.
The man bites the dog: They found that these systems couldn’t tell when the words in a sentence were mixed up, even when the new order changed the meaning. For example, the systems correctly spotted that the phrases “Does Marijuana Cause Cancer?” and “How Can Marijuana Give You Lung Cancer?” were paraphrases. But they were even more certain that “You smoke cancer, how can the marijuana lung give?” and “The lung can make smoking marijuana how do you get cancer?” meant the same thing too. The systems also decided that sentences with opposite meanings such as “Does marijuana cause cancer?” and “Does cancer cause marijuana?” asked the same question.
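The researchers' probe can be approximated with a simple word-shuffling helper. This is a hypothetical sketch, not their actual code: the idea is to feed a model both an original sentence and a shuffled version and compare its predictions.

```python
import random

def shuffle_words(sentence, seed=0):
    # Randomly reorder the words of a sentence; a fixed seed makes
    # the shuffle reproducible.
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

original = "Does Marijuana Cause Cancer?"
shuffled = shuffle_words(original)
print(shuffled)
# A model that truly reads word order should score the shuffled version
# differently from the original; the systems tested often did not.
```

Because the shuffle keeps exactly the same words, any change in a model's output can only come from its sensitivity to order.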
The only task where word order mattered was the one in which models had to check a sentence's grammatical structure. Otherwise, between 75% and 90% of the tested systems' responses did not change when the words were shuffled.
What’s going on? Models seem to pick up on a few keywords in a sentence, regardless of their order. They don’t understand language the way we do, and GLUE, a very popular benchmark, doesn’t measure genuine language use. In many cases, the task a model is trained on doesn’t require it to care about word order, or syntax in general. In other words, GLUE teaches NLP models to jump through hoops.
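The keyword hypothesis is easy to illustrate: two sentences built from the same words have identical bag-of-words representations, so any model that effectively ignores order cannot tell them apart. A minimal sketch, using the opposite-meaning questions from the example above:

```python
from collections import Counter

def bag_of_words(sentence):
    # Lowercase, strip the question mark, and count word occurrences;
    # word order is discarded entirely.
    words = sentence.lower().replace("?", "").split()
    return Counter(words)

a = bag_of_words("Does marijuana cause cancer?")
b = bag_of_words("Does cancer cause marijuana?")
print(a == b)  # the two opposite questions look identical to a bag-of-words view
```

Real transformer models are far more sophisticated than this, but the results suggest that, on many GLUE tasks, they behave closer to this keyword view than to a reader who parses the sentence.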
Many researchers have started using a more difficult suite of tests called SuperGLUE, but Nguyen suspects it will have similar problems.
This issue has been spotted before. Yoshua Bengio and colleagues found that rearranging the words in a conversation sometimes did not change chatbots' responses. And a team from Facebook AI Research found examples of the same behavior in Chinese. Nguyen’s team shows that the problem is widespread.
Does it matter? It depends on the application. On the one hand, an AI that still understands you when you make a typo or garble a sentence, as another human would, would be helpful. But in general, word order is crucial to unpicking a sentence’s meaning.
How to fix it? The good news is that it might not be too hard. The researchers found that forcing a model to focus on word order, by training it on a task where word order matters, such as spotting grammatical errors, also made it perform better on other tasks. This suggests that changing the tasks models are trained on could make them better overall.
Nguyen’s results are another example of how models often fall far short of what we think they are capable of. He thinks they highlight how hard it is to build AIs that understand and reason like humans. “Nobody has any idea,” he says.