Side-by-side tests of GPT-4 vs. GPT-3 show a mostly subtle improvement

Good news for AI fans and bad news for those who fear an era of cheap, procedurally generated content: OpenAI’s GPT-4 is a better language model than GPT-3, the model that powered ChatGPT, the chatbot that went viral late last year.

According to OpenAI’s reporting, the differences are stark. For example, OpenAI claims that GPT-3 took a simulated bar exam and bombed it, scoring in the bottom ten percent, while GPT-4 crushed the same test, scoring in the top ten percent. But since most people have never taken this simulated bar exam themselves, they’ll just need to see the model in action to be impressed.

And in side-by-side testing, the new model is impressive, though not as impressive as its scores might suggest. In fact, in our tests, GPT-3 sometimes gave the more useful answer.

To be clear, not all of the features OpenAI touted in yesterday’s release are available for public evaluation. Notably (and rather astonishingly), it accepts images as inputs and outputs text, which means it can theoretically answer questions like “Where on this screenshot from Google Earth should I build my house?” But we couldn’t test that.

Here’s what we were able to test:

GPT-4 has fewer hallucinations than GPT-3

The best way to sum up GPT-4 compared to GPT-3 might be this: its bad answers are less bad.

When asked a real stumper of a question, GPT-4 is shaky, but it’s much better than GPT-3 at not simply lying to you. In this example, you can see the models struggle with a question about bridges between countries at war. The question was designed to be difficult in multiple ways: language models are bad at answering questions about anything “current,” wars are hard to pin down, and geography questions like this one are deceptively murky and hard to answer cleanly, even for a trivia geek.

Neither model gave an A+ answer.

Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab

GPT-3, as always, loves to hallucinate. It knows just enough geography to make its wrong answers sound right. For example, the symbolic bridge it mentions in Korea is close to North Korea, but both ends of it are in South Korea.

GPT-4 was more cautious, pleading ignorance of current events and providing a much shorter list, which was still somewhat imprecise. The tense relationships between the states GPT-4 mentions don’t all amount to all-out war, and opinions differ on whether the line on a map between Gaza and Israel qualifies as a national border, but GPT-4’s answer is nonetheless more useful than GPT-3’s.

GPT-3 falls into other logical pitfalls that GPT-4 successfully sidestepped in my tests. For example, here’s a question where I ask what movies the children of France watch. I’m not asking for a list of kid-friendly French movies, but I know a bot trained on listicles and Reddit posts could read my question that way. Although I don’t know any French kids, GPT-4’s answer makes more intuitive sense than GPT-3’s:

The answers to the movies question. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab

GPT-4 picks up on subtext better than GPT-3

People are tricky. Sometimes we ask for something without really asking for it, and sometimes, in response to such a request, we’re given what we asked for without literally being given it. For example, when I asked for a limerick about a “real estate tycoon from Queens,” GPT-3 didn’t seem to notice that I was winking. GPT-4, however, caught my eye and winked back.

The two limericks. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab

Is Melania Trump “golden”? Never mind, because the next color allusion, “And turn the whole world tangerine!”, is a genuinely great line for this limerick. Which brings me to the next point…

GPT-4 writes slightly less painful poetry than GPT-3

Let’s face it: when people write poetry, most of it is horrible. That’s why criticism of GPT-3’s famously bad poetry was never really a knock on the technology itself, given that it’s supposed to mimic humans. That said, reading GPT-4’s doggerel is noticeably less excruciating than reading GPT-3’s.

Case in point: these two Comic Con sonnets I willed into existence in a fit of masochism. GPT-3’s is a monstrosity. GPT-4’s is merely bad.

The two Comic Con sonnets. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab

GPT-4 is sometimes worse than GPT-3

There’s no sugarcoating it: GPT-4 muddles its answer to this tough rock history question. I gather that GPT-3 was trained on the two most famous answers to this question, The Jimi Hendrix Experience and The Ramones (although some members of The Ramones who joined after the original lineup are still alive), but it also got lost in the weeds, listing famous dead singers from bands with surviving members. GPT-4, meanwhile, simply punted.

The answers to the dead bands question. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab

GPT-4 has not mastered inclusion

I gave both models another rock history question to see if either of them remembered that rock ’n’ roll was once an almost exclusively Black genre of music. For the most part, neither did.

The answers to the rock history question. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab

With all due respect to the legend Clarence Clemons, does a list like this really need to feature him multiple times as a member of a predominantly white band? Shouldn’t it make room for songs at the very core of American music culture, like Fats Domino’s “Blueberry Hill” or Little Richard’s “Long Tall Sally”?

Overall, GPT-4 is a subtle step up that still needs work. OpenAI’s reports of GPT-4 acing tests that GPT-3 bombed might make the difference between the two models seem like night and day, but in my tests it’s more like twilight and dusk.
