Technology research firm OpenAI has just released an updated version of its text-generating artificial intelligence program, called GPT-4, and has shown off some of the language model’s new capabilities. Not only can GPT-4 produce more natural-sounding text and solve problems more accurately than its predecessor, it can also process images in addition to text. But the AI is still vulnerable to some of the same problems that plagued earlier GPT models: displaying bias, blowing past the guardrails designed to prevent it from saying offensive or dangerous things, and “hallucinating,” or confidently making up falsehoods not found in its training data.
On Twitter, OpenAI CEO Sam Altman described the model as the company’s “most capable and aligned” to date. (“Aligned” means designed to follow human ethics.) But “it’s still flawed, it’s still limited, and it still looks more impressive when you first use it than when you spend more time with it,” he tweeted. OpenAI representatives could not be reached for comment by the time of this article’s publication.
Perhaps the most significant change is that GPT-4 is “multimodal,” meaning it works with both text and images. Although it cannot produce images (as artificial intelligence models such as DALL-E and Stable Diffusion can), it can process and respond to the visual inputs it receives. Annette Vee, an associate professor of English at the University of Pittsburgh who studies the intersection of computation and writing, watched a demonstration in which the new model was asked to identify what was funny about a humorous image. Being able to do so means “understanding the context in the picture. It’s understanding how an image is composed and why and connecting that to social understandings of language,” she says. “ChatGPT wasn’t able to do that.”
A tool that can analyze and then describe images could be enormously valuable for people who are blind or have low vision. For example, a mobile app called Be My Eyes can describe the objects around a user, helping those with low or no vision interpret their surroundings. The app recently incorporated GPT-4 into a “virtual volunteer” that, according to a statement on OpenAI’s website, “can generate the same level of context and understanding as a human volunteer.”
But GPT-4’s image analysis goes beyond describing pictures. In the same demonstration Vee watched, an OpenAI representative sketched an image of a simple website and fed the drawing to GPT-4. The model was then asked to write the code needed to produce such a site, and it did. “It basically looked like what the picture is. It was very, very simple, but it worked pretty well,” says Jonathan May, a research associate professor at the University of Southern California. “That was nice.”
Even without its new multimodal capability, the program outperforms its predecessors at tasks that require reasoning and problem solving. OpenAI says it has run both GPT-3.5 and GPT-4 through a variety of tests designed for humans, including a simulation of the bar exam, the SAT and Advanced Placement tests for high school students, the GRE for college graduates and even a couple of sommelier exams. GPT-4 achieved human-level scores on many of these benchmarks and consistently outperformed its predecessor, though it didn’t get everything right: it performed poorly on English language and literature exams, for example. Still, its extensive problem-solving ability could be applied to any number of real-world uses, such as managing a complex schedule, finding errors in a block of code, explaining grammatical nuances to language learners or detecting security vulnerabilities.
In addition, OpenAI claims the new model can interpret and produce longer blocks of text: more than 25,000 words at once. Although earlier models were also used for long-form applications, they often lost track of what they were talking about. And the company touts the new model’s “creativity,” described as its ability to produce different kinds of artistic content in specific styles. In a demonstration comparing how GPT-3.5 and GPT-4 imitated the style of Argentine author Jorge Luis Borges in English translation, Vee noted that the newer model produced a more accurate attempt. “You have to know enough about the context to judge it,” she says. “An undergraduate might not understand why it’s better, but I’m an English professor… If you understand it from your own domain and it’s impressive in your own domain, then that’s impressive.”
May has tested the model’s creativity himself, giving it the playful task of creating a “backronym” (an acronym arrived at by starting with the abbreviated version and working backward). In this case, May asked for a cute name for his lab that would spell out “CUTE LAB NAME” and that would also accurately describe his field of research. GPT-3.5 failed to generate a relevant label, but GPT-4 succeeded. “It came up with ‘Computational Understanding and Transformation of Expressive Language Analysis, Bridging NLP, Artificial Intelligence and Machine Education,’” he says. “‘Machine Education’ is not great; the ‘intelligence’ part means there’s an extra letter in there. But honestly, I’ve seen a lot worse.” (For reference, his lab’s actual name is CUTE LAB NAME, or Center for Useful Techniques Enhancing Language Applications Based on Natural And Meaningful Evidence.) In another test, the model showed the limits of its creativity. When May asked it to write a specific kind of sonnet, requesting the form used by the Italian poet Petrarch, the model, unfamiliar with that poetic setup, defaulted to the sonnet form preferred by Shakespeare.
Of course, solving this particular problem would be relatively simple: GPT-4 just needs to learn an additional poetic form. In fact, when people prod the model into failing in this way, it helps the program grow: it can learn from everything informal testers feed into the system. Like its less fluent predecessors, GPT-4 was first trained on large swaths of data, and that training was then refined by human testers. (GPT stands for generative pretrained transformer.) But OpenAI has been secretive about exactly how it made GPT-4 better than GPT-3.5, the model that powers the company’s popular ChatGPT chatbot. According to the paper published alongside the new model’s release, “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” OpenAI’s lack of transparency reflects the newly competitive environment for generative AI, in which GPT-4 must compete with programs such as Google’s Bard and Meta’s LLaMA. The paper goes on to suggest, however, that the company eventually plans to share such details with third parties “who can advise us on how to weigh the competitive and safety considerations … against the scientific value of further transparency.”
These safety concerns are significant because the smartest chatbots are capable of causing harm: without guardrails, they could provide a would-be terrorist with instructions for building a bomb, churn out threatening messages for a harassment campaign or supply disinformation to a foreign agent trying to sway an election. Although OpenAI has placed limits on what its GPT models are allowed to say in order to avoid such scenarios, determined testers have found ways around them. “These things are like bulls in a china shop; they’re powerful, but they’re reckless,” scientist and author Gary Marcus told Scientific American shortly before the release of GPT-4. “I don’t think [version] four is going to change that.”
And the more humanlike these bots become, the better they are at tricking people into thinking there is a sentient agent behind the computer screen. “Because it mimics [human reasoning] so well through language, we believe that, but under the hood, it isn’t reasoning in any way similar to the way people do,” Vee warns. If this illusion leads people to believe that an AI agent is performing humanlike reasoning, they may trust its answers more readily. This is a serious problem because there is still no guarantee that those answers are accurate. “Just because these models say something, it doesn’t mean that what they’re saying is [true],” May says. “There is no database of answers that these models draw from.” Instead, systems like GPT-4 generate an answer one word at a time, with the most plausible next word informed by their training data, and that training data can become outdated. “I think GPT-4 doesn’t even know it’s GPT-4,” he says. “I asked it, and it said, ‘No, no, there’s no such thing as GPT-4. I’m GPT-3.’”
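To make that word-by-word process concrete, here is a minimal, purely illustrative Python sketch. It is not OpenAI’s code, and the tiny probability table is invented for the example; it simply shows the basic loop of repeatedly picking the most plausible next word given the words so far, which a model like GPT-4 carries out at a vastly larger scale.

```python
# Toy illustration of next-word prediction; not OpenAI's actual code.
# The probability table below is invented purely for this example.
toy_model = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"down": 0.9, "up": 0.1},
}

def generate(prompt_words, max_steps=3):
    words = list(prompt_words)
    for _ in range(max_steps):
        candidates = toy_model.get(tuple(words))
        if not candidates:
            break  # the "model" has no prediction for this context
        # Pick the most plausible next word, exactly one word at a time.
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

print(generate(["the"]))  # prints: the cat sat down
```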
Now that the model has been released, many researchers and AI enthusiasts have the opportunity to explore the strengths and weaknesses of GPT-4. Developers who want to use it in other applications can apply for access, and anyone who wants to “talk” to the program will need to sign up for ChatGPT Plus. For $20 per month, this paid plan lets users choose between chatting with a chatbot running on GPT-3.5 and one running on GPT-4.
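For developers who do receive API access, a request to the model looked roughly like the sketch below around the time of GPT-4’s release, using OpenAI’s official Python package. The prompt text here is hypothetical, the API key is a placeholder, and the interface may change over time.

```python
# Hedged sketch of calling GPT-4 through OpenAI's Python package as it
# existed around the model's release; assumes granted API access.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, not a real key

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Summarize the rules of a Petrarchan sonnet."},
    ],
)
print(response["choices"][0]["message"]["content"])
```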
Such explorations will undoubtedly reveal more potential applications, and more flaws, in GPT-4. “The real question should be, ‘How will people feel about this two months from now, after the initial shock?’” says Marcus. “Part of my advice is: let’s temper our initial excitement by realizing that we’ve seen this movie before. It’s always easy to make a demo of something; it’s hard to make it a real product. And if it still has these problems, around hallucination, not understanding the physical world, the medical world and so on, that will limit its usefulness somewhat. And it will still mean that you have to pay close attention to how it’s used and what it’s used for.”