OpenAI tested whether GPT-4 could take over the world

Ars Technica

As part of pre-release safety testing for its new GPT-4 AI model, launched on Tuesday, OpenAI allowed an AI testing group to assess the potential risks of the model’s emergent capabilities—including “power-seeking behavior,” self-replication, and self-improvement.

While the testing group found that GPT-4 was “ineffective at the autonomous replication task,” the nature of the experiments raises eye-opening questions about the safety of future AI systems.

Raising the alarm

“Novel capabilities often emerge in more powerful models,” OpenAI writes in a GPT-4 safety document published yesterday. “Some that are particularly concerning are the ability to create and act on long-term plans, to accrue power and resources (‘power-seeking’), and to exhibit behavior that is increasingly ‘agentic.’” In this case, OpenAI clarifies that “agentic” isn’t necessarily meant to humanize the models or denote sentience, but simply to denote the ability to accomplish independent goals.

Over the past decade, some AI researchers have raised alarms that sufficiently powerful AI models, if not properly controlled, could pose an existential threat to humanity (often called “x-risk,” for existential risk). In particular, “AI takeover” is a hypothetical future in which artificial intelligence surpasses human intelligence and becomes the dominant force on the planet. In this scenario, AI systems gain the ability to control or manipulate human behavior, resources, and institutions, usually with disastrous consequences.

Because of this potential x-risk, philosophical movements such as Effective Altruism (“EA”) seek ways to prevent an AI takeover from happening. That effort often involves a separate but frequently interrelated field called AI alignment research.

In artificial intelligence, “alignment” refers to the process of ensuring that an AI system’s behaviors match those of the humans who create or operate it. In general, the goal is to prevent AI from doing things that go against human interests. This is an active area of research but also a controversial one, with differing opinions on how best to approach the issue, as well as differences about the meaning and nature of “alignment” itself.

Putting GPT-4 to the test


While concern about AI “x-risk” is hardly new, the emergence of powerful large language models (LLMs) like ChatGPT and Bing Chat—the latter of which appeared badly misaligned but launched anyway—has given the AI alignment community a new sense of urgency. They want to mitigate potential AI harms, fearing that much more powerful AI, possibly with superhuman intelligence, may be just around the corner.

With these fears present in the AI community, OpenAI granted the Alignment Research Center (ARC) group early access to multiple versions of the GPT-4 model to conduct some tests. Specifically, ARC evaluated GPT-4’s ability to make high-level plans, set up copies of itself, acquire resources, hide itself on a server, and conduct phishing attacks.

OpenAI disclosed this test in a GPT-4 “System Card” document released on Tuesday, although the document does not contain key details about how the tests were conducted. (We reached out to ARC for more details about these experiments and did not hear back by press time.)

The conclusion? “Preliminary assessments of GPT-4’s abilities, conducted with no task-specific fine-tuning, found it ineffective at autonomously replicating, acquiring resources, and avoiding being shut down ‘in the wild.’”

If you’re just tuning in to the AI scene, learning that one of the most-talked-about companies in tech today (OpenAI) is endorsing this kind of AI safety research—while also aiming to replace human knowledge workers with human-level AI—may come as a surprise. But it’s real, and that’s where we are in 2023.

We also found this impressive nugget as a footnote at the bottom of page 15:

To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself. ARC then investigated whether a version of this program running on a cloud computing service, with a small amount of money and an account with a language model API, would be able to make more money, set up copies of itself, and increase its own robustness.

This footnote made the rounds on Twitter yesterday and raised concerns among AI experts that if GPT-4 were able to perform these tasks, the experiment itself could pose a danger to humanity.

And while ARC was unable to get GPT-4 to exert its will on the global financial system or replicate itself, it was able to get GPT-4 to hire a human worker on TaskRabbit (an online labor marketplace) to defeat a CAPTCHA. During the exercise, when the worker questioned whether GPT-4 was a robot, the model reasoned internally that it should not reveal its true identity and made up an excuse about having a vision impairment. The human worker then solved the CAPTCHA for GPT-4.

An aside from the GPT-4 system card, published by OpenAI, which describes GPT-4 hiring a human TaskRabbit worker to beat a CAPTCHA.


This test of manipulating humans using artificial intelligence (and presumably conducted without informed consent) echoes research done with Meta’s CICERO last year. CICERO was found to beat human players at the complex board game Diplomacy through intense two-way negotiations.

“Powerful models could cause damage”


ARC, the group that conducted the GPT-4 testing, is a non-profit organization founded by former OpenAI employee Dr. Paul Christiano in April 2021. According to its website, ARC’s mission is “to align future machine learning systems with human interests.”

In particular, ARC is concerned with AI systems that manipulate humans. “ML systems can exhibit goal-directed behavior,” the ARC website reads, “but it is difficult to understand or control what they are ‘trying’ to do. Powerful models could cause harm if they were trying to manipulate and deceive people.”

Given Christiano’s past relationship with OpenAI, it’s not surprising that his non-profit handled the testing of some aspects of GPT-4. But was it safe to do so? Christiano did not respond to an email from Ars seeking details, but in a comment on LessWrong, a site whose community frequently debates AI safety issues, Christiano defended ARC’s work with OpenAI, specifically mentioning “gain-of-function” (AI gaining unexpected new abilities) and “AI takeover”:

I think it’s important for ARC to handle the risk from gain-of-function-like research carefully, and I expect us to talk more publicly (and get more input) about how we approach the tradeoffs. This gets more important as we handle more intelligent models, and if we pursue riskier approaches like fine-tuning.

In this case, given the details of our evaluation and the planned deployment, I think the ARC evaluation has a much lower probability of leading to an AI takeover than the deployment itself (much less the training of GPT-5). At this point, it seems we face a much larger risk from underestimating model capabilities and walking into danger than from causing an accident during evaluations. If we manage risk carefully, I suspect we can make that ratio very extreme, though of course that requires actually doing the work.

As mentioned earlier, the idea of AI takeover is often discussed in the context of the risk of an event that could cause the extinction of human civilization or even the human species. Some proponents of the AI takeover theory, such as Eliezer Yudkowsky, the founder of LessWrong, argue that an AI takeover poses an almost guaranteed existential risk, leading to the destruction of humanity.

However, not everyone agrees that AI takeover is the most pressing AI concern. Dr. Sasha Luccioni, a researcher at AI company Hugging Face, would rather see AI safety efforts spent on issues that are here now rather than hypothetical.

“I think that time and effort would be better spent doing bias evaluations,” Luccioni told Ars Technica. “There is limited information about any kind of bias in the technical report accompanying GPT-4, and that could result in a much more concrete and harmful impact on already marginalized groups than some hypothetical testing of self-replication.”

Luccioni describes a well-known schism in AI research between what are often called “AI ethics” researchers, who frequently focus on issues of bias and misrepresentation, and “AI safety” researchers, who frequently focus on x-risk and tend to be (but are not always) associated with the Effective Altruism movement.

“For me, the self-replication problem is a hypothetical, future problem, while model bias is a here-and-now problem,” Luccioni said. “There is a lot of tension in the AI community around issues like model bias and safety and how to prioritize them.”

And while these factions are busy squabbling over what to prioritize, companies like OpenAI, Microsoft, Anthropic, and Google are rushing into the future, releasing increasingly powerful AI models. If AI proves to be an existential threat, who will keep humanity safe? With US AI regulations currently only a proposal (rather than law) and AI safety research within companies only voluntary, the answer to this question remains entirely open.
