Sandhini Agarwal: We have many next steps. I definitely think how viral ChatGPT has become has made a lot of issues that we knew existed really bubble up and become critical—things we want to solve as soon as possible. Like, we know the model is still very biased. And yes, ChatGPT is very good at refusing bad requests, but it’s also very easy to write prompts that get it to say things we wanted it to refuse.
Liam Fedus: It’s been exciting to watch users’ diverse and creative applications, but we’re always focusing on areas where we need to improve. We believe that through an iterative process where we develop, get feedback and refine, we can produce the most aligned and capable technology. As our technology evolves, new issues inevitably arise.
Sandhini Agarwal: In the weeks since release, we’ve looked at some of the most egregious examples people have come up with—the worst things people have seen in the wild. We assessed each of them and talked about how we should fix it.
Jan Leike: Sometimes it’s something that’s gone viral on Twitter, but we have some people who actually reach out quietly.
Sandhini Agarwal: A lot of what we found were jailbreaks, which is definitely a problem we need to fix. But because users have to go to such convoluted lengths to get the model to say something bad, it wasn’t something we had completely missed or that came as a big surprise. Still, it’s something we’re actively working on right now. When we find jailbreaks, we add them to our training and testing data. All the data we see feeds into a future model.
Jan Leike: Whenever we have a better model, we want to get it out and test it. We’re very optimistic that some targeted adversarial training can greatly improve the jailbreaking situation. It’s not clear whether these problems will go away completely, but we think we can make a lot of jailbreaking much more difficult. Then again, it’s not as if we didn’t know jailbreaking was possible before launch. I think it’s very difficult to really anticipate what the real safety problems with these systems will be once you’ve deployed them. So we put a lot of emphasis on tracking what people are using the system for, seeing what’s happening, and then reacting to it. That doesn’t mean we shouldn’t proactively mitigate safety problems when we foresee them. But yes, it’s very hard to predict everything that will actually happen when a system hits the real world.
In January, Microsoft unveiled Bing Chat, a search chatbot that many assume is a version of OpenAI’s officially unannounced GPT-4. (OpenAI says: “Bing is powered by one of our next-generation models that Microsoft tailored specifically for search. It incorporates advances from ChatGPT and GPT-3.5.”) The use of chatbots by tech giants with multibillion-dollar reputations to protect creates new challenges for those charged with building the underlying models.