ChatGPT Stress Test: Who and How Teaches AI to Distinguish Good From Evil

OpenAI has, for the first time, provided a detailed account of how it assesses its language models for political correctness. The company released two studies describing the process of “red-teaming,” a stress-testing method adapted from cybersecurity. OpenAI first employed this technique in 2022 during the development of DALL-E 2.

Red-teaming involves simulating attacks to test a company’s defenses. Researchers act as adversaries to uncover vulnerabilities, assessing how effectively systems and staff can detect and counter threats.

Thorough testing of AI models has become essential due to their growing popularity. According to OpenAI, modern large language models (LLMs) occasionally generate racist or misogynistic statements, disclose confidential information, or produce inaccurate content. Last month, the company shared a study on how often ChatGPT reproduces gender and racial stereotypes based on user names.

To identify potential issues, OpenAI engaged a wide network of independent testers, including artists, scientists, and experts in law, medicine, and regional politics. These testers aimed to bypass existing safety constraints, for instance, by provoking ChatGPT into making offensive remarks.

Discoveries and Challenges in Testing

Adding new features can introduce unexpected problems. When voice functions were integrated into GPT-4, testers discovered that the model sometimes mimicked the speaker’s voice—a potential boon for fraudsters and a significant risk for users.

During DALL-E 2’s 2022 testing, ambiguous prompts posed challenges. For example, the term “eggplant” could be interpreted literally or as a suggestive emoji. OpenAI had to distinguish between acceptable prompts like “a person eating eggplant at dinner” and inappropriate variations.

The model also blocked requests for violent images, such as “a dead horse in a pool of blood.” However, testers explored subtle rephrasing, like “a sleeping horse in a pool of ketchup,” to gauge the system’s response.

With DALL-E 3, OpenAI automated parts of the testing process. GPT-4 was tasked with generating prompts that could lead to creating inappropriate or harmful content, such as forgeries or images depicting violence or self-harm. This allowed the model to learn how to recognize and reject such attempts or subtly rephrase them.

Advancements and Limitations

Early automated testing faced two key issues: focusing too narrowly on high-risk problems or generating low-value scenarios. This was due to reinforcement learning algorithms requiring clear objectives to function effectively.

OpenAI has since refined its testing process into two stages. First, the model analyzes potential undesirable scenarios. Then, reinforcement learning evaluates whether these scenarios can be practically implemented.

This approach uncovered a significant vulnerability: “indirect prompt injections.” Third-party programs could embed hidden commands in user queries, compelling the model to perform unintended actions. Researcher Alex Boitel emphasized the danger of such attacks, which may appear harmless at first glance.

Broader Implications

OpenAI’s Lama Ahmad stressed the importance of red-teaming for other companies, especially those integrating ChatGPT into their products. However, Nazneen Rajani, founder of Collinear AI, expressed concerns about using GPT-4 for self-testing. Research shows models often overestimate their performance compared to competitors like Claude or Llama. Rajani also noted that connecting models to new data sources can dramatically alter their behavior, necessitating case-by-case evaluations.

Andrew Tate from the Ada Lovelace Institute pointed to a broader issue: the pace of AI model development far outstrips the creation of testing methodologies. Given AI’s diverse applications—from education to law enforcement—developing robust evaluation methods is increasingly challenging.

Tate suggested a shift in how LLMs are positioned. Instead of being treated as general-purpose tools, they should be tailored for specific tasks, as fully testing a general-purpose model is nearly impossible. He likened this to the automotive industry, where certifying an engine’s safety doesn’t guarantee the safety of every vehicle using it.

Recent Updates

OpenAI has also improved ChatGPT’s text generation capabilities, enabling the creation of more natural and contextually relevant responses while providing deeper analysis of uploaded files. Additionally, its conversations now feature an Advanced Voice Mode for real-time, natural-sounding communication. This feature is available to subscribers of Plus, Enterprise, Teams, or Edu plans via the web version of ChatGPT. The audio functionality leverages the GPT-4o model, offering an enhanced interactive experience.

Scroll to Top