On Thursday, OpenAI researchers unveiled CriticGPT, a new AI model designed to identify bugs in code generated by ChatGPT. It aims to improve the process of making AI systems behave the way humans want (called “alignment”) through reinforcement learning from human feedback (RLHF), by helping human reviewers grade the outputs of large language models (LLMs) more accurately.
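For context, RLHF typically works by having human reviewers compare alternative model outputs and mark which one is better; those comparisons are then used to steer the model toward preferred behavior. As a rough, hypothetical illustration only (the structure and field names below are invented for this example, not OpenAI’s actual format), a single preference judgment might be recorded like this:

```python
from dataclasses import dataclass

@dataclass
class PreferenceLabel:
    """One human comparison of the kind used in RLHF-style training (illustrative only)."""
    prompt: str          # what the model was asked to do
    response_a: str      # one candidate output from the LLM
    response_b: str      # a second candidate output
    preferred: str       # "a" or "b", as judged by the human reviewer

# A reviewer grading two candidate code snippets for the same prompt.
label = PreferenceLabel(
    prompt="Write a function that returns the largest element of a list.",
    response_a="def largest(xs):\n    return sorted(xs)[0]",  # buggy: returns the smallest element
    response_b="def largest(xs):\n    return max(xs)",
    preferred="b",
)
print(label.preferred)
```

CriticGPT targets the hardest part of that loop: judging whether a long piece of generated code is actually correct.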
As described in a new research paper called “LLM Critics Help Catch LLM Bugs,” OpenAI created CriticGPT to act as an AI assistant for the human trainers who review programming code generated by the ChatGPT artificial intelligence assistant. CriticGPT, based on the GPT-4 family of LLMs, analyzes code and flags potential errors, making it easier for humans to spot mistakes that might otherwise go unnoticed.

To build that capability, OpenAI trained CriticGPT on a large data set of code samples containing deliberately inserted bugs. Human trainers were asked to modify code written by ChatGPT, introducing errors and then writing example feedback as if they had just discovered those errors. This process taught the model to recognize and critique many different types of coding errors.
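As a loose sketch of that data-collection step (the structure and names below are invented for illustration and are not OpenAI’s pipeline), each training example pairs a piece of deliberately tampered code with the critique the trainer wrote about the bug they inserted:

```python
from dataclasses import dataclass

@dataclass
class CritiqueExample:
    """One illustrative training example: tampered code plus a reference critique."""
    original_code: str   # code ChatGPT produced
    tampered_code: str   # the same code after a trainer inserted a bug
    critique: str        # feedback written as if the trainer had just found the bug

example = CritiqueExample(
    original_code=(
        "def average(xs):\n"
        "    return sum(xs) / len(xs)\n"
    ),
    tampered_code=(
        "def average(xs):\n"
        "    return sum(xs) / (len(xs) - 1)\n"   # inserted off-by-one bug
    ),
    critique=(
        "The function divides by len(xs) - 1 instead of len(xs), so it "
        "overstates the average and crashes on single-element lists."
    ),
)
print(example.critique)
```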
In experiments, CriticGPT demonstrated its ability to catch both inserted errors and naturally occurring errors in ChatGPT’s output. Trainers preferred the new model’s critiques over those generated by ChatGPT itself in 63 percent of cases involving naturally occurring bugs, in part because CriticGPT produced fewer unhelpful “nitpicks” and generated fewer false positives, or hallucinated problems.
The researchers also created a new technique they call Force Sampling Beam Search (FSBS), which helps CriticGPT write more detailed code reviews. It lets researchers adjust how thoroughly CriticGPT searches for problems while controlling how often it invents issues that don’t actually exist, so the balance can be tuned for different AI training tasks.
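As a toy illustration of that dial only (the candidate critiques, scores, and `thoroughness` parameter below are invented for this example and are not OpenAI’s implementation), the effect is like choosing among candidate critiques with a tunable bonus for length and coverage:

```python
def pick_critique(scored_candidates: list[tuple[str, float]], thoroughness: float) -> str:
    """Pick the critique that best trades off judged quality against thoroughness.

    A larger `thoroughness` favors longer, more exhaustive critiques, which tends
    to surface more real bugs but also risks more invented ones.
    """
    return max(
        scored_candidates,
        key=lambda item: item[1] + thoroughness * len(item[0]),
    )[0]

# Hypothetical candidates with made-up quality scores: the short critique is
# judged slightly more precise, the long one more comprehensive but riskier.
candidates = [
    ("Divides by len(xs) - 1 instead of len(xs).", 0.80),
    ("Divides by len(xs) - 1 instead of len(xs); also crashes on single-element "
     "and empty lists, and the docstring is missing.", 0.70),
]

print(pick_critique(candidates, thoroughness=0.0))    # favors the higher-scored short critique
print(pick_critique(candidates, thoroughness=0.005))  # length bonus now favors the longer critique
```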
Interestingly, researchers discovered that CriticGPT’s capabilities extend beyond mere code review. In their experiments, they applied the model to a subset of ChatGPT training data that had previously been rated as flawless by human annotators. Surprisingly, CriticGPT identified errors in 24 percent of these cases, errors that were later confirmed by human reviewers. OpenAI believes this demonstrates the model’s potential to generalize to non-coding tasks and highlights its ability to detect subtle errors that even careful human evaluation might miss.
Despite its promising results, CriticGPT, like all AI models, has limitations. The model was trained on relatively short ChatGPT responses, which may not fully prepare it to evaluate the longer, more complex tasks that future AI systems might tackle. And while CriticGPT reduces confabulations, it does not eliminate them entirely, so human trainers can still make labeling mistakes based on those false outputs.
The research team acknowledges that CriticGPT is most effective at identifying errors that can be pinpointed to a specific location in the code. Real-world mistakes in AI outputs, however, are often spread across multiple parts of a response, which presents a challenge for future iterations of the model.
OpenAI plans to integrate CriticGPT-like models into its RLHF labeling pipeline, giving its trainers AI assistance. For OpenAI, it’s a step toward developing better tools for evaluating outputs from LLM systems that may be difficult for humans to rate without additional support. However, the researchers caution that even with tools like CriticGPT, extremely complex tasks or responses may still prove challenging for human raters, including those assisted by AI.