CriticGPT to critique other GPTs

sbagency

--

https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf

Reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. To improve human evaluation ability and overcome that limitation, this work trains “critic” models that help humans more accurately evaluate model-written code. These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks. On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review. We further confirm that our fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as “flawless”, even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model. Critics can have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might have otherwise avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics while hallucinating less than LLMs alone.
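The evaluation setup behind these numbers is easiest to see in code. Below is a minimal sketch of the paper’s “tampering” idea: a human inserts a known bug into model-written code, the critic writes feedback, and a grader checks whether the feedback catches the inserted bug. `critic_model.generate` and `judge.mentions_bug` are hypothetical stand-ins, not OpenAI APIs.

```python
# Minimal sketch of the bug-catching evaluation described in the paper.
# `critic_model.generate` and `judge.mentions_bug` are hypothetical
# stand-ins, not real OpenAI APIs.
from dataclasses import dataclass

@dataclass
class TamperedSample:
    code: str             # assistant-written code with a human-inserted bug
    bug_description: str  # ground-truth description of that bug

def write_critique(critic_model, code: str) -> str:
    """Ask the critic model for natural-language feedback on the code."""
    return critic_model.generate(f"Find the bugs in this code:\n{code}")

def bug_catch_rate(critic_model, judge, samples: list[TamperedSample]) -> float:
    """Fraction of known inserted bugs that the critic's feedback catches."""
    caught = 0
    for sample in samples:
        feedback = write_critique(critic_model, sample.code)
        # The judge (a human grader in the paper) decides whether the
        # critique actually describes the inserted bug.
        if judge.mentions_bug(feedback, sample.bug_description):
            caught += 1
    return caught / len(samples)
```

The same harness can compare critics against each other: run human-written and model-written critiques through the same judge and measure which are preferred.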

Here’s a summary of the key points from the video:

1. OpenAI has introduced CriticGPT, a method for improving model critiques by combining reinforcement learning from human feedback (RLHF) with Force Sampling Beam Search (FSBS; see the sketch after this list).

2. CriticGPT is designed to help human trainers evaluate and improve AI model outputs, particularly during the alignment phase of training.

3. OpenAI acknowledges that typical human expertise is no longer sufficient to evaluate advanced AI outputs; training increasingly demands PhD-level domain knowledge.

4. CriticGPT aims to improve the quality and relevance of critiques while minimizing hallucinated bugs and nitpicks.

5. OpenAI states that it does not fully trust its own models, underscoring the need for ongoing evaluation of model outputs.

6. CriticGPT is trained specifically to help humans evaluate AI systems more effectively, creating a feedback loop in which AI helps humans, who in turn help improve AI.

7. Experiments show that CriticGPT, when combined with human expertise, outperforms both vanilla GPT and humans alone on certain tasks, such as bug detection in code.

8. OpenAI claims that applying CriticGPT in the alignment phase yields gains equivalent to roughly 30x more compute in the pre-training phase, suggesting significant efficiency gains.

9. The document raises the question of whether this approach addresses the root causes of AI trustworthiness or merely its symptoms.

10. While CriticGPT may not be directly applicable for most AI researchers, the broader concept of building AI systems that augment human performance could have significant industrial applications.
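The Force Sampling Beam Search mentioned in item 1 deserves a closer look. As the paper describes it, FSBS samples several critiques while forcing the model to open highlighted code quotes, then rescores each candidate with a reward model plus a per-highlight bonus, trading comprehensiveness against hallucination. The sketch below captures only that selection logic; `critic.generate_forced`, `reward_model.score`, and the "```" highlight marker are hypothetical stand-ins.

```python
# Rough sketch of Force Sampling Beam Search (FSBS) selection logic.
# `critic.generate_forced` and `reward_model.score` are hypothetical
# stand-ins; the "```" highlight marker is an assumption about the format.

def fsbs(critic, reward_model, code: str,
         n_candidates: int = 4, highlight_bonus: float = 0.5) -> str:
    """Pick the critique that best trades RM score against comprehensiveness."""
    scored = []
    for _ in range(n_candidates):
        # Force decoding to open a highlighted code quote so every candidate
        # critique comments on at least one concrete span of the code.
        text = critic.generate_forced(code, forced_prefix="```")
        # The reward model scores critique quality (hallucinated or
        # nitpicked bugs score lower); the bonus term favors critiques
        # that highlight more spans, i.e. more comprehensive ones.
        n_highlights = text.count("```") // 2  # fences come in open/close pairs
        score = reward_model.score(code, text) + highlight_bonus * n_highlights
        scored.append((score, text))
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    return best_text
```

Raising `highlight_bonus` pushes selection toward longer, more thorough critiques at the cost of more nitpicks and hallucinated bugs; a lower value yields more conservative feedback.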
