Kahneman-Tversky Optimization (KTO) // align LLMs with human feedback by directly maximizing utility based on prospect theory

sbagency
6 min read · Jan 7, 2024


https://contextual.ai/better-cheaper-faster-llm-alignment-with-kto/

Today, we’re releasing a method called Kahneman-Tversky Optimization (KTO) that makes it easier and cheaper than ever before to align LLMs on your data without compromising performance. The success of LLMs has been driven in no small part by alignment with human feedback. If ChatGPT has ever refused to answer your question, it’s likely because it has been trained to avoid saying something controversial. However, it has historically been difficult for companies to align their own LLMs. We’re excited to introduce the KTO method for improving the overall performance and quality of LLMs, while also saving on costs.

Contextual AI introduces Kahneman-Tversky Optimization (KTO), a cost-effective method to align Large Language Models (LLMs) with human feedback, enhancing performance without relying on preference pairs. Historically, LLM alignment has been challenging due to complexities in existing methods like Reinforcement Learning from Human Feedback (RLHF) and the high cost of obtaining preference data. KTO overcomes these challenges by leveraging singleton feedback — identifying whether an output is desirable or undesirable for a given input. The approach, inspired by Kahneman & Tversky’s work on human decision-making, outperforms Direct Preference Optimization (DPO) and standard fine-tuning, offering a significant boost in performance. Contextual AI also releases Archangel, a suite of 56 human-feedback aligned language models, enabling researchers to explore the impact of alignment on different scales and methodologies. The open-source repository for KTO facilitates easy implementation and customization, while models aligned with KTO are available on Hugging Face.

https://github.com/ContextualAI/HALOs/blob/main/assets/report.pdf

From Kahneman & Tversky’s seminal work on prospect theory (1992), we know that humans perceive random variables in a systematically distorted manner; for example, they are more sensitive to losses than gains of the same magnitude. We show that existing methods for aligning LLMs with human feedback implicitly model some of these distortions, making them human-centered loss functions (HALOs). However, the utility functions these methods impute to humans still differ in some ways from those in the prospect theory literature. By bridging this gap, we derive a HALO that directly maximizes the utility of LLM generations instead of maximizing the log-likelihood of preferences, as current methods do. We call our approach Kahneman-Tversky Optimization (KTO). KTO matches or exceeds the performance of direct preference optimization methods at scales from 1B to 30B. Moreover, because KTO does not need preference pairs — only knowledge of whether an output is desirable or undesirable for a given input — it is much easier to deploy in the real world, where the latter kind of data is far more abundant.
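
As a rough illustration, here is a minimal PyTorch sketch of a KTO-style loss under the binary desirable/undesirable setup described above. It is simplified for clarity: the reference point is estimated here as a batch-level mean of implicit rewards, and the weights lambda_d and lambda_u are illustrative assumptions, so this is not the released HALOs implementation.

```python
# Minimal sketch of a KTO-style loss (simplified; not the official HALOs code).
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """policy_logps, ref_logps: summed log-probs of each output under the
    policy and the frozen reference model; is_desirable: bool tensor marking
    outputs labeled desirable (True) vs undesirable (False)."""
    # Implicit reward: how strongly the policy prefers y relative to the reference.
    rewards = policy_logps - ref_logps

    # Reference point: a crude batch-level stand-in for the KL(policy || reference)
    # estimate described in the report, clamped at zero and detached from gradients.
    z_ref = torch.clamp(rewards.mean().detach(), min=0.0)

    # Prospect-theory-flavored value function: saturating in gains and losses,
    # with separate weights so undesirable outputs can be penalized more heavily.
    gain = lambda_d * (1 - torch.sigmoid(beta * (rewards - z_ref)))
    loss = lambda_u * (1 - torch.sigmoid(beta * (z_ref - rewards)))

    return torch.where(is_desirable, gain, loss).mean()
```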

https://github.com/ContextualAI/HALOs
https://github.com/eric-mitchell/direct-preference-optimization
https://arxiv.org/pdf/2305.18290.pdf

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
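
The "simple classification loss" mentioned above can be written in a few lines. Below is a hedged PyTorch sketch of the DPO objective, assuming you already have summed log-probabilities of the chosen and rejected completions under both the policy and a frozen reference model; it is a simplified rendering, not the reference implementation linked above.

```python
# Minimal sketch of the DPO objective (simplified; see the official repo for the
# reference implementation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Inputs: summed log-probs of the preferred (chosen) and dispreferred
    (rejected) completions under the policy and the frozen reference model."""
    # Implicit reward margins under the policy and the reference model.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps

    # Maximize the Bradley-Terry likelihood that the chosen response wins,
    # with implicit reward beta * log(pi_theta / pi_ref); no sampling or RL loop.
    logits = beta * (policy_logratio - ref_logratio)
    return -F.logsigmoid(logits).mean()
```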

https://arxiv.org/pdf/2401.01335v1.pdf

Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM’s performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
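
To make the self-play loop concrete, here is a pseudocode-level sketch of one SPIN iteration. The names spin_iteration, generate, and train_pairwise are hypothetical placeholders for your own generation and training utilities; the pairwise step is shown as a DPO-style objective in which human-annotated responses play the "chosen" role and self-generated responses the "rejected" role, which is one way to read the abstract, not the paper's exact formulation.

```python
# Pseudocode-level sketch of one SPIN iteration (placeholder helpers; not the
# authors' implementation).
def spin_iteration(current_model, prev_model, sft_dataset):
    """One self-play round: the previous iterate acts as the 'opponent' whose
    generations the updated model must distinguish from human-annotated data."""
    pairs = []
    for prompt, human_response in sft_dataset:
        # The opponent (previous iteration) produces a synthetic response.
        synthetic_response = prev_model.generate(prompt)  # hypothetical helper
        # Human data is treated as preferred, self-generated as dispreferred.
        pairs.append((prompt, human_response, synthetic_response))

    # Fine-tune so the model assigns higher implicit reward to human responses
    # than to its own previous generations (DPO-style pairwise objective).
    train_pairwise(current_model, ref_model=prev_model, pairs=pairs)  # hypothetical
    return current_model
```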

LLM alignment can also be done by other models.

https://arxiv.org/pdf/2305.13735.pdf

Aligning large language models (LLMs) to human values has become increasingly important as it enables sophisticated steering of LLMs. However, it requires significant human demonstrations and feedback or distillation from proprietary LLMs such as ChatGPT. In this work, we propose a novel alignment learning framework with synthetic feedback not dependent on extensive human annotations and proprietary LLMs. First, we perform reward modeling (RM) with synthetic feedback by contrasting responses from vanilla LLMs with various sizes and prompts. Then, we use the RM to simulate high-quality demonstrations to train a supervised policy and further optimize the model with reinforcement learning. Our resulting model, Aligned Language Model with Synthetic Training dataset (ALMoST), outperforms recent open-sourced models, which are trained on the outputs of InstructGPT or human-annotated demonstrations, in alignment benchmarks. In human evaluation, our model is preferred to Alpaca and Dolly-v2, 55.0% and 58.5% of the time, respectively. Further analyses demonstrate the efficacy and importance of synthetic feedback in our framework.
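
To illustrate the "contrasting responses from vanilla LLMs with various sizes and prompts" step, here is a hypothetical sketch of how synthetic rankings for reward modeling might be assembled. Everything here (the generators list, the generate call, the pairing rule) is a placeholder for illustration; the paper's actual pipeline uses specific model sizes, prompt formats, and filtering heuristics.

```python
# Hypothetical sketch of ALMoST-style synthetic ranking for reward modeling.
def synthetic_ranking(prompt, generators):
    """generators: a list of (model, n_demos) pairs ordered from weakest
    (small model, few in-context demonstrations) to strongest. The working
    assumption: stronger configurations tend to produce better responses,
    which yields preference pairs without human labels."""
    responses = []
    for model, n_demos in generators:  # weakest configuration first
        responses.append(model.generate(prompt, num_demos=n_demos))  # placeholder API

    # Every later (stronger) response is assumed to beat every earlier one;
    # each (better, worse) pair becomes a training example for the reward model.
    pairs = [(responses[j], responses[i])
             for i in range(len(responses))
             for j in range(i + 1, len(responses))]
    return pairs
```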

https://openai.com/blog/introducing-superalignment

Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems. But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction.

Lol: Superintelligence, superalignment, super…

https://openai.com/blog/our-approach-to-alignment-research

Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned. Using scientific experiments, we study how alignment techniques scale and where they will break.
