How chatGPT works

Understanding RLHF

RLHF was initially unveiled in Deep Reinforcement Learning from Human Preferences, a research paper published by OpenAI in 2017. The key to the technique is that it operates in RL environments in which the task at hand is hard to specify. In these scenarios, human feedback can make a huge difference.

RLHF uses small amounts of feedback from a human evaluator to guide the agent’s understanding of the goal and its corresponding reward function. The training process is a three-step feedback cycle. The AI agent starts by acting randomly in the environment. Periodically, it presents two video clips of its behavior to the human evaluator, who decides which clip is closer to fulfilling the goal (in OpenAI’s original experiment, a simulated robot performing a backflip). The agent then uses this feedback to gradually build a model of the goal and of the reward function that best explains the human’s judgments. Once the agent has a clear understanding of the goal and the corresponding reward function, it uses RL to learn how to achieve it. As its behavior improves, it keeps asking for human feedback on the trajectory pairs where it is most uncertain about which is better, further refining its understanding of the goal.
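
To make the loop above concrete, here is a minimal, self-contained sketch in Python. It is a toy illustration and not OpenAI’s implementation: the human evaluator is simulated by a simple rule, trajectories are just arrays of numbers, and the reward model is a linear function fitted to pairwise preferences.

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory_features(actions):
    """Summarize a trajectory of actions as a small feature vector (toy example)."""
    actions = np.asarray(actions, dtype=float)
    return np.array([actions.mean(), actions.std()])

def simulated_human(feat_a, feat_b):
    """Stand-in for the human evaluator: prefers the trajectory with the higher mean."""
    return 0 if feat_a[0] >= feat_b[0] else 1

w = np.zeros(2)  # learned reward model: a linear function of trajectory features

def learned_reward(feat):
    return feat @ w

for step in range(500):
    # 1. The agent behaves in the environment (here: two random trajectories).
    traj_a = rng.uniform(-1, 1, size=10)
    traj_b = rng.uniform(-1, 1, size=10)
    fa, fb = trajectory_features(traj_a), trajectory_features(traj_b)

    # 2. The "human" says which of the two clips is closer to the goal.
    preferred = simulated_human(fa, fb)

    # 3. One SGD step on a logistic (Bradley-Terry) loss so that the learned
    #    reward explains the human's judgment.
    diff = learned_reward(fa) - learned_reward(fb)
    p_a = 1.0 / (1.0 + np.exp(-diff))      # predicted probability that A is preferred
    target = 1.0 if preferred == 0 else 0.0
    w -= 0.1 * (p_a - target) * (fa - fb)  # gradient of the logistic loss

# From here, any RL algorithm could optimize the learned reward; as a sanity
# check, the weight on the "mean action" feature should have become positive.
print("learned reward weights:", w)
```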

The initial applications of RLHF were in areas like robotics, but nobody could have predicted that it would become a key building block of the most famous AI model in history.

RLHF and ChatGPT

The idea of using RLHF in ChatGPT was pioneered by a previous model: InstructGPT. In the case of InstructGPT, the process begins by collecting a dataset of human-written demonstrations on prompts submitted to the OpenAI API, which is then used to train supervised learning baselines. Next, a dataset of human-labeled comparisons between two model outputs on a larger set of API prompts is gathered. A reward model (RM) is trained on this dataset to predict which output the labelers would prefer. Finally, the RM is used as a reward function, and the GPT-3 policy is fine-tuned to maximize this reward using the Proximal Policy Optimization (PPO) algorithm.
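
The three stages can be summarized in code. The sketch below uses toy NumPy stand-ins for the models so that the flow of the pipeline is visible; the function names and the synthetic data are assumptions for illustration, not OpenAI’s actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy "prompt/response" feature dimension

# Stage 1: supervised fine-tuning (SFT) on human-written demonstrations.
# Here: fit a linear map from prompt features to demonstrated response features.
prompts = rng.normal(size=(100, DIM))
demos = prompts @ rng.normal(size=(DIM, DIM)) + 0.1 * rng.normal(size=(100, DIM))
sft_weights, *_ = np.linalg.lstsq(prompts, demos, rcond=None)

def sft_policy(prompt):
    return prompt @ sft_weights

# Stage 2: reward model (RM) trained on human comparisons of model outputs.
# Here: a fixed "quality" direction plays the role of the learned RM.
quality_direction = rng.normal(size=DIM)

def reward_model(response):
    return response @ quality_direction

# Stage 3: RL fine-tuning (PPO in the paper). Here: a single crude policy
# improvement step that nudges outputs toward higher reward.
def rl_finetuned_policy(prompt, step_size=0.05):
    response = sft_policy(prompt)
    return response + step_size * quality_direction  # gradient of the toy reward

test_prompt = rng.normal(size=DIM)
print("reward before RL:", reward_model(sft_policy(test_prompt)))
print("reward after RL: ", reward_model(rl_finetuned_policy(test_prompt)))
```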

This approach can be thought of as “unlocking” capabilities that GPT-3 already possessed but that were difficult to elicit through prompt engineering alone. The training procedure has limited ability to teach the model genuinely new capabilities, since it uses less than 2% of the compute and data that went into model pretraining.

A limitation of this method is the introduction of an “alignment tax”: aligning the models only on customer tasks can lead to worse performance on other academic NLP tasks. To mitigate this, the team introduced an algorithmic change during RL fine-tuning, in which a small fraction of the original data used to train GPT-3 is mixed in and trained on with the normal log-likelihood maximization. This approach maintains performance on safety and human preferences while also mitigating the performance drop on academic tasks, and in some cases it even surpasses the GPT-3 baseline.
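
This mitigation can be written as a combined objective: the usual RL term (reward minus a KL penalty that keeps the policy close to the supervised baseline) plus a weighted log-likelihood term on data from the original pretraining distribution. The sketch below is illustrative, not OpenAI’s code, and the coefficient values are placeholders rather than the ones used in the paper.

```python
import numpy as np

def combined_objective(rewards, kl_to_sft_policy, pretrain_logprobs,
                       kl_coef=0.02, pretrain_coef=1.0):
    """Return the quantity being maximized during RL fine-tuning (sketch).

    rewards            -- reward-model scores for sampled completions
    kl_to_sft_policy   -- per-sample KL penalty keeping the policy near the SFT model
    pretrain_logprobs  -- log-likelihood of tokens drawn from the pretraining mix
    The two coefficients are illustrative placeholders, not the paper's values.
    """
    rl_term = np.mean(np.asarray(rewards) - kl_coef * np.asarray(kl_to_sft_policy))
    pretrain_term = pretrain_coef * np.mean(pretrain_logprobs)
    return rl_term + pretrain_term

# Example with made-up numbers: keeping the pretraining likelihood high raises
# the objective, which is what discourages drifting away from general skills.
print(combined_objective(rewards=[1.2, 0.8], kl_to_sft_policy=[3.0, 2.5],
                         pretrain_logprobs=[-2.1, -1.9]))
```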

For ChatGPT, OpenAI followed a similar approach to InstructGPT, with a small difference in the data collection setup. Human AI trainers wrote conversations in which they played both sides, the user and the AI assistant, and had access to model-written suggestions to help compose their responses. The resulting dialogue dataset was then mixed with the InstructGPT dataset, transformed into a dialogue format.

For the reward model in ChatGPT, OpenAI collected responses generated by different models in conversations with AI trainers and had the trainers rank them by quality. This reward model then allows the policy to be fine-tuned with Proximal Policy Optimization over several iterations.
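
In practice, a ranking of several responses is decomposed into pairwise comparisons, and the reward model is trained so that higher-ranked responses receive higher scores. The snippet below is a minimal, illustrative sketch of that pairwise loss, not OpenAI’s implementation.

```python
import itertools
import math

def pairwise_ranking_loss(scores_in_rank_order):
    """Average logistic loss over all pairs implied by one human ranking.

    `scores_in_rank_order` are reward-model scores for responses listed from
    best to worst, so for every pair (i, j) with i < j the model should assign
    a higher score to response i.
    """
    losses = []
    for i, j in itertools.combinations(range(len(scores_in_rank_order)), 2):
        diff = scores_in_rank_order[i] - scores_in_rank_order[j]
        losses.append(math.log(1.0 + math.exp(-diff)))  # -log sigmoid(diff)
    return sum(losses) / len(losses)

# Example: the reward model agrees with the ranking (loss is low) ...
print(pairwise_ranking_loss([2.0, 0.5, -1.0]))
# ... and disagrees with it (loss is high), so training would reorder the scores.
print(pairwise_ranking_loss([-1.0, 0.5, 2.0]))
```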
