ChatGPT
First, the tech behind ChatGPT isn’t new. It’s based on “GPT-3.5”, an upgraded version of GPT-3 that became available to the public many months ago. Yet not much was being built around it until now.
The first step in creating ChatGPT was to adjust GPT-3.5 for conversations. OpenAI literally had human AI trainers provide conversations in which they played both sides: the user and an AI assistant. In other words, they paid people to chit-chat.
With a model capable of generating human-like answers, they needed a way to tell the AI which answers were good and which were bad. To solve that, they used humans (again) to rank randomly selected answers that ChatGPT was spitting out, from best to worst.
The ranking was then used to train a second model they called the "reward" model. So there are two models: 1. a model that can answer questions like a human. 2. a model that can say how good or bad an answer is. The last step is brilliant.
The last step was to train a reinforcement learning model, which works a bit like dog training: a reward is given for "good" behavior. So what was the "reward" here? Spoiler: not a cookie. They used the score from the reward model to train the answering model.
The recipe: 1. Have a model generate a human-like answer. 2. Have a model score that answer. 3. Have the model learn from the score and re-adjust its answers until it gets an A+. 4. Repeat a million times until the answers are accurate.
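Put together, the loop looks roughly like the sketch below. The objects and method names (generate_answer, score_answer, update_from_score) are hypothetical stand-ins for illustration, not OpenAI's actual code.

```python
# Conceptual sketch of the recipe above. All method names are hypothetical
# placeholders used only to illustrate the feedback loop.

def rlhf_loop(policy, reward_model, prompts, num_iterations=1_000_000):
    """Repeat: generate an answer, score it, nudge the policy toward higher scores."""
    for _ in range(num_iterations):
        for prompt in prompts:
            answer = policy.generate_answer(prompt)             # 1. human-like answer
            score = reward_model.score_answer(prompt, answer)   # 2. grade the answer
            policy.update_from_score(prompt, answer, score)     # 3. learn from the score
    return policy                                               # 4. repeat until accurate
```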
ChatGPT has been one of the most popular artificial intelligence (AI) agents ever created. The model has taken the data science community and the internet by storm, pushing the boundaries of creativity across all industries. Despite the immense popularity of ChatGPT, there has been very little discussion about the AI techniques behind its magic. Many of the techniques behind ChatGPT are going to be the foundation of the upcoming GPT-4, which promises to be one of the most impressive models in AI history.
The main ideas behind ChatGPT were pioneered by another OpenAI model, InstructGPT, which was released earlier this year. InstructGPT fine-tunes GPT to follow instructions, which opens the door to a wider set of human interactions. ChatGPT takes some of the ideas pioneered by InstructGPT to a whole new level with a novel architecture and training process.
Similarly to InstructGPT, the core architecture of ChatGPT relies on a “human-annotated data + reinforcement learning” (RLHF) method. The main idea of using RLHF is to continuously fine-tune the underlying language model to understand the meaning of human commands. However, ChatGPT differs in its data collection setup by including supervised fine-tuning with human AI trainers playing both the user and an AI assistant. The core ChatGPT training process is segmented into three main phases:
The objective of the first phase is to fine-tune the GPT-3.5 policy to understand a specific set of user commands. During this phase, users submit different batches of prompts, and high-quality answers are provided by human labelers. The <prompt, answer> dataset is then used to fine-tune GPT-3.5 so that it better understands the actions contained in prompts.
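A minimal sketch of what this supervised fine-tuning step can look like is below. It uses GPT-2 from Hugging Face transformers as a stand-in for GPT-3.5, whose weights are not public, and the demonstration pair is an illustrative placeholder.

```python
# Minimal sketch of supervised fine-tuning on <prompt, answer> pairs.
# GPT-2 stands in for GPT-3.5; the demonstration data is a placeholder.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [
    ("Explain photosynthesis in one sentence.",
     "Photosynthesis is the process by which plants turn sunlight, water and CO2 into sugar and oxygen."),
]

model.train()
for prompt, answer in demonstrations:
    # Concatenate the prompt and the labeler-written answer into one training sequence.
    text = prompt + "\n" + answer + tokenizer.eos_token
    inputs = tokenizer(text, return_tensors="pt")
    # Standard causal language-modeling loss: predict each next token of the sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```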
The goal of the second phase is to train a reward model using annotated training data. During this phase, ChatGPT samples a batch of prompts generated during the first phase and creates a number of different answers for each prompt. Using the <prompt, answer1, answer2,…, answerN> tuples, an annotator orders the answers based on multidimensional criteria that include aspects such as relevance, informativeness, harmfulness and several others. The resulting dataset is used to train the reward model.
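The sketch below illustrates the kind of pairwise ranking objective such a reward model can be trained with: for any two answers to the same prompt, the higher-ranked one should receive a higher scalar score. The tiny encoder and the toy data are placeholders to keep the example self-contained; in practice the scoring head sits on top of a large pretrained transformer.

```python
# Sketch of a reward model trained with a pairwise ranking loss over human rankings.
# The bag-of-embeddings encoder and random token ids are placeholders.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size=50257, hidden_size=768):
        super().__init__()
        # Placeholder encoder; a real reward model reuses the pretrained transformer.
        self.embed = nn.EmbeddingBag(vocab_size, hidden_size)
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, token_ids):
        # One scalar score per <prompt + answer> sequence.
        return self.score_head(self.embed(token_ids)).squeeze(-1)

def pairwise_ranking_loss(score_preferred, score_rejected):
    # Push the higher-ranked answer's score above the lower-ranked one's.
    return -torch.nn.functional.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: two tokenized sequences, where the first was ranked higher by the annotator.
rm = RewardModel()
better = torch.randint(0, 50257, (1, 32))
worse = torch.randint(0, 50257, (1, 32))
loss = pairwise_ranking_loss(rm(better), rm(worse))
loss.backward()
```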
The final phase of the ChatGPT training includes a reinforcement learning (RL) method to enhance the pretrained model. The RL algorithm uses the reward model of the previous phase to update the parameters of the pretrained model. Specifically, this phase initializes a batch of new commands from the prompts as well as the parameters used by the proximal policy optimization (PPO) model. For each prompt, the PPO model generates answers and the reward model provides a score for each answer.
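The sketch below gives a highly simplified view of one RL update: sample an answer from the current policy, score it with the phase-2 reward model, and push the policy toward higher scores while a KL penalty keeps it close to the supervised model. It uses a REINFORCE-style surrogate rather than full PPO (no clipping, value function or advantage estimation), and policy, ref_policy and reward_model are assumed to be models like those from the previous phases.

```python
# Highly simplified RL step: reward the policy for answers the reward model likes,
# penalized by the KL divergence from the supervised (reference) model. Not full PPO.
import torch

def rl_step(policy, ref_policy, reward_model, tokenizer, prompt, optimizer, kl_coef=0.1):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample an answer from the current policy.
    generated = policy.generate(**inputs, do_sample=True, max_new_tokens=64)
    answer_ids = generated[:, inputs["input_ids"].shape[1]:]

    # Log-probabilities of the sampled answer under the current and the reference policy.
    logp = sequence_logprob(policy, generated, answer_ids)
    logp_ref = sequence_logprob(ref_policy, generated, answer_ids)

    # Scalar reward from the phase-2 model, minus a penalty for drifting from the SFT model.
    reward = reward_model(generated).detach()
    kl = logp - logp_ref.detach()
    advantage = (reward - kl_coef * kl).detach()
    loss = -(advantage * logp).mean()   # REINFORCE-style surrogate, not clipped PPO

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sequence_logprob(model, full_ids, answer_ids):
    """Sum of token log-probs the model assigns to the generated answer tokens."""
    logits = model(full_ids).logits[:, -answer_ids.shape[1] - 1:-1, :]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1).sum(-1)
```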
The architecture behind ChatGPT combines pretrained language models with very clever reinforcement learning and supervised fine-tuning processes to provide a level of action understanding we haven’t seen before in this type of architecture. One thing that ChatGPT showed us is that, when it comes to foundation models, the pretraining and fine-tuning process is as important as the architecture itself.