Generating code


Copilot

GitHub Copilot has been a hot topic, like other AI tools based on GPT-3. Some people love it, some loathe it and predict prophetic class-action lawsuits, and others are just too cheap to pay for it. Regardless of which group you belong to, this is not that type of article.

I will break down how Microsoft built GitHub Copilot (at a high level, of course) in a strictly technical way, so you get an idea of how it works and how to get the most out of it. All of these tools and techniques can be extremely useful for you in the future.

OpenAI Origins

OpenAI was founded in 2015 by Elon Musk and others, who pledged US$1 billion to the cause and focused on a non-profit approach to AGI. A couple of years later, the company became for-profit and Elon moved on; Microsoft came onto the scene and added another billion for good measure.

Microsoft business plan

Microsoft made a genius move. The company has been acquiring other major pieces of the market, like GitHub. On top of now holding a license for the OpenAI models, it had an enormous number of codebases to explore, public and private.

Microsoft tightly controls this part of the intellectual property: OpenAI services on Azure are only made available to major partners. This is how I think the pipeline works:

A flowchart of the major events that created the Copilot we know today

This is nothing special, but if you are thinking of investing in the NLP revolution, there are a few ideas you should not invest in, such as smart office plugins, code conversion, or other Microsoft-adjacent tools. Microsoft will always win if it chooses to compete: it controls the golden supply chain of data at this point. But the rest is up for grabs!

Exploring the OpenAI ecosystem

This topic requires covering, in some depth, the pieces that make the OpenAI models work. Let's look at them one by one:

Models

The models behind Copilot come from the Codex family, fine-tuned versions of the normal GPT-3 models. Code-davinci is the counterpart of Davinci, and code-cushman can be seen as the counterpart of the Curie model: it is faster but less capable feature-wise. Davinci is pretty expensive to run and slower, so Cushman is the right model for the Copilot job.
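For reference, this is roughly how those models were consumed through the API. A minimal sketch, assuming the legacy OpenAI Python SDK and the since-retired Codex model names; the exact API surface has changed over time:

```python
# Illustrative only: calling a code model through the legacy OpenAI
# Completions API. Model names and the SDK have changed since then.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="code-cushman-001",      # faster, smaller context window (2,048 tokens)
    # model="code-davinci-002",    # more capable, larger context window
    prompt="# Python function that checks whether a number is prime\ndef is_prime(n):",
    max_tokens=128,
    temperature=0,
)
print(response["choices"][0]["text"])
```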

Edit and Insert Features

These two features had been in the works for a while but only became publicly available in the first half of 2022, and they are the bedrock of GitHub Copilot; without them, you would not have the same product. Similar large models normally work in pure completion mode, and you cannot change anything in the middle of the text. With edit and insert, GPT-3 can change a method or even a piece of code while keeping the complete context of the codebase.

Insert
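Insert mode fills in code between what comes before and what comes after the cursor. A minimal sketch, assuming the legacy OpenAI Completions API, where this was exposed through the suffix parameter:

```python
# Illustrative only: "insert" mode via the suffix parameter of the legacy
# Completions API. The model generates only the code that goes in between.
import openai

prefix = "def celsius_to_fahrenheit(degrees_c):\n"
suffix = "\n    return degrees_f\n"

response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prefix,
    suffix=suffix,
    max_tokens=64,
    temperature=0,
)
print(prefix + response["choices"][0]["text"] + suffix)
```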

Edit
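Edit mode rewrites existing code according to a natural-language instruction instead of just completing it. A minimal sketch, assuming the legacy OpenAI Edits endpoint and the now-retired code-davinci-edit-001 model:

```python
# Illustrative only: "edit" mode rewrites the input according to an
# instruction, using the legacy Edits endpoint.
import openai

response = openai.Edit.create(
    model="code-davinci-edit-001",
    input="def add(a, b):\n    return a - b\n",
    instruction="Fix the bug so the function actually adds the two numbers.",
)
print(response["choices"][0]["text"])
```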

Embeddings

Embeddings are a vector representation of a given input that can easily be consumed by machine learning models and algorithms. In other words, analogy-wise, they act as the cache. They are widely used in semantic search engines nowadays, but here they can be used to provide context to Copilot.
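A minimal sketch of that idea, assuming the legacy OpenAI SDK and the text-embedding-ada-002 model: embed a few code snippets, embed a query, and retrieve the closest snippet to use as extra context:

```python
# Illustrative only: embeddings as a semantic "cache" for context retrieval.
import openai
import numpy as np

snippets = [
    "class BlobRepository: ...   # wraps Azure Blob Storage access",
    "class UserService: ...      # business logic for user accounts",
]

def embed(text):
    result = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(result["data"][0]["embedding"])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

snippet_vectors = [embed(s) for s in snippets]
query_vector = embed("upload a file to blob storage")

best = max(range(len(snippets)), key=lambda i: cosine(query_vector, snippet_vectors[i]))
print("Most relevant context:", snippets[best])
```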

GitHub Copilot Service

This is all speculation, of course, but I think it is close enough. OpenAI makes use of all the tools and features described above.

Inside their own stack, they have an MLOps pipeline that (a rough sketch in code follows the list):

  1. Gets all the code from GitHub, old and new, that compiles.

  2. Passes the code through some sort of code-quality tool, to filter out low-quality code.

  3. Maybe indexes the code by programming language and framework, in order to improve the quality of generation per language.

  4. Generates new models with the new data.
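A purely hypothetical sketch of that pipeline in Python; every helper function here is a made-up stand-in, not a real GitHub or OpenAI API:

```python
# Purely hypothetical: a toy version of the speculated training pipeline.

THRESHOLD = 0.5

def fetch_source(repo):       # stand-in for cloning / pulling code
    return repo["code"]

def compiles(code):           # stand-in for a build / syntax check
    return "syntax error" not in code

def quality_score(code):      # stand-in for a code-quality tool
    return 1.0 if "TODO" not in code else 0.0

def detect_language(repo):    # stand-in for language/framework indexing
    return repo.get("language", "unknown")

def build_training_set(repositories):
    dataset = []
    for repo in repositories:
        code = fetch_source(repo)               # 1. collect code, old and new
        if not compiles(code):                  #    keep only code that builds
            continue
        if quality_score(code) < THRESHOLD:     # 2. filter with a quality gate
            continue
        dataset.append({"language": detect_language(repo), "code": code})  # 3. index per language
    return dataset                              # 4. feed into training of new models

print(build_training_set([{"language": "python", "code": "print('hello')"}]))
```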

How to make the best out of Copilot

To get the best out of Copilot you need to give it as much context as possible. This applies to how the code is structured; for example, you should really use Copilot for isolated methods, like repository methods or methods with a well-defined name and behavior.

As you can see, if you keep the usings at the top (basically the import references in C#), it gets things right. There is a lot going on behind the scenes, and it was not always like this; they have been improving the service cleverly. One important thing is token length. The Cushman model is not as capable as Davinci, so it can only take 2,048 tokens of context instead of 4k. The workaround is to use embeddings, as mentioned above, but it is up to them when to use what; they are certainly not embedding your entire project.
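The same idea translated into Python terms (illustrative only, not an actual Copilot transcript): keep the imports at the top and give the method a descriptive name, signature, and docstring, and the suggestion has far less to guess:

```python
# Illustrative only: a well-scoped stub that leaves little to guess.
# The import and the descriptive signature play the role of the C# "usings".
import csv
from pathlib import Path

def load_users_from_csv(path: Path) -> list[dict]:
    """Read a CSV file with a header row and return one dict per user."""
    # With this much context, a completion along the following lines is the
    # kind of suggestion you can expect:
    with path.open(newline="") as f:
        return list(csv.DictReader(f))
```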

Having said that, let's keep going with the implementation:

Copilot can't figure out exactly what to use, since it doesn't have the project reference, but it is actually not far off: it is pretty close to 99% of Azure Storage implementations. Let's help it with some usings.

As you can see, it gets there quite easily. Another clever trick is to keep some of the more complex code (like classes) inside the same file so it gets more context for the insert; once it works, separate them out again.

Copilot multiple options

Copilot can also propose several alternative completions for the same prompt. This feature kind of already exists in the OpenAI playground, through the "best of" option.
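A minimal sketch of that behaviour, assuming the legacy OpenAI Completions API, where n controls how many candidates come back and best_of lets the server sample more and keep the highest-scoring ones:

```python
# Illustrative only: asking the legacy Completions API for several candidates,
# similar to Copilot's list of alternative suggestions.
import openai

response = openai.Completion.create(
    model="code-cushman-001",
    prompt="# Python function that reverses a string\ndef reverse_string(s):",
    max_tokens=64,
    n=3,           # return three alternatives
    best_of=5,     # sample five server-side, keep the best three
    temperature=0.7,
)
for i, choice in enumerate(response["choices"]):
    print(f"--- suggestion {i + 1} ---")
    print(choice["text"])
```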
