Generating code
GitHub Copilot has been a hot topic, like other AI tools based on GPT-3. Some people love it, some loathe it with prophetic class-action lawsuit predictions, and others are just too cheap to pay for it. Regardless of who you are, this is not that type of article.
I will break down how Microsoft built GitHub Copilot (at a high level, of course) in a strictly technical way, so you get an idea of how it works and how to get the most out of it. All these tools and techniques can be extremely useful to you in the future.
OpenAI was founded in 2015, with Elon Musk among its co-founders. They pledged US$1 billion to the cause and focused on a non-profit approach to AGI. A couple of years later, the company became for-profit and Musk moved on; Microsoft came onto the scene and added another billion for good measure.
Microsoft made a genius move. The company had already been acquiring major pieces of the market, like GitHub. On top of now controlling a license for the OpenAI models, it had an unlimited amount of codebases to explore, public and private.
Microsoft tightly controls this part of the intellectual property: any OpenAI service on Azure is only offered to major partners. This is how I think the pipeline works:
This is nothing special, but if you are thinking of investing in the NLP revolution, there are a few ideas you should not invest in, such as smart office plugins, code converters, or other Microsoft-adjacent tools. Microsoft will always win if it chooses to compete; it controls the golden supply chain of data at this point. But the rest is up for grabs!
This topic requires in-depth coverage of the pieces that make the OpenAI models work. Let's look at them one by one:
Codex is a fine-tuned version of the normal models. Code-davinci is the counterpart of Davinci, and code-cushman can be seen as the counterpart of the Curie model: it is faster but less capable feature-wise. Davinci is pretty expensive to run and slower, so Cushman is the right model for the job. You can see this in action on the OpenAI Codex page.
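As a rough sketch of that speed-versus-capability trade-off, here is how a client might pick between the two models. The helper function and the parameter choices are my own assumptions; only the model names and request fields follow OpenAI's completion API naming at the time.

```python
def build_completion_request(prompt, fast=True):
    """Build a hypothetical Codex completion payload.

    Cushman is picked for low-latency autocomplete; Davinci when
    capability matters more than speed.
    """
    return {
        "model": "code-cushman-001" if fast else "code-davinci-002",
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0,  # deterministic output suits code completion
    }

# An editor plugin would favor the fast model for every keystroke.
request = build_completion_request("def add(a, b):", fast=True)
```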
These two have been around for a while but only became generally available in the first half of 2022, and they are the bedrock of GitHub Copilot; without them, you would not have the same product. All the similar large models work in completion mode, and you cannot make changes in the middle of existing text. Now, with GPT-3, you can change a method or even a piece of code while keeping the complete context of the codebase. You can find the documentation here.
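Insert mode is what makes those mid-file changes possible: the request carries both the text before and the text after the cursor, so the model fills the gap with context from both sides. A minimal sketch of such a payload, assuming the completion API's `prompt`/`suffix` fields (the helper itself is hypothetical):

```python
def build_insert_request(prefix, suffix):
    # Everything before the cursor goes into `prompt`, everything
    # after it into `suffix`; the model generates the middle part.
    return {
        "model": "code-davinci-002",
        "prompt": prefix,
        "suffix": suffix,
        "max_tokens": 128,
        "temperature": 0,
    }

# Ask the model to fill in the body of a C# method.
payload = build_insert_request(
    prefix="public int Sum(int a, int b)\n{\n",
    suffix="\n}",
)
```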
Embeddings are a vector representation of a given input that can be easily consumed by machine learning models and algorithms. In other words, analogy-wise, it's the cache. They are widely used for semantic search engines nowadays, but here they can be used to provide context for Copilot.
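To make the semantic-search use concrete, here is a toy example with made-up three-dimensional vectors. Real embeddings from OpenAI's embeddings endpoint have hundreds of dimensions, and the snippets and numbers below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy stand-ins for embedded code snippets (invented vectors).
snippets = {
    "read blob from azure storage": [0.9, 0.1, 0.2],
    "parse json configuration":     [0.1, 0.8, 0.3],
}

# An embedded query vector close to the first snippet.
query = [0.85, 0.15, 0.25]
best = max(snippets, key=lambda k: cosine_similarity(query, snippets[k]))
print(best)  # -> read blob from azure storage
```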
This is all speculation, of course, but I think it's close enough. OpenAI makes use of all the tools and features described above.
Inside their own stack, they have an MLOps pipeline that:
Gets all the code from GitHub, old and new, that compiles.
Passes the code through some sort of code-quality tool, to guarantee quality.
Maybe indexes it by programming language and framework, in order to improve the quality of generation per language.
Generates new models with the new data.
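The steps above can be sketched as a pipeline. Every function here is a placeholder of my own invention, not a real Microsoft or OpenAI API; the real checks would build the code and run actual quality tooling.

```python
def compiles(snippet):
    # Stand-in check: the real pipeline would actually build the code.
    return "syntax error" not in snippet

def passes_quality_gate(snippet):
    # Stand-in for a code-quality tool.
    return len(snippet.strip()) > 0

def detect_language(snippet):
    # Naive language detection for the sketch.
    return "csharp" if "using System" in snippet else "unknown"

def prepare_training_data(repos):
    """Filter snippets and bucket them by language for fine-tuning."""
    dataset = {}
    for snippet in repos:
        if compiles(snippet) and passes_quality_gate(snippet):
            dataset.setdefault(detect_language(snippet), []).append(snippet)
    return dataset  # final step: fine-tune a new model per bucket

data = prepare_training_data(["using System;\nclass A {}", "syntax error here"])
```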
To get the best out of Copilot, you need to give it as much context as possible. This applies to how the code is structured; for example, you should really use Copilot for isolated methods, like repository methods or methods with a well-defined name and behavior.
As you can see, if you keep the usings at the top (basically the import references in C#), it gets things right. There is a lot going on here. It was not always like this; they have been evolving the service cleverly. One important thing is token length. The Cushman model is not as capable as Davinci, so it can only take 2,048 tokens of context instead of 4k. The workaround is to use embeddings, as mentioned above, but when to use which is up to them. They are not embedding your entire project, that's for sure.
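A crude illustration of that token-length constraint: whatever builds the prompt has to trim the surrounding code to fit the model's window. Everything here is an assumption on my part, and the whitespace-split token count is a deliberate simplification (real services use a BPE tokenizer such as tiktoken).

```python
def rough_token_count(text):
    # Crude approximation; real tokenizers are BPE-based, not word splits.
    return len(text.split())

def trim_context(lines, budget=2048):
    """Keep the most recent lines of code that fit the token window."""
    kept, used = [], 0
    for line in reversed(lines):  # code nearest the cursor matters most
        cost = rough_token_count(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return list(reversed(kept))  # restore original file order
```

With a tiny budget, only the lines closest to the cursor survive: `trim_context(["a b", "c d e", "f"], budget=4)` keeps the last two lines and drops the first.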
Having said that, let's keep going with the implementation:
It can't figure out what to use since it doesn't have the project reference, but it is actually not far off, since it is pretty close to 99% of Azure Storage implementations. Let's help it with some usings.
As you can see, it gets there quite easily. Another clever trick is to keep more complex pieces of the codebase (like classes) inside the same file so Copilot gets more context for the insert; once it works, separate them.
Copilot multiple options
This feature kind of exists already in the OpenAI playground: