Towards models that can describe code

We've built a language model called Gemini that can read code and describe its functionality in natural language. Our model handles a variety of programming languages and is multilingual. In this blog, we detail its capabilities, as well as its potential impacts.

January 2, 2022


We've built a language model called Gemini that was trained on roughly 1 million code-description pairs. The goal of this project was simple: can we enable machines to describe code as effectively as a human could? While this is a tall task, recent work in language models has made rapid advancements, making previously difficult tasks much more plausible.

However, building language models for computer code comes with more challenges than traditional language models. First, reliable data is scarce: publicly available datasets typically contain medium-quality code samples but fall short on quality descriptions. This can skew a model's performance early on in ways that are hard to repair later in the development cycle. Second, achieving high performance on any language-modeling task has become increasingly expensive. It's well documented that the more trainable parameters a language model has, the better it performs; however, this strategy comes with heavier monetary and compute costs.

Another major challenge when building a language model is the reliability of the system overall. Recent work from companies like OpenAI, Google and Facebook has shown that language models, particularly in the domain of language generation, can produce unexpected results. Controlling generated outputs is still largely an open research problem.

With all of these challenges in mind, the questions we needed to answer to build a successful system for describing code became clear:

1. Can we curate enough data samples that contain quality code descriptions with the original code?
2. How can we build a smaller language model that maximizes consistency and fidelity at the same time?

Building a model for describing code


Model
To address the questions above, Gemini uses the popular transformer architecture. We found that training with other architecture types (e.g., RNNs or GRUs) did not yield the same quality that a transformer does.

First, a major focus of building a system for describing code was adding consistency to generated outputs. Gemini uses an encoder-decoder structure, as opposed to a decoder-only structure like GPT-3's. We found that an encoder-decoder structure dramatically improved the consistency of the generated descriptions compared to using only decoders.

Gemini comes in 3 different model sizes: the main model with 600 million parameters, a smaller version "Lambda" with 230 million parameters, and "Plato" at 64 million parameters. This may come as a surprise to some readers given the recent scale of projects like GPT-3, which uses up to 175 billion parameters; however, we found that more parameters only yielded better performance when coupled with fine-tuning.
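To give a feel for how parameter counts at these scales come about, here is a rough back-of-the-envelope estimator for a standard encoder-decoder transformer. Gemini's actual hyperparameters are not published here, so the configuration values in the usage comment are purely illustrative.

```python
def transformer_params(d_model, n_layers, vocab_size, d_ff=None):
    """Rough parameter count for a standard encoder-decoder transformer.

    Counts the attention and feed-forward weight matrices per layer plus
    the embedding table; biases and layer norms are ignored, so this is
    only an order-of-magnitude estimate.
    """
    d_ff = d_ff or 4 * d_model  # common convention: feed-forward is 4x wider
    # Self-attention: Q, K, V and output projections (4 square matrices).
    attn = 4 * d_model * d_model
    # Feed-forward block: up-projection and down-projection.
    ff = 2 * d_model * d_ff
    encoder = n_layers * (attn + ff)
    # Decoder layers also carry cross-attention over the encoder output.
    decoder = n_layers * (2 * attn + ff)
    embeddings = vocab_size * d_model
    return encoder + decoder + embeddings

# Illustrative only: a 512-wide, 6-layer model with a 32K vocabulary
# lands at roughly 60M parameters, in the neighborhood of Plato's 64M.
print(transformer_params(512, 6, 32000))
```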


Dataset
Second, finding quality data was a major challenge. Our goal with Gemini is to provide natural summaries of code; since there are no publicly available datasets of code samples with natural corresponding descriptions, we had to synthetically generate samples. We started by pre-training Gemini on 885K publicly available code-docstring pairs for 7 epochs. This gave Gemini a good base for understanding the semantics of code when translating to English.
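Code-docstring pairs like these can be mined directly from public source code. As a sketch of the idea (our actual pipeline and its filtering rules are more involved and not detailed here), here is how pairs can be extracted from Python source with the standard library:

```python
import ast

def docstring_pairs(source):
    """Extract (function_source, docstring) pairs from Python source.

    A minimal sketch of mining code-docstring training pairs; real
    pipelines add deduplication and quality filtering on top of this.
    Requires Python 3.9+ for ast.unparse.
    """
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                # Keep the full function body alongside its docstring.
                pairs.append((ast.unparse(node), doc))
    return pairs
```

Functions without a docstring are simply skipped, which is one reason quality descriptions are scarcer than quality code.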

To achieve naturalness, we manually prepared 250 handwritten descriptions with their corresponding code snippets. These new data points were used to fine-tune a new version of Gemini for 2 epochs, which gave it the ability to write naturally. Our intention at this point was to let Gemini re-annotate our original dataset; however, it could only write about 3 in 10 descriptions at a satisfactory level.

With human assistance, Gemini was able to synthetically generate 700 descriptions, allowing us to fine-tune the model again. This time, we found that Gemini could describe code much more clearly. We repeated this cycle until Gemini had produced 10,000 samples for Python, Javascript, Go, PHP, Java and Ruby.
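The annotate-review-retrain cycle described above can be sketched as a simple loop. The `model`, `fine_tune` and `is_acceptable` callables below are stand-ins: in our pipeline the acceptance step was human review, and fine-tuning was a full training run rather than a function call.

```python
def bootstrap_annotations(model, fine_tune, is_acceptable, code_samples, rounds=3):
    """Iteratively grow a synthetic dataset of code descriptions.

    `model` maps code -> description, `fine_tune` returns an updated
    model trained on the dataset so far, and `is_acceptable` stands in
    for the human review step. Each round, accepted annotations are
    added to the dataset and the model is retrained on it.
    """
    dataset = []
    for _ in range(rounds):
        candidates = [(code, model(code)) for code in code_samples]
        accepted = [(c, d) for c, d in candidates if is_acceptable(c, d)]
        dataset.extend(accepted)
        model = fine_tune(model, dataset)
    return model, dataset
```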

[Figure: our synthetic data generation strategy]

The figure above is a visual example of our synthetic data generation strategy, which is essential to reaching Gemini's performance. Quality synthetic data generation is also important for maintaining the security of users' code: because we have a robust data generation pipeline, using users' code in our models is not necessary.

We found models only needed to be fine-tuned for a maximum of 3 epochs, regardless of model size. In the future, we hope to require only pre-training or fine-tuning for new tasks.  

Samples

The samples displayed below are taken from our initial internal testing of all three of our models. While these samples are handpicked by us, we think they accurately represent our models' abilities. To improve performance, we utilize a "best of" search method when generating outputs. We find that all models perform best when picking from the top 3 or 4 candidates. Additionally, we found that a repetition penalty of 2.5 to 3 yielded the best results when maximizing coherency.

Note: n denotes the "best of" number (e.g., take the best solution out of 3 generated outputs) and p denotes the repetition penalty.
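In spirit, "best of" decoding samples n candidate descriptions and keeps the highest-scoring one. A minimal sketch, with stand-in `generate` and `score` functions (in practice `score` would be the model's own quality estimate, such as mean log-probability, and the repetition penalty is applied inside decoding rather than passed around):

```python
def best_of(generate, score, n=3, repetition_penalty=3.0):
    """Generate n candidate outputs and keep the best-scoring one.

    `generate` samples one candidate given a repetition penalty;
    `score` ranks candidates. The penalty is threaded through only to
    show the two knobs (n and p) that we tune together.
    """
    candidates = [generate(repetition_penalty) for _ in range(n)]
    return max(candidates, key=score)
```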

Gemini (602 Million parameters)
n = 3  p = 3.0

Input code

import requests
import json

def post_url(endpoint, json_data):
    r = requests.post(endpoint, data=json.dumps(json_data))
    return r.text

Generated description:

This program sends a POST to a given endpoint and returns the JSON server response.

Experiments

While building Gemini, we had many questions on what would affect performance the most. We consider a high-performing model as one that can write coherent, sensible outputs that cover the scope of the code. Prior work in this domain has shown that larger language models (in terms of parameters) significantly improve performance, so we trained a variety of model sizes.

Our biggest questions were the following:

 

  • How would model size affect generated outputs?

  • Does pre-training + fine-tuning outperform strict pre-training?

  • How well does each model generalize if only pre-trained and fine-tuned on one programming language? (i.e., can a model trained only on Python generalize to Javascript, Java, etc.?)

  • Can we reduce the number of training epochs needed?

Findings

In our experiments, we found that purely increasing the number of trainable parameters does not yield a significant performance boost on its own. Instead, we find that pre-training plus fine-tuning yields the best performance, which has been well documented in prior work.

To our surprise, we found that fine-tuning on only one programming language is competitive with fine-tuning on multiple programming languages. Our theory is that Gemini pays attention to the semantics of code (i.e., variable names, function names, library names, etc.) rather than relying on pure memorization. This leads us to believe that a larger model pre-trained and fine-tuned on a wide scope of one programming language might surpass the current version of Gemini substantially.

Larger models had an advantage in the number of training epochs needed. Generally speaking, Gemini only needed about 5 pre-training epochs and 1 to 2 fine-tuning epochs to achieve quality outputs. 

Input code

import * as tf from '@tensorflow/tfjs';
import http from 'http';

export async function loadModel() {
  const model = await tf.loadModel('/model/model.json');
  return model;
}

export async function startServer() {
  const model = await loadModel();
  const server = http.createServer(async (req, res) => {
    if (req.method === 'POST') {
      // Read the raw request body before parsing it as JSON
      let raw = '';
      for await (const chunk of req) raw += chunk;
      const body = JSON.parse(raw);
      const input = tf.tensor2d(body.input, [1, body.input.length]);
      const output = model.predict(input);
      res.end(JSON.stringify(Array.from(output.dataSync())));
    }
  });
  server.listen(8080);
}

Gemini (602 Million parameters, pre-trained + fine-tuned)

This program is used to create a server for the model. It first loads the model and then it starts a listener on the server.

Gemini (602 Million parameters, pre-training without fine-tuning)

Loads a TensorFlow model and starts server for requests.

Gemini (1.4 Billion parameters, pre-training without fine-tuning)

This program does the following: it first loads the model. Next, it starts a server listener. Last, it starts a server for requests.

Gemini (662 Million parameters, pre-training on multiple languages, fine-tuning on only Python)

This function is used to start a server for the model. It first loads the model and then it starts the server.

Areas for improvement

During our experimentation, we found some common shortcomings of Gemini and its smaller versions. Most of these shortcomings follow prior work in the language modeling space, which was to some extent expected. Some of the shortcomings are the following:
 

Outputs are occasionally too concise
When posed with longer inputs, Gemini will summarize the description in 7 words or fewer. Generally, this wouldn't be an issue, but we find these descriptions lack clarity or do not cover the full scope of the input.

Descriptions will have high repeatability in certain scenarios 

A common issue in language modeling is repeated words or sentences. We find Gemini will engage in this behavior when posed with mixed context (i.e., inputs that have different types of code mixed together).

Gemini is computationally expensive to train

Pre-training Gemini for 5 to 7 epochs requires a full day on an Nvidia A100 GPU. This can be very time-consuming and expensive for a startup to do consistently.

Garbage in still equals garbage out

While this isn't surprising behavior, Gemini cannot describe code well if it's poorly written. We hope we can bridge this gap with new methods or more data down the road.

Gemini's context window is a limitation

In its current version, Gemini can only accept 512 tokens as input. This can be limiting in scenarios like summarizing full-length programs.
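One common workaround for a fixed context window (not a shipped Gemini feature, just a sketch of the idea) is to split a long token sequence into overlapping windows, describe each window separately, and stitch the partial descriptions together. The window and overlap sizes below are illustrative:

```python
def chunk_tokens(tokens, window=512, overlap=64):
    """Split a long token sequence into overlapping fixed-size windows.

    Each window is at most `window` tokens; consecutive windows share
    `overlap` tokens so no description loses context at a boundary.
    """
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the final window already covers the tail
    return chunks
```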

What's next for Gemini?

We believe Gemini could be the start of a highly impactful technology, potentially a major shift in how software is developed. Software developers generally enjoy writing code and building new technologies; however, many parts of software development, like documentation, review and bug fixing, detract from writing code. A system like Gemini could be used in a variety of ways to automate the less enjoyable parts of writing software and free up developers' time for the work they value most.

We are incredibly excited to see where Gemini could be used. In the near future, we foresee Gemini being used in the following ways:

 

  • Faster onboarding for software teams of any size (i.e. increasing the speed of codebase understanding)

  • Automation of internal software practices (ex. code review, bug-fixing, documentation, etc.)

  • Breaking language barriers for multilingual/remote teams

  • Educational institutions (Increasing the speed that students can understand code)

  • Open Source projects (removing ambiguity from functionality)

 

However, our vision for Gemini is much bigger than a simple explanation of code. In the future, we want to expand Gemini's capabilities to:

 

  • Write full-length documents, such as README's or reports about code functionality

  • Question answering about code

  • Creating different levels of explanation (ex. a code description for an experienced developer might be different from one for a high-school student)

  • Writing examples of how code could be used in practice (i.e: generating function call examples)

From these examples, one can begin to imagine the wide scope of a general code-in, text-out system like Gemini.

Release philosophy

Ultimately, our goal is to distribute the power of strong Machine Learning systems to as many individuals, groups and organizations as possible. Because systems like Gemini are currently expensive to develop, we are releasing it as a paid product. We will allow anyone to access Gemini (and other versions) through:

  • A fully managed, no-code product

  • Developer API

  • Product licensing

  • Partnerships


All of these will be used to fund the next generation of our engines. In the years to come, we plan to make various versions of Gemini available via open source, as we strongly believe in allowing anyone to use our technology. However, timelines are not guaranteed.