This blog post introduces our paper "Better Language Models for Code through Self-Improvement", published in the Findings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023).
1. Introduction
In this paper, we propose a simple augmentation technique to improve the performance of code language models on sequence generation tasks. In particular, after fine-tuning a pre-trained language model on a specific sequence generation task, we:
1. use the fine-tuned model to create an augmented version of the training data, and then
2. continue to fine-tune the model on this augmented dataset.
Our empirical evaluation shows that our framework improves state-of-the-art pre-trained language models of code (PLMCs), including CodeBERT, CodeT5, and UniXcoder, by significant margins.
2. Methodology
The overall training pipeline and the data augmentation process are depicted in the figures below.


Usually, models are pre-trained on large-scale corpora, resulting in a pre-trained checkpoint $\theta_{\text{pre-trained}}$. These pre-trained models are then fine-tuned on a specific downstream dataset $D$ using a supervised learning approach, resulting in a set of fine-tuned parameters $\theta_{\text{fine-tuned}}$. Our investigation revealed that model performance can be further improved if we continue to fine-tune these parameters on an augmented version of $D$.
As depicted in Figure 1, our proposal for self-improvement is the final step in the overall training flow. Specifically, we propose a data augmentation process and an extra fine-tuning step in addition to the pre-training and fine-tuning paradigm. The process of augmenting the dataset is illustrated in Figure 2.
For each training pair of sequences $(x_i, y_i)$ in the training set $D$, we first use beam search to generate a list of $k$-best predictions $\mathcal{L}_k$, where $k$ is the beam size.
We then evaluate the similarity between each prediction $\hat{y}_{ij}$ and its corresponding ground-truth sequence $y_i$ using a BLEU-based similarity function $sim$, and select the prediction with the highest similarity, $\tilde{y}_i = \arg\max_{\hat{y}_{ij} \in \mathcal{L}_k} sim(\hat{y}_{ij}, y_i)$. In the last step, we add the pair $(x_i, \tilde{y}_i)$ to a new, initially empty dataset $\tilde{D}$, which we call the augmented dataset or pseudo dataset interchangeably.
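To make this concrete, here is a minimal sketch of the augmentation step. It assumes a HuggingFace seq2seq checkpoint (Salesforce/codet5-base is used purely as a placeholder for the fine-tuned model) and uses sacrebleu's sentence-level BLEU as a stand-in for the similarity function $sim$; the exact model, decoding settings, and BLEU variant in the paper may differ.

```python
import sacrebleu
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint: in practice this would be the task-fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

def augment_dataset(dataset, k=10, max_new_tokens=128):
    """Build the pseudo dataset D~ from training pairs (x_i, y_i)."""
    pseudo_dataset = []
    for x, y in dataset:
        inputs = tokenizer(x, return_tensors="pt", truncation=True)
        # Beam search returning all k beams: the k-best list L_k.
        outputs = model.generate(
            **inputs,
            num_beams=k,
            num_return_sequences=k,
            max_new_tokens=max_new_tokens,
        )
        candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        # Pick the candidate closest to the ground truth under sentence-level BLEU.
        best = max(candidates, key=lambda c: sacrebleu.sentence_bleu(c, [y]).score)
        pseudo_dataset.append((x, best))
    return pseudo_dataset
```

Each training example is decoded once with beam search, so building $\tilde{D}$ amounts to a single decoding pass over $D$.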
The last step is to fine-tune $\theta_{\text{fine-tuned}}$ on $\tilde{D}$ until convergence. We call the resulting set of parameters $\theta_{\text{improved}}$.
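This extra fine-tuning stage is ordinary supervised training, just with the pseudo targets $\tilde{y}_i$ in place of the original ones. Below is a minimal sketch using a plain PyTorch loop; the optimizer, batching, learning-rate schedule, and stopping criterion are illustrative assumptions rather than the paper's exact setup.

```python
from torch.optim import AdamW

def continue_fine_tuning(model, tokenizer, pseudo_dataset, epochs=3, lr=5e-5):
    """Fine-tune theta_fine-tuned on D~ to obtain theta_improved."""
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y_tilde in pseudo_dataset:
            inputs = tokenizer(x, return_tensors="pt", truncation=True)
            labels = tokenizer(y_tilde, return_tensors="pt", truncation=True).input_ids
            # Standard cross-entropy loss against the pseudo target.
            loss = model(**inputs, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```

In practice the examples would be batched and the loop run until convergence on a validation set, as in any standard fine-tuning recipe.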
3. Intuition
This section explains the intuition behind our method, which can be viewed as a bootstrapping approach in which the model learns from its own predictions.
In auto-regressive generation, greedy decoding is a commonly used algorithm. However, searching with a small beam size often leads to sub-optimal, less probable sequences, whereas a larger beam size is more effective at maximizing likelihood. Based on these observations, we make two assumptions (a short decoding sketch follows the list below):
1. Higher probability indicates a more likely and preferred sequence.
2. Sequences found by greedy search and by large-beam-size search are reasonably close in the model's sequence space. By adjusting the model parameters, we can therefore guide its distribution towards the sequences generated with a large beam size (which have higher probability).
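As a concrete illustration of the second assumption, the sketch below decodes the same input greedily and with a beam of 10, then scores both outputs under the model via teacher forcing. The checkpoint, example input, and per-token log-probability scoring are illustrative assumptions, not the paper's setup.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint, purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

def avg_log_prob(source: str, target: str) -> float:
    """Average log-probability per token of `target` given `source` (teacher forcing)."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    with torch.no_grad():
        # The seq2seq loss is the mean negative log-likelihood per target token.
        loss = model(**inputs, labels=labels).loss
    return -loss.item()

def decode(source: str, num_beams: int) -> str:
    inputs = tokenizer(source, return_tensors="pt")
    output = model.generate(**inputs, num_beams=num_beams, max_new_tokens=32)
    return tokenizer.decode(output[0], skip_special_tokens=True)

source = "def add(a, b): return a + b"
greedy, beam = decode(source, num_beams=1), decode(source, num_beams=10)
# Beam search usually finds a sequence the model itself rates as more probable.
print(avg_log_prob(source, greedy), avg_log_prob(source, beam))
```

The beam-search output will typically score at least as well per token as the greedy one, which is the behaviour the two assumptions above rely on.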
Given these assumptions, a natural question arises: why not directly optimize the model on the probable sequences it generates?
However, there is a major issue with this approach. Since our model is an approximation of the true sequence distribution, there is no guarantee that sequence probability will perfectly align with our preferences.
To address this, we aim to optimize the model not only around the high probability region but also in the vicinity of our preferences. We achieve this by augmenting the training dataset and optimizing the model accordingly, as explained in the Methodology section.
The effectiveness of our method is demonstrated in the experimental results presented in the following section.
4. Results
The results for code summarization are presented in Table 1. On average, we observe a 0.76 BLEU improvement across all languages. The improvement is consistent across beam sizes (1, 5, and 10), confirming the effectiveness of our self-improvement approach across a wide range of PLMCs. Compared with other strong baselines, our method improves CodeBERT on JavaScript from 15.78 to 16.39 BLEU, surpassing PolyglotCodeBERT (15.80), which highlights the benefit of self-improvement for weaker models.

The results for code generation are presented in Table 2; performance increases by 0.81 BLEU on average. The improvement is also consistent under exact match (EM) and CodeBLEU.

Leading results on code summarization in the CodeXGLUE benchmark
In April 2022, our model achieved first place on the code summarization track of the CodeXGLUE benchmark, a widely recognized and comprehensive code-text evaluation suite established by Microsoft and broadly adopted by the research community.

5. Conclusion
This blog post gives a brief summary of our paper "Better Language Models for Code through Self-Improvement". The paper proposes a bootstrapping method and empirically demonstrates its effectiveness by significantly improving the performance of popular pre-trained code models. For a more in-depth treatment, we invite you to explore the full paper.