We introduce DocChecker, a Python package trained to automatically detect inconsistencies between code and docstrings and to generate comprehensive summary sentences to replace outdated ones.
What kinds of problems can the DocChecker tool solve?
Artificial Intelligence for Software Engineering (AI4SE) has seen significant growth in recent years with the development of machine learning (ML) models that can learn from large and diverse datasets. The datasets used to train and evaluate these models are often collected from open-source projects and contain both source code and natural language text, such as code comments. However, quality issues are common in these datasets and can significantly degrade model performance, leading to incorrect or inaccurate predictions. One important quality issue is inconsistency between source code and comments, which arises when the code has been changed but the comments have not been updated to reflect it, or when the comments never accurately described the code in the first place. As such, there is a need to detect, and potentially fix, inconsistencies between code and its corresponding comments in large-scale AI4SE datasets.
What is DocChecker?
DocChecker is a pre-trained language model of code designed specifically for the Inconsistency Code-Comment Detection (ICCD) task. It is jointly pre-trained with three objectives: code-text contrastive learning, binary classification, and text generation. To the best of our knowledge, DocChecker is the first tool in AI4SE that can both filter out inconsistent code-comment pairs and generate new, meaningful summary sentences to replace the original ones.
An Overview of the DocChecker Architecture
The figure above gives an overview of our architecture, which is built on top of an encoder-decoder model to learn from code-text pairs. We pre-train DocChecker with three objectives: code-text contrastive learning, binary classification, and a generation-based objective. The understanding-based objectives help the model learn the information and semantics of the input and then predict the inconsistency, while the generation-based objective is optimized to convert the code information into coherent comments.
Code-Text Contrastive Learning (CTC)
This objective aligns the feature spaces of the code encoder and the text encoder by encouraging positive code-text pairs to have similar representations, as opposed to negative pairs. We create negative samples by choosing hard negative code-text pairs based on contrastive similarity: for each code function in a mini-batch, one negative text is selected from the same batch, with texts that are more similar to the code function having a larger likelihood of selection. Similarly, for each text, we sample one hard negative code function.
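The hard-negative selection described above can be sketched as follows. This is an illustrative implementation, not the package's internal code: given a mini-batch similarity matrix between code functions (rows) and texts (columns), each code function samples one negative text with probability proportional to similarity, excluding its matched pair on the diagonal.

```python
import random

def sample_hard_negatives(sim, seed=0):
    """For each row (code function) of a batch similarity matrix `sim`,
    sample one negative column (text) with probability proportional to
    similarity, skipping the matched pair on the diagonal."""
    rng = random.Random(seed)
    negatives = []
    for i, row in enumerate(sim):
        candidates = [j for j in range(len(row)) if j != i]
        # Texts more similar to the code function are likelier to be picked.
        weights = [max(row[j], 0.0) for j in candidates]
        negatives.append(rng.choices(candidates, weights=weights, k=1)[0])
    return negatives

# Toy 3x3 batch: row i / column i is the matched code-text pair.
sim = [
    [0.9, 0.7, 0.1],
    [0.2, 0.8, 0.6],
    [0.5, 0.3, 0.9],
]
print(sample_hard_negatives(sim))  # one hard-negative text index per code function
```

The same procedure, applied to the transposed matrix, would pick one hard negative code function per text.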
Binary Classification (BC)
This objective can capture the alignment between code and text. Given a batch of matched or mismatched code-text pairs, the model needs to identify which code functions and comments correspond to each other. We use the average of the text encoder’s output as the joint representation of the code-text pair and append a fully-connected layer followed by softmax to predict a two-class probability. Using this objective, we can detect if the code-text pair is inconsistent (mismatch) or consistent (match).
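The classification head can be sketched in a few lines; the hidden size and weights below are toy values for illustration, not the model's actual parameters. We mean-pool the text encoder's token outputs into one joint representation, apply a fully-connected layer, and take a softmax over the two classes.

```python
import math

def mean_pool(token_vectors):
    """Average the encoder's token outputs into a single joint representation."""
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / len(token_vectors) for d in range(dim)]

def classify(token_vectors, W, b):
    """Fully-connected layer followed by softmax -> two-class probability."""
    pooled = mean_pool(token_vectors)
    logits = [sum(w * x for w, x in zip(row, pooled)) + bi
              for row, bi in zip(W, b)]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: three token vectors of size 4, fixed illustrative weights.
tokens = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [0.2, 0.2, 0.2, 0.2]]
W = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.2, 0.0, 0.1]]  # 2 classes x 4 dims
b = [0.0, 0.1]
probs = classify(tokens, W, b)
print(probs)  # [P(consistent), P(inconsistent)]
```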
Text Generation (TG)
Text generation aims to generate a summary sentence given the code snippet. It optimizes a cross-entropy loss, which trains DocChecker to maximize the likelihood of the text in an autoregressive manner. The text decoder replaces the bi-directional self-attention layers with causal self-attention layers so that each text token can attend to its previous text tokens and the queries from the code encoder.
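The causal constraint on the text decoder can be illustrated by the attention mask it implies; this toy sketch is not the model's code, only a picture of which positions may attend to which.

```python
def causal_mask(n):
    """Lower-triangular mask: entry [i][j] is True iff text token i may
    attend to text token j, i.e. j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Each token sees only itself and earlier tokens, which is what lets the decoder be trained with a standard autoregressive cross-entropy loss; attention to the code encoder's outputs is unrestricted.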
To exploit the benefits of multi-task learning, we share all weights between the text encoder and the text decoder, so the text representation is learned with rich contextual and general semantics. At the same time, we use separate fully-connected layers for each task to capture task-specific differences and reduce interference between the tasks.
Performance Evaluation
Evaluation results
We selected existing works to compare against DocChecker on the ICCD task. We use the Just-In-Time dataset, whose purpose is to determine whether a comment is semantically out of sync with its corresponding code snippet.
We find that our method significantly outperforms all of the baselines. Although CodeBERT and CodeT5 are pre-trained models with more parameters that have proven effective on many downstream tasks, their performance is still lower than ours. The results suggest that DocChecker benefits from combining a pre-trained deep learning model with the three objectives above, and they support the claim that our method is effective at detecting inconsistent samples in code corpora.
Practical Application
Besides reporting the performance of DocChecker on the Just-In-Time dataset, we also demonstrate its effectiveness in real-world cases. We consider the popular CodeSearchNet dataset, which extracts functions and their paired comments from GitHub repositories. Although this benchmark dataset is expected to be of good quality, noise is inevitable due to the differences in coding conventions and assumptions employed across modern programming languages and IDEs. Using DocChecker, we can filter out the inconsistent code-comment samples and generate a new, comprehensive summary sentence for each of them.
The figures above show examples of inconsistent samples that DocChecker detected in the CodeSearchNet dataset: the comments are not aligned with the corresponding code snippets and need to be updated. Beyond detection, DocChecker also provides a comprehensive summary sentence for each sample to replace the old one.
How to use the DocChecker tool?
Since DocChecker is a Python package, you can easily use its inference function.
```shell
pip install docchecker
```

```python
from DocChecker.utils import inference
```
Parameters:
- `input_file_path` (str): path to a file containing source code, if you want to check all the functions in it.
- `raw_code` (str): a source code string, used when `input_file_path` is not given.
- `language` (str, required): the programming language. We support 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP.
- `output_file_path` (str): if `input_file_path` is given, the results from our tool will be written to `output_file_path`; otherwise, they will be printed on the screen.
Returns:
Below is an example of this tool:
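The following is a hypothetical invocation of the `inference` function, assuming its signature matches the parameters documented above; the sample function and its mismatched docstring are invented for illustration, and the import is guarded so the sketch degrades gracefully if the package is not installed.

```python
# Hypothetical usage sketch; the inference call assumes the signature
# matches the parameters documented above.
try:
    from DocChecker.utils import inference
except ImportError:  # package not installed in this environment
    inference = None

# A made-up function whose docstring does not match its behavior.
code_snippet = '''
def add(a, b):
    """Subtract b from a."""
    return a + b
'''

if inference is not None:
    # No output_file_path is given, so results are printed on the screen.
    inference(raw_code=code_snippet, language="python")
```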
We also provide a UI for better user interaction here.

Explore More

We invite you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our mailing list to get regular updates on this and other research projects.
Paper: Learn more about DocChecker by reading our research paper.
GitHub: Check out our code and toolkit here.
Feedback/Questions: Contact us at [email protected].
Acknowledgments
This work is partly funded by FPT Software AI Center. More information can be found at
https://www.fpt-aicenter.com/about/

References
5. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 1536–1547.