March 04, 2023
Tags: 2023 · codelms · analysis · benchmarking · lm
Code language models (Code LMs) are used day in, day out by software developers, through code completion and chat interfaces, to write and contribute to code bases. Programming languages are formal in nature, with a fixed, strict set of rules. This motivates quantifying and measuring how diverse these models' generations are in various settings and, more importantly, asking what diversity even means in the context of code.
This post aims to explore and understand the diversity of generations from Code LMs, and to identify shortcomings in how that diversity is measured in a controlled setting. I use the HumanEval benchmark, which evaluates function-level code completion for functional correctness: a problem counts as solved when the given test cases pass within the first k out of n samples. Broadly, the post looks at how to evaluate diversity in the context of code and where current methods of measuring it fall short.
The notions of syntax and semantics in programming languages differ slightly from those in natural language. In programming languages, syntax refers to the structure of the code and the rules that govern that structure. Semantics refers to the intention of a given piece of code. The same intention can be expressed with varied syntax, giving code snippets that are semantically similar and functionally equivalent.
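As a small illustration of semantically similar snippets written with different syntax, here is a sketch in Python. The three function names are hypothetical and invented for this example; they are not taken from any benchmark.

```python
# Three hypothetical functions, invented for illustration only.
# Each one computes the sum of squares of a list of numbers,
# but the syntax used to express that intention differs.

def sum_squares_loop(nums):
    # Explicit loop with an accumulator.
    total = 0
    for n in nums:
        total += n * n
    return total


def sum_squares_comprehension(nums):
    # Generator expression passed to the built-in sum.
    return sum(n * n for n in nums)


def sum_squares_map(nums):
    # Functional style with map and a lambda.
    return sum(map(lambda n: n * n, nums))


sample = [1, 2, 3]
variants = (sum_squares_loop, sum_squares_comprehension, sum_squares_map)
print([f(sample) for f in variants])  # [14, 14, 14]
```

All three return the same value for the same input, so they are functionally equivalent even though their token sequences have little in common.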
The formal structure of code allows for a wide range of syntactic variations. The diversity of a Code LM can thus be defined in terms of generations that are functionally equivalent and semantically similar while differing in syntax. While the room for diversity may be small in short snippets of code, it is likely larger in more abstract, real-world software.
The HumanEval benchmark, introduced alongside Codex, is widely used to measure the performance of Code LMs. It evaluates function-level code completion for functional correctness: a problem counts as solved when the given test cases pass within the first k out of n samples.
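The "first k out of n samples" metric is usually reported as pass@k. Below is a minimal sketch of the unbiased estimator commonly used with HumanEval, 1 - C(n - c, k) / C(n, k), where c is the number of samples that pass the tests. The function name pass_at_k and its argument checks are my own choices for illustration, not the benchmark's reference code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total number of samples generated for the problem
    c: number of those samples that pass all test cases
    k: budget of samples considered
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    # Probability that a random draw of k samples contains no passing one,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 of which pass: chance that a single draw passes.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```

The benchmark score is then the mean of this value over all problems. Note that pass@k only rewards correctness; two models with the same pass@k can differ a lot in how varied their correct samples are.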
A coherent way to quantify diversity is to look at the latent representations of different completions for a given prompt. This can be done by embedding each completion with a good embedding model and running a cluster analysis over the resulting representations. While this method has its shortcomings, it is a good starting point. The ideal end goal, however, is to measure how semantically diverse yet functionally equivalent the generations are for a single prompt.
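A rough sketch of what such an analysis could look like, assuming the completions have already been embedded. The toy 2-D vectors, the 0.2 threshold, and the function names are all assumptions made for illustration; a real setup would use vectors from an actual embedding model and most likely a proper clustering algorithm.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all pairs; higher means more spread."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def greedy_cluster_count(embeddings, threshold=0.2):
    """Count clusters with a simple greedy rule.

    An embedding joins an existing cluster if it lies within `threshold`
    of that cluster's first member; otherwise it starts a new cluster.
    """
    representatives = []
    for e in embeddings:
        if not any(cosine_distance(e, r) <= threshold for r in representatives):
            representatives.append(e)
    return len(representatives)


# Toy vectors standing in for embeddings of three completions.
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(greedy_cluster_count(toy_embeddings))  # 2
print(mean_pairwise_distance(toy_embeddings))
```

Both numbers describe spread in the embedding space only. They say nothing about whether the completions are functionally equivalent, which is exactly the shortcoming discussed next.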
In natural language, the semantics of a sentence can largely be understood from the words and their order. In programming languages, the semantics of a code snippet is not just the syntactic elements and their order: it also depends on the structure of the code, the state used, the functions called, the libraries imported, and so on. This makes it difficult to model the semantics of a code snippet from lexical tokens and their order alone, and it leaves the embedding model heavily biased toward syntax rather than actual semantic similarity. For instance, consider three functions:
def func_a(l):
    return max(l)

def func_b(l):
    max_element = l[0]
    for i in l:
        if i > max_element:
            max_element = i
    return max_element

def func_c(l):
    max_element = l[0]
    return max_element
func_a and func_b are semantically similar: both find the maximum element of a list. func_c is buggy (it simply returns the first element), yet its variable naming tends to make the model map its latent representation close to that of func_b. The actual functional equivalent of func_a is func_b, not func_c. The model tends to be heavily biased toward syntax rather than the actual semantics of the code.
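The mismatch can be sketched with the standard library alone. Here I use token overlap from difflib as a crude stand-in for an embedding model; that substitution is an assumption on my part, and a real embedding model may score these pairs differently. The probe inputs are also hypothetical, and agreeing on them is evidence of equivalence, not a proof.

```python
import re
from difflib import SequenceMatcher

# Source text of the three functions discussed above.
SOURCES = {
    "func_a": "def func_a(l):\n    return max(l)\n",
    "func_b": (
        "def func_b(l):\n"
        "    max_element = l[0]\n"
        "    for i in l:\n"
        "        if i > max_element:\n"
        "            max_element = i\n"
        "    return max_element\n"
    ),
    "func_c": (
        "def func_c(l):\n"
        "    max_element = l[0]\n"
        "    return max_element\n"
    ),
}

# Hypothetical probe inputs used to compare behavior.
PROBES = [[3, 1, 2], [1, 2, 3], [5], [-1, -7]]


def tokenize(source):
    """Split source into identifiers/numbers and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", source)


def lexical_similarity(name_x, name_y):
    """Token-level similarity in [0, 1]; a proxy for surface-form closeness."""
    tokens_x = tokenize(SOURCES[name_x])
    tokens_y = tokenize(SOURCES[name_y])
    return SequenceMatcher(None, tokens_x, tokens_y).ratio()


def behavior(name):
    """Run the named function on every probe and collect its outputs."""
    namespace = {}
    exec(SOURCES[name], namespace)  # acceptable here: the source is our own
    return tuple(namespace[name](list(probe)) for probe in PROBES)


def functionally_equivalent(name_x, name_y):
    """True if both functions agree on all probe inputs."""
    return behavior(name_x) == behavior(name_y)


for x, y in [("func_a", "func_b"), ("func_b", "func_c")]:
    print(x, y, round(lexical_similarity(x, y), 2), functionally_equivalent(x, y))
```

On this toy setup the surface-level score ranks func_b closer to func_c than to func_a, while the behavioral check pairs func_a with func_b and separates func_c. A diversity measure built only on the first signal would therefore group the wrong functions together.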
For the sake of a controlled experiment, the analysis is run on the following models: