Large Language Models are Few-shot Testers

In this post we’ll review the paper Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction.


Background

Writing test cases for bugs is a critical yet tedious part of software development. How much developer time is spent on this? The authors analyze 300 open source Java projects and find that on average 28% of tests were added as part of a bug fix.

Most existing automated test generation tools focus on maximizing code coverage rather than reproducing issues described in natural language [1]. There has been some progress on reproducing crashes from stack traces [2], but not general bugs. This means developers still manually write most bug reproducing tests.

To automate this tedious process, researchers from KAIST proposed LIBRO, which uses Large Language Models (LLMs) to generate tests from bug reports [3].

How LIBRO Works

LIBRO uses the Codex model to generate tests from bug reports. The core idea is to construct a prompt containing the bug report, along with an example bug-report/test pair (hence "few-shot"), and ask the LLM to provide a reproducing test. Codex is queried multiple times to obtain a set of candidate tests. The raw LLM outputs are then post-processed into executable Java tests. Finally, LIBRO ranks the candidates and injects the top-ranked test into the most appropriate test suite file, so that the dependencies the test needs are available.
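To make this concrete, here is a minimal sketch of how a LIBRO-style prompt could be assembled from a bug report. Everything in it (the BugReport record, the example pair, the exact template wording) is a hypothetical illustration rather than the paper's actual implementation, and the call to the model itself is left out.

```java
// Minimal sketch of LIBRO-style prompt construction (hypothetical, not the paper's code).
public class PromptBuilder {

    // Hypothetical record holding the fields drawn from a bug report.
    record BugReport(String title, String body) {}

    // An example bug-report/test pair shown to the model before the target report,
    // mirroring the few-shot structure described above (content is illustrative).
    private static final String EXAMPLE =
        "# Issue: NumberUtils.createNumber() throws for valid input\n" +
        "## Reproducing test\n" +
        "public void testCreateNumber() {\n" +
        "    assertEquals(1234L, NumberUtils.createNumber(\"1234L\"));\n" +
        "}\n\n";

    // Build the prompt: example pair first, then the target report,
    // ending with a cue that asks the model to complete a reproducing test.
    static String buildPrompt(BugReport report) {
        return EXAMPLE
            + "# Issue: " + report.title() + "\n"
            + report.body() + "\n"
            + "## Reproducing test\n"
            + "public void test";   // the model completes the test body from here
    }

    public static void main(String[] args) {
        BugReport report = new BugReport(
            "DateUtils.truncate() off by one hour across DST boundary",
            "Truncating a date that falls on a DST change returns the wrong hour.");
        // In LIBRO this prompt would be sent to the LLM several times to collect
        // a set of candidate tests; the querying step is omitted in this sketch.
        System.out.println(buildPrompt(report));
    }
}
```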

We can say that a test successfully reproduces the bug if it fails on the commit where the bug is reported, and passes on the commit where the bug is fixed.
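To make the definition concrete, here is a hypothetical illustration (not a real Defects4J bug): imagine a report complaining that a range check wrongly rejects its upper boundary value. A reproducing test simply asserts the behaviour the report expects, so it fails before the fix and passes after it.

```java
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class RangeCheckTest {

    // Hypothetical class under test. On the buggy commit the check reads
    // 'value < max', wrongly excluding the upper bound; the fix changes it
    // to 'value <= max' as shown here.
    static class Range {
        final int min;
        final int max;

        Range(int min, int max) {
            this.min = min;
            this.max = max;
        }

        boolean contains(int value) {
            return value >= min && value <= max; // fixed behaviour
        }
    }

    // The reproducing test encodes the behaviour the bug report expects:
    // it fails on the buggy commit and passes once the fix is applied.
    @Test
    public void testUpperBoundIsAccepted() {
        assertTrue(new Range(0, 10).contains(10));
    }
}
```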

The approach is evaluated on Defects4J (a dataset of real bug reports and their fixes from open source Java projects), where it successfully reproduces roughly a third of the studied bugs.

Interpretation

These results are promising, but with a success rate of around 33% the approach isn't quite ready for prime time. Bear in mind that the Codex model used in the paper is quite old now; more recent models, with larger context windows, would almost certainly perform better.

One concern with this study is training data contamination: the Defects4J dataset used for evaluation may have been part of the Codex model's training data. The authors acknowledge this risk and mitigate it by also evaluating LIBRO on 31 recent issues. Impressively, it reproduced 32% of these “wild” bugs, and its selection and ranking remained accurate. To be fair, though, this is a pretty small dataset.

I manually inspected the Defects4J dataset and found two things worth noting. First, the bug reports are very technical in nature, unlike those that would be submitted by an end user of a SaaS application, for example. This is because they are all taken from issues reported on open source Java repositories.

Secondly, a brief eyeball reveals that many of the bug reports already contain tests. I did a quick sanity check to see whether the presence of the word “test” in a bug report had any effect on success: 35% of the successfully reproduced bugs had reports containing the word “test”, versus 33% of the unsuccessfully reproduced ones. The difference is negligible, which suggests that LIBRO can genuinely synthesize new tests rather than simply lifting tests already included in the report.
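For anyone who wants to repeat this kind of check, the idea is just to compare the fraction of reports mentioning “test” between reproduced and non-reproduced bugs. Here is a rough sketch; the data loading is omitted and the report texts are made-up stand-ins, not actual Defects4J entries.

```java
import java.util.Map;

public class TestWordSanityCheck {

    // Fraction of reports with the given outcome whose text mentions "test".
    static double fractionMentioningTest(Map<String, Boolean> reportToOutcome,
                                         boolean reproduced) {
        long total = reportToOutcome.values().stream()
                .filter(outcome -> outcome == reproduced)
                .count();
        long mentioning = reportToOutcome.entrySet().stream()
                .filter(entry -> entry.getValue() == reproduced)
                .filter(entry -> entry.getKey().toLowerCase().contains("test"))
                .count();
        return total == 0 ? 0.0 : (double) mentioning / total;
    }

    public static void main(String[] args) {
        // Toy stand-in data: bug report text -> whether LIBRO reproduced the bug.
        Map<String, Boolean> reports = Map.of(
                "NPE when parsing an empty string, failing test attached", true,
                "Wrong rounding in currency formatter", true,
                "Crash in date parser, see the test below", false,
                "Incorrect HTML escaping of ampersands", false);

        System.out.printf("reproduced: %.0f%%, not reproduced: %.0f%%%n",
                100 * fractionMentioningTest(reports, true),
                100 * fractionMentioningTest(reports, false));
    }
}
```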

Impact

Automating bug reproduction has far-reaching implications. Firstly, it directly boosts developer productivity and satisfaction; surveys show that developers want this kind of automation [5].

Secondly, bug-reproducing tests enable techniques like automated program repair, which currently struggle due to a lack of test oracles [6]. Generating tests from bug reports could expand the applicability of these techniques, letting us automate away more of the tedious maintenance work in the software development lifecycle.

Next Steps

We’re working on a new venture unlocking some of these untapped opportunities for high performance teams. If you’re a developer or you’re managing an engineering team and you want to gain a competitive advantage then reach out!

I’m also interested in how you would like to see AI applied within your team — whether that’s bug triage, testing, code reviews, documentation generation or something completely different. Where’s the pain? — john@atchai.com.


References

[1] G. Fraser and A. Arcuri. Whole test suite generation. IEEE Transactions on Software Engineering, 2013.

[2] M. Soltani et al. Search-based crash reproduction. Empirical Software Engineering, 2020.

[3] S. Kang, J. Yoon, and S. Yoo. Large language models are few-shot testers: Exploring LLM-based general bug reproduction. arXiv preprint arXiv:2209.11515, 2022.

[5] E. Daka and G. Fraser. A survey on unit testing practices and problems. IEEE ISSRE 2014.

[6] A. Koyuncu et al. iFixR: Bug report driven program repair. ESEC/FSE 2019.