LLMs in Software Testing: Program-Agnostic Unit Test Generation

Automated test generation has become a critical aspect of modern software development. As software systems grow in complexity, writing exhaustive tests manually becomes time-consuming and error-prone, and can delay release cycles. Automated unit test generation helps developers create reliable, well-tested code efficiently, saving time and improving code quality.

By leveraging tools that can automatically generate tests, developers can focus more on feature development while ensuring robust test coverage. Moreover, automated tests can be generated in seconds, enabling seamless integration into CI/CD pipelines and accelerating feedback loops. This approach is highly cost-effective, as it minimizes the effort needed to maintain manual tests and helps prevent unforeseen issues later in the development process.

In this article, we provide an overview of two open-source solutions that build on research from Meta, discussing their strengths and limitations.

CodiumAI: Cover-Agent 

Among the tools available for automated test generation, CodiumAI's Cover-Agent stands out for its program-agnostic approach: it automates unit test generation by leveraging large language models (LLMs). Its development is inspired by Meta's TestGen-LLM, a research framework focused on improving automated unit test generation. The tool's primary objective is to increase code coverage by automatically generating comprehensive and effective tests, making it a practical addition to modern software development workflows.

One of the key features of Cover-Agent is its ability to iteratively generate tests until a desired coverage level is achieved. Developers can set a code coverage target, and the tool will continue generating new tests in multiple iterations, stopping only when the coverage goal is met or the maximum iteration count is reached. This iterative approach ensures that tests are refined over time and that coverage increases with each cycle.
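To make the idea concrete, the loop can be pictured roughly as follows. This is a minimal sketch of the iterative approach, not Cover-Agent's actual implementation; generate_candidate_tests, run_and_measure_coverage, and append_test are hypothetical placeholders standing in for the tool's internals.

```python
# Minimal sketch of a coverage-driven generation loop (not Cover-Agent's real code).
# generate_candidate_tests, run_and_measure_coverage and append_test are hypothetical helpers.

def generate_until_covered(source_file: str, test_file: str,
                           desired_coverage: float = 0.9,
                           max_iterations: int = 5) -> float:
    coverage = run_and_measure_coverage(test_file)            # current coverage, 0.0-1.0
    for _ in range(max_iterations):
        if coverage >= desired_coverage:
            break                                             # coverage goal reached
        # Ask the LLM for new test candidates based on the source and existing tests.
        for candidate in generate_candidate_tests(source_file, test_file):
            new_coverage = run_and_measure_coverage(test_file, extra_test=candidate)
            if new_coverage > coverage:                       # keep only tests that add coverage
                append_test(test_file, candidate)
                coverage = new_coverage
    return coverage
```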

The tool additionally integrates a coverage parser that validates the effectiveness of generated tests. After running the tests, the parser checks whether the newly generated tests contribute to increasing overall code coverage (including branch and statement coverage). This feedback loop helps ensure that tests generated by the AI add meaningful value. 
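As an illustration of what such a check might look like, the sketch below parses a Cobertura-style XML coverage report (a common format produced by coverage tooling) and compares line and branch coverage before and after a candidate test is added. The report format and helper names are assumptions made for illustration.

```python
# Sketch of a coverage parser for a Cobertura-style XML report.
# Assumes the report exposes line-rate and branch-rate attributes on its root element.
import xml.etree.ElementTree as ET

def parse_coverage(report_path: str) -> tuple[float, float]:
    root = ET.parse(report_path).getroot()
    line_rate = float(root.attrib["line-rate"])      # fraction of statements covered
    branch_rate = float(root.attrib["branch-rate"])  # fraction of branches covered
    return line_rate, branch_rate

def coverage_increased(report_before: str, report_after: str) -> bool:
    line_b, branch_b = parse_coverage(report_before)
    line_a, branch_a = parse_coverage(report_after)
    # A new test only adds value if it improves at least one metric without regressing the other.
    return line_a >= line_b and branch_a >= branch_b and (line_a, branch_a) != (line_b, branch_b)
```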

Moreover, the test generation process includes a filtration mechanism to ensure that only high-quality tests are retained. A key aspect of this approach is that only tests that pass on the first execution are accepted. This guarantees that the tests can be used reliably for regression testing, thus protecting the code from future bugs. If a test fails on the first run, it is discarded, even if it might have identified a potential bug. This design choice helps avoid flaky or unreliable tests, which are common problems in large-scale software testing. 
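A simplified version of that acceptance rule might look like the sketch below, which reuses the parse_coverage helper sketched above. The function names and the assumption that the test run writes an updated coverage report are illustrative, not Cover-Agent's actual API.

```python
# Sketch of the "keep only tests that pass on first execution" filter.
# test_command is the project's normal test command (e.g. "pytest -q"); the run is
# assumed to write an updated coverage report to coverage_report.
import subprocess

def accept_candidate(test_command: str,
                     coverage_report: str,
                     coverage_before: float) -> bool:
    # Run the suite once with the candidate test already added to the test file.
    result = subprocess.run(test_command, shell=True)
    if result.returncode != 0:
        # A failing candidate is discarded outright, even if the failure points at a
        # real bug, so that every retained test stays green for regression runs.
        return False
    # A passing candidate is kept only if it raises coverage above the previous baseline.
    line_rate, _ = parse_coverage(coverage_report)   # parser sketched above
    return line_rate > coverage_before
```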

However, this strict filtration has a potential downside: tests that discover actual bugs may be rejected. Because the filtration process does not distinguish between a test that fails due to a bug in the code and one that fails because it was generated incorrectly, it can discard useful tests that highlight real issues. This is a trade-off that keeps all retained tests stable and useful for regression testing, but it also means missed opportunities for early bug detection.

Cover-Agent includes a feature that allows developers to provide additional contextual files alongside the source code under test. When libraries, design documents, or even configuration files are supplied, the tool can generate more relevant and context-aware tests. This improves the generated tests' ability to cater to specific project needs and to align with the structure and behavior of the codebase under test.
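One way to picture this is as extra sections appended to the generation prompt. The prompt layout below is purely illustrative and is not Cover-Agent's actual prompt format.

```python
# Illustrative sketch of folding extra context files into the generation prompt.
# The prompt layout here is an assumption for illustration, not Cover-Agent's actual prompt.
from pathlib import Path

def build_prompt(source_file: str, test_file: str, context_files: list[str]) -> str:
    sections = [
        "Source under test:\n" + Path(source_file).read_text(),
        "Existing tests:\n" + Path(test_file).read_text(),
    ]
    for ctx in context_files:
        # Design documents, configuration, or helper modules the model should see.
        sections.append(f"Additional context ({ctx}):\n" + Path(ctx).read_text())
    sections.append("Write new unit tests that increase coverage of the source under test.")
    return "\n\n".join(sections)
```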

However, while this is a step in the right direction, the use of Retrieval-Augmented Generation (RAG) could significantly enhance this capability. RAG combines retrieval mechanisms with generative models, enabling the tool to dynamically pull relevant context from large codebases or documentation. This would allow the model to generate even more accurate and contextually rich tests, automatically sourcing important details without needing the developer to manually specify additional files. Implementing RAG would help the tool generate more insightful tests, especially in large, evolving codebases where manual file specification can be burdensome and error-prone.
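As a rough sketch of what the retrieval step could look like, the snippet below embeds code and documentation chunks and selects the ones most similar to the code under test via cosine similarity. embed_fn stands in for whatever embedding model would be used; nothing here is tied to a specific library or to how either tool actually works today.

```python
# Rough sketch of the retrieval step in a RAG setup: embed repository chunks once,
# then pull the most relevant ones into the prompt at generation time.
# embed_fn is a placeholder for any embedding model returning a fixed-size vector.
from typing import Callable
import numpy as np

def top_k_context(query: str,
                  chunks: list[str],
                  embed_fn: Callable[[str], np.ndarray],
                  k: int = 5) -> list[str]:
    query_vec = embed_fn(query)                            # e.g. the function under test
    chunk_vecs = np.stack([embed_fn(c) for c in chunks])   # code or documentation chunks
    # Cosine similarity between the query and every chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    best = np.argsort(sims)[::-1][:k]                      # indices of the k most similar chunks
    return [chunks[i] for i in best]
```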

Mutahunter 

Mutahunter, like CodiumAI's Cover-Agent, is built on the same foundational research from Meta, similarly leveraging large language models (LLMs) to improve automated unit test generation. While Cover-Agent focuses on generating new tests, Mutahunter goes a step further by using mutation testing to verify the robustness of those tests. Mutation testing involves introducing small changes, or "mutations," into the code to assess whether the existing tests can detect them. This dual functionality allows Mutahunter to create tests that not only improve coverage but also ensure resilience against potential bugs by catching subtle flaws in the code that regular tests may miss.

To this end, Mutahunter operates in two key modes to ensure comprehensive test coverage: line coverage and mutation coverage. In line coverage mode, the tool analyzes untested portions of the code and generates new unit tests to cover these areas, which helps increase overall code coverage. In mutation coverage mode, Mutahunter takes a more sophisticated approach, aligning with methods like MuTAP, by validating test robustness against targeted code mutations, thereby boosting the fault-detection capabilities of tests generated with LLMs. If the tests fail to detect the mutations, Mutahunter highlights weaknesses in the test suite, providing insights into where the tests can be improved.
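The core idea behind mutation coverage can be illustrated with a toy mutant: change a single operator in the source, run the existing tests, and see whether they fail (i.e. "kill" the mutant). This is only a sketch of the principle; Mutahunter's actual mutation operators and workflow are more sophisticated.

```python
# Toy illustration of mutation testing: create one mutant by flipping a '+' to a '-',
# run the existing tests, and check whether they fail (i.e. the mutant is "killed").
# This sketches the principle only; it is not Mutahunter's implementation.
import ast
import shutil
import subprocess

class FlipAddToSub(ast.NodeTransformer):
    """Replace the first '+' operator with '-' to produce a single mutant."""
    def __init__(self):
        self.done = False
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.op, ast.Add):
            node.op, self.done = ast.Sub(), True
        return node

def mutant_is_killed(source_path: str, test_command: str) -> bool:
    shutil.copy(source_path, source_path + ".orig")             # keep the original safe
    with open(source_path) as f:
        tree = ast.parse(f.read())
    mutated_source = ast.unparse(FlipAddToSub().visit(tree))    # requires Python 3.9+
    with open(source_path, "w") as f:
        f.write(mutated_source)                                 # install the mutant
    try:
        # A non-zero exit code means the tests failed, i.e. the mutant was detected.
        return subprocess.run(test_command, shell=True).returncode != 0
    finally:
        shutil.move(source_path + ".orig", source_path)         # restore the original
```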

However, the complexity of mutation testing and the computational cost it incurs can be challenging, especially in large-scale projects, where running mutation tests across the entire codebase might introduce performance bottlenecks. Nevertheless, Mutahunter remains an essential tool for improving test robustness and reliability in production environments. 


Want to learn more about how we work with LLMs and quality assurance?