ESEC/FSE

FSE 2025

Detecting and Reducing the Factual Hallucinations of Large Language Models with Metamorphic Testing

Abstract: Question answering (QA) is a fundamental task of a large language model (LLM), which requires LLM to automatically answer human-posed questions in natural language. However, LLMs are known to distort facts and make non-factual statements (hallucination) when dealing with QA tasks, which may affect the deployment of LLMs in real-life situations. In this work, we present DrHall, a method for the detection of factual errors in black-box large language models inspired by metamorphosis testing in software testing. We believe that the model’s hallucination answer is unstable. It is easier to produce different answers to the hallucination by using metamorphic relation (MR) to make the model take different execution paths for re-execution. We empirically evaluate DrHall on three datasets covering natural and code language data, finding that it outperforms existing methods and baselines, often by a large gap. In addition, by transforming DrHall using diverse path sampling, we obtain error correction methods with higher success rates. Our results demonstrate the potential of using MR to mitigate LLM hallucination.

Hallucination Detection in Large Language Models with Metamorphic Relations

Abstract: Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. There are also methods depending on output probabilities, which are often inaccessible for closed-source LLMs like GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic testing and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs. MetaQA is based on the hypothesis that if an LLM’s response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets, and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and f1 score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 - 0.113 (for precision), 0.143 - 0.430 (for recall), and 0.154 - 0.368 (for F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT’s F1-score of 0.205, representing an improvement rate of 112.2%. MetaQA also demonstrates superiority across all different categories of questions.

FSE 2024

Abstraction-Aware Inference of Metamorphic Relations

Paper

Abstract: Metamorphic testing is a valuable technique that helps in dealing with the oracle problem. It involves testing software against specifications of its intended behavior given in terms of so called metamorphic relations, statements that express properties relating different software elements (e.g., different inputs, methods, etc). The effective application of metamorphic testing strongly depends on identifying suitable domain-specific metamorphic relations, a challenging task, that is typically manually performed.

This paper introduces MemoRIA, a novel approach that aims at automatically identifying metamorphic relations. The technique focuses on a particular kind of metamorphic relation, which asserts equivalences between methods and method sequences. MemoRIA works by first generating an object-protocol abstraction of the software being tested, then using fuzzing to produce candidate relations from the abstraction, and finally validating the candidate relations through run-time analysis. A SAT-based analysis is used to eliminate redundant relations, resulting in a concise set of metamorphic relations for the software under test. We evaluate our technique on a benchmark consisting of 22 Java subjects taken from the literature, and compare MemoRIA with the metamorphic relation inference technique SBES. Our results show that by incorporating the object protocol abstraction information, MemoRIA is able to more effectively infer meaningful metamorphic relations, that are also more precise, compared to SBES, measured in terms of mutation analysis. Also, the SAT-based reduction allows us to significantly reduce the number of reported metamorphic relations, while in general having a small impact in the bug finding ability of the corresponding obtained relations.

A Miss Is as Good as A Mile: Metamorphic Testing for Deep Learning Operators

Paper
Code

Abstract: Deep learning (DL) is a critical tool for real-world applications, and comprehensive testing of DL models is vital to ensure their quality before deployment. However, recent studies have shown that even subtle deviations in DL operators can result in catastrophic consequences, underscoring the importance of rigorous testing of these components. Unlike testing other DL system components, operator analysis poses unique challenges due to complex inputs and uncertain outputs. The existing DL operator testing approach has limitations in terms of testing efficiency and error localization. In this paper, we propose Meta, a novel operator testing framework based on metamorphic testing that automatically tests and assists bug location based on metamorphic relations (MRs). Meta distinguishes itself in three key ways: (1) it considers both parameters and input tensors to detect operator errors, enabling it to identify both implementation and precision errors; (2) it uses MRs to guide the generation of more effective inputs (i.e., tensors and parameters) in less time; (3) it assists the precision error localization by tracing the error to the input level of the operator based on MR violations. We designed 18 MRs for testing 10 widely used DL operators. To assess the effectiveness of Meta, we conducted experiments on 13 released versions of 5 popular DL libraries. Our results revealed that Meta successfully detected 41 errors, including 14 new ones that were reported to the respective platforms and 8 of them are confirmed/fixed. Additionally, Meta demonstrated high efficiency, outperforming the baseline by detecting ∼2 times more errors of the baseline. Meta is open-sourced and available at https://github.com/TDY-raedae/Medi-Test.

Metamorphic Testing of Secure Multi-Party Computation (MPC) Compilers

Paper

The demanding need to perform privacy-preserving computations among multiple data owners has led to the prosperous development of secure multi-party computation (MPC) protocols. MPC offers protocols for parties to jointly compute a function over their inputs while keeping those inputs private. To date, MPC has been widely adopted in various real-world, privacy-sensitive sectors, such as healthcare and finance. Moreover, to ease the adoption of MPC, industrial and academic MPC compilers have been developed to automatically translate programs describing arbitrary MPC procedures into low-level MPC executables.

Compiling high-level descriptions into high-efficiency MPC executables is challenging: the compilation often involves converting high-level languages into several intermediate representations (IR), e.g., arithmetic or boolean circuits, optimizing the computation/communication cost, and picking proper MPC protocols (and underlying virtual machines) for a particular task and threat model. Various optimizations and heuristics are employed during the compilation procedure to improve the efficiency of the generated MPC executables.

Despite the prosperous adoption of MPC compilers by industrial vendors and academia, a principled and systematic understanding of the correctness of MPC compilers does not yet exist. To fill this critical gap, this paper introduces MT-MPC, a metamorphic testing (MT) framework specifically designed for MPC compilers to effectively uncover erroneous compilations. Our approach proposes three metamorphic relations (MRs) that are tailored for MPC programs to mutate high-level MPC programs (compiler inputs). We then examine if MPC compilers yield semantics-equivalent MPC executables regarding the original and mutated MPC programs by comparing their execution results.

Real-world MPC compilers exhibit a high level of engineering quality. Nevertheless, we detected 4,772 inputs that can result in erroneous compilations in three popular MPC compilers available on the market. While the discovered error-triggering inputs do not cause the MPC compilers to crash directly, they can lead to the generation of incorrect MPC executables, jeopardizing the underlying dependability of the computation. With substantial manual effort and help from the MPC compiler developers, we uncovered thirteen bugs in these MPC compilers by debugging them using the error-triggering inputs. Our proposed testing frameworks and findings can be used to guide developers in their efforts to improve MPC compilers.

FSE 2021

Generating Metamorphic Relations for Cyber-physical Systems with Genetic Programming: An Industrial Case Study

Paper

Abstract: One of the major challenges in the verification of complex industrial Cyber-Physical Systems is the difficulty of determining whether a particular system output or behaviour is correct or not, the so-called test oracle problem. Metamorphic testing alleviates the oracle problem by reasoning on the relations that are expected to hold among multiple executions of the system under test, which are known as Metamorphic Relations (MRs). However, the development of effective MRs is often challenging and requires the involvement of domain experts. In this paper, we present a case study aiming at automating this process. To this end, we implemented GAssertMRs, a tool to automatically generate MRs with genetic programming. We assess the cost-effectiveness of this tool in the context of an industrial case study from the elevation domain. Our experimental results show that in most cases GAssertMRs outperforms the other baselines, including manually generated MRs developed with the help of domain experts. We then describe the lessons learned from our experiments and we outline the future work for the adoption of this technique by industrial practitioners.

New Visions on Metamorphic Testing after a Quarter of a Century of Inception

Paper

Abstract: Metamorphic testing (MT) was introduced about a quarter of a century ago. It is increasingly being accepted by researchers and the industry as a useful testing technique. The studies, research results, applications, and extensions of MT have given us many insights and visions for its future. Our visions include: MRs will be a practical means to top up test case generation techniques, beyond the alleviation of the test oracle problem; MT will not only be a standalone technique, but conveniently integrated with other methods; MT and MRs will evolve beyond software testing, or even beyond verification; MRs may be anything that you can imagine, beyond the necessary properties of algorithms; MT research will be beyond empirical studies and move toward a theoretical foundation; MT will not only bring new concepts to software testing but also new concepts to other disciplines; MRs will alleviate the reliable test set problem beyond traditional approaches. These visions may help researchers explore the challenges and opportunities for MT in the next decade.

Metamorphic Testing of Datalog Engines

Paper

Abstract: Datalog is a popular query language with applications in several domains. Like any complex piece of software, Datalog engines may contain bugs. The most critical ones manifest as incorrect results when evaluating queries—we refer to these as query bugs. Given the wide applicability of the language, query bugs may have detrimental consequences, for instance, by compromising the soundness of a program analysis that is implemented and formalized in Datalog. In this paper, we present the first metamorphic-testing approach for detecting query bugs in Datalog engines. We ran our tool on three mature engines and found 13 previously unknown query bugs, some of which are deep and revealed critical semantic issues.