
ICSE


ICSE 2025

ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?

Hallucinations, the tendency to produce irrelevant or incorrect responses, are a prevalent concern in generative AI-based tools like ChatGPT. Although hallucinations in ChatGPT have been studied for textual responses, it is unknown how ChatGPT hallucinates on technical texts that contain both natural-language and technical terms. We surveyed 47 software engineers and produced a benchmark of 412 Q&A pairs from the bug reports of two OSS projects. We find that a RAG-based ChatGPT (i.e., ChatGPT tuned with the benchmark issue reports) is correct for only 36.4% of the questions, due to two reasons: 1) limitations in understanding complex technical content in code snippets such as stack traces, and 2) limitations in integrating the contexts denoted in the technical terms and texts. We present CHIME (ChatGPT Inaccuracy Mitigation Engine), whose underlying principle is that if we can better preprocess the technical reports and guide the query validation process in ChatGPT, we can address the observed limitations. CHIME uses a context-free grammar (CFG) to parse stack traces in technical reports. CHIME then verifies and fixes ChatGPT responses by applying metamorphic testing and query transformation. On our benchmark, CHIME corrects 30.3% more responses than ChatGPT alone. In a user study, we find that the improved responses produced with CHIME are considered more useful than those generated by ChatGPT without CHIME.
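
To make the response-validation idea concrete, here is a minimal Python sketch of a metamorphic consistency check on LLM answers, in the spirit of (but not identical to) CHIME's query transformation step. The helpers `paraphrase`, `semantically_equal`, and the injected `ask_llm` callable are hypothetical placeholders, not CHIME's actual API.

```python
# Minimal sketch of metamorphic response validation: a meaning-preserving
# query transformation should yield a consistent answer; disagreement flags
# a likely hallucination. All helper names are illustrative assumptions.

def paraphrase(query: str) -> str:
    # Hypothetical query transformation; a real system would reorder clauses
    # or swap synonyms while preserving the question's meaning.
    return "In other words: " + query

def semantically_equal(a: str, b: str) -> bool:
    # Placeholder similarity check; a real system would use embeddings or an
    # entailment model instead of simple token overlap.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1) > 0.6

def validate_answer(ask_llm, query: str) -> tuple[str, bool]:
    """Return the original answer plus a flag saying whether it survived the
    metamorphic consistency check."""
    original = ask_llm(query)
    transformed = ask_llm(paraphrase(query))
    return original, semantically_equal(original, transformed)
```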

ROSA: Finding Backdoors with Fuzzing

A code-level backdoor is a hidden access point, programmed and concealed within the code of a program. For instance, hard-coded credentials planted in the code of a file server application would enable maliciously logging into all deployed instances of this application. Confirmed software supply-chain attacks have led to the injection of backdoors into popular open-source projects, and backdoors have been discovered in various router firmware. Manual code auditing for backdoors is challenging, and existing semi-automated approaches can handle only a limited scope of programs and backdoors while requiring manual reverse-engineering of the audited (binary) program. Graybox fuzzing (automated semi-randomized testing) has grown in popularity due to its success in discovering vulnerabilities and hence stands as a strong candidate for improved backdoor detection. However, current fuzzing knowledge does not offer any means to detect the triggering of a backdoor at runtime.

In this work we introduce ROSA, a novel approach (and tool) that combines a state-of-the-art fuzzer (AFL++) with a new metamorphic test oracle capable of detecting runtime backdoor triggers. To facilitate the evaluation of ROSA, we have created ROSARUM, the first openly available benchmark for assessing the detection of various backdoors in diverse programs. Experimental evaluation shows that ROSA has a level of robustness, speed, and automation similar to classical fuzzing. It finds all 17 authentic or synthetic backdoors from ROSARUM in 1 h 30 min on average. Compared to existing detection tools, it can handle a diversity of backdoors and programs, and it does not rely on manual reverse-engineering of the fuzzed binary code.
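
As a rough illustration of what a metamorphic oracle for backdoor triggers could look like, here is a minimal sketch under strong simplifying assumptions: it compares the observable behavior of an input against that of a near-identical sibling input and flags behaviors that appear only for the original. The `run_and_trace` callable and the byte-flip mutation are hypothetical; ROSA's real oracle is integrated with AFL++ and is considerably more involved.

```python
# Hypothetical sketch of a metamorphic backdoor-trigger oracle.
# `run_and_trace(prog, data)` is assumed to execute the target and return a
# set of observed runtime behaviors (e.g., files opened, branches taken).

def mutate(data: bytes) -> bytes:
    # Trivial metamorphic sibling: flip one byte of the input.
    if not data:
        return b"\x00"
    return bytes([data[0] ^ 0xFF]) + data[1:]

def suspicious_behaviors(run_and_trace, prog: str, data: bytes) -> set:
    """Behaviors that appear for `data` but vanish for a near-identical
    sibling are candidate backdoor triggers (e.g., an authentication bypass
    taken only for one magic credential)."""
    base = run_and_trace(prog, data)
    sibling = run_and_trace(prog, mutate(data))
    return base - sibling
```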

Knowledge distillation compresses large language models (LLMs) into more compact and efficient versions that achieve similar accuracy on code-related tasks. However, as we demonstrate in this study, compressed models are four times less robust than the original LLMs when evaluated with metamorphic code: they have a 440% higher probability of misclassifying code clones due to minor changes in the code fragment under analysis, such as replacing parameter names with synonyms. To address this issue, we propose MORPH, a method that combines metamorphic testing with many-objective optimization for robust distillation of LLMs for code. MORPH efficiently explores the models' configuration space and generates Pareto-optimal models that effectively balance accuracy, efficiency, and robustness to metamorphic code. Metamorphic testing measures robustness as the number of code fragments for which a model incorrectly makes different predictions between the original and their equivalent metamorphic variants (prediction flips). We evaluate MORPH on two tasks, code clone and vulnerability detection, targeting CodeBERT and GraphCodeBERT for distillation. Our comparison includes MORPH, the state-of-the-art distillation method AVATAR, and the fine-tuned non-distilled LLMs. Compared to AVATAR, MORPH produces compressed models that are (i) 47% more robust, (ii) 25% more efficient (fewer FLOPs), while maintaining (iii) equal or higher accuracy (up to +6%), and (iv) similar model size.
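
The prediction-flip robustness measure described above is straightforward to express in code. Below is a minimal Python sketch; the `model.predict` interface and the `variants` mapping are illustrative assumptions rather than MORPH's actual API.

```python
# Minimal sketch of the prediction-flip robustness metric.
# `model` is any classifier exposing `predict(code: str) -> int`;
# `variants` maps each original code fragment to its semantically
# equivalent metamorphic variants.

def prediction_flips(model, variants: dict[str, list[str]]) -> int:
    """Count fragments where the model disagrees with itself on at least one
    equivalent variant (lower is more robust)."""
    flips = 0
    for original, equivalents in variants.items():
        base = model.predict(original)
        if any(model.predict(v) != base for v in equivalents):
            flips += 1
    return flips
```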

ICSE 2024

EDEFuzz: A Web API Fuzzer for Excessive Data Exposures

APIs often transmit far more data to client applications than they need, and in the context of web applications, often do so over public channels. This issue, termed Excessive Data Exposure (EDE), was OWASP’s third most significant API vulnerability of 2019. However, there are few automated tools—either in research or industry—to effectively find and remediate such issues. This is unsurprising as the problem lacks an explicit test oracle: the vulnerability does not manifest through explicit abnormal behaviours (e.g., program crashes or memory access violations).

In this work, we develop a metamorphic relation to tackle that challenge and build the first fuzzing tool, which we call EDEFuzz, to systematically detect EDEs. EDEFuzz can significantly reduce the false negatives that occur with manual inspection and ad-hoc text matching, currently the most widely used approaches.

We tested EDEFuzz against the sixty-nine applicable targets from the Alexa Top-200 and found 33,365 potential leaks, illustrating our tool's broad applicability and scalability. In a more tightly controlled experiment on eight popular websites in Australia, EDEFuzz achieved a high true positive rate of 98.65% with minimal configuration, demonstrating our tool's accuracy and efficiency.
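
The underlying metamorphic relation can be sketched compactly: if deleting a field from an API response leaves the client-rendered page unchanged, the client never needed that field, so it is a candidate excessive exposure. The `render_page` callable below is a hypothetical stand-in for replaying the mutated response to the real web client; this is an illustration of the idea, not EDEFuzz's implementation.

```python
# Minimal sketch of the excessive-data-exposure metamorphic relation.
import copy

def candidate_exposures(render_page, response: dict) -> list[str]:
    """Fields whose removal does not change the rendered page are flagged
    as potentially excessive data exposures."""
    baseline = render_page(response)
    exposed = []
    for field in response:
        mutated = copy.deepcopy(response)
        del mutated[field]
        if render_page(mutated) == baseline:
            exposed.append(field)  # page looks identical without this field
    return exposed
```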

ICSE 2023

MorphQ: Metamorphic Testing of the Qiskit Quantum Computing Platform

Abstract: As quantum computing is becoming increasingly popular, the underlying quantum computing platforms are growing both in ability and complexity. Unfortunately, testing these platforms is challenging due to the relatively small number of existing quantum programs and because of the oracle problem, i.e., a lack of specifications of the expected behavior of programs. This paper presents MorphQ, the first metamorphic testing approach for quantum computing platforms. Our two key contributions are (i) a program generator that creates a large and diverse set of valid (i.e., non-crashing) quantum programs, and (ii) a set of program transformations that exploit quantum-specific metamorphic relationships to alleviate the oracle problem. Evaluating the approach by testing the popular Qiskit platform shows that the approach creates over 8k program pairs within two days, many of which expose crashes. Inspecting the crashes, we find 13 bugs, nine of which have already been confirmed. MorphQ widens the slim portfolio of testing techniques of quantum computing platforms, helping to create a reliable software stack for this increasingly important field.
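
One quantum-specific metamorphic relation of the kind MorphQ exploits can be sketched as follows: inserting a gate immediately followed by its own inverse is a no-op overall, so the output distribution must not change beyond sampling noise. Both `run_counts` (execute a program and return measurement counts) and `add_identity_pair` (apply the no-op transformation) are hypothetical callables here, and the 0.05 tolerance is an assumption; MorphQ's generator and transformations are far richer.

```python
# Minimal sketch of an identity-insertion metamorphic relation for quantum
# programs. Helper callables are illustrative assumptions.

def total_variation_distance(counts_a: dict, counts_b: dict, shots: int) -> float:
    keys = set(counts_a) | set(counts_b)
    return 0.5 * sum(abs(counts_a.get(k, 0) - counts_b.get(k, 0)) for k in keys) / shots

def check_identity_insertion_relation(run_counts, add_identity_pair, program: str,
                                      shots: int = 4096, tolerance: float = 0.05) -> bool:
    """A gate followed by its inverse (e.g., X;X on one qubit) must leave
    the output distribution unchanged up to statistical noise."""
    follow_up = add_identity_pair(program)
    dist = total_variation_distance(run_counts(program, shots),
                                    run_counts(follow_up, shots), shots)
    return dist < tolerance
```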

MTTM: Metamorphic Testing for Textual Content Moderation Software

Abstract: The exponential growth of social media platforms such as Twitter and Facebook has revolutionized textual communication and textual content publication in human society. However, they have been increasingly exploited to propagate toxic content, such as hate speech, malicious advertisement, and pornography, which can lead to highly negative impacts (e.g., harmful effects on teen mental health). Researchers and practitioners have been enthusiastically developing and extensively deploying textual content moderation software to address this problem. However, we find that malicious users can evade moderation by changing only a few words in the toxic content. Moreover, the performance of modern content moderation software against malicious inputs remains underexplored. To this end, we propose MTTM, a Metamorphic Testing framework for Textual content Moderation software. Specifically, we conduct a pilot study on 2,000 text messages collected from real users and summarize eleven metamorphic relations across three perturbation levels: character, word, and sentence. MTTM employs these metamorphic relations on toxic textual content to generate test cases, which are still toxic yet likely to evade moderation. In our evaluation, we employ MTTM to test three commercial textual content moderation software products and two state-of-the-art moderation algorithms against three kinds of toxic content. The results show that MTTM achieves up to 83.9%, 51%, and 82.5% error finding rates (EFR) when testing commercial moderation software provided by Google, Baidu, and Huawei, respectively, and it obtains up to 91.2% EFR when testing the state-of-the-art algorithms from academia. In addition, we leverage the test cases generated by MTTM to retrain the model we explored, which largely improves model robustness (0%~5.9% EFR) while maintaining the accuracy on the original test set. A demo can be found in this link.
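
A character-level metamorphic relation of the kind MTTM summarizes can be sketched briefly: a toxic message should still be flagged after a small, meaning-preserving perturbation, and cases where the perturbed variant evades moderation count as errors. The `moderate` callable (returning True when text is flagged as toxic) and the specific perturbation below are illustrative assumptions, not MTTM's relations.

```python
# Minimal sketch of a character-level perturbation check for moderation
# software. Helper names are hypothetical.
import random

def perturb_characters(text: str, seed: int = 0) -> str:
    # Insert a separator inside one word; humans still read the same toxic
    # meaning, but naive models may miss it.
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    i = rng.randrange(len(words))
    w = words[i]
    if len(w) > 2:
        words[i] = w[:len(w) // 2] + "*" + w[len(w) // 2:]
    return " ".join(words)

def error_finding_rate(moderate, toxic_messages: list[str]) -> float:
    """Fraction of toxic inputs whose perturbed variant evades moderation
    even though the original was flagged."""
    evasions = sum(1 for m in toxic_messages
                   if moderate(m) and not moderate(perturb_characters(m)))
    return evasions / max(len(toxic_messages), 1)
```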

Metamorphic Shader Fusion for Testing Graphics Shader Compilers

Abstract: Computer graphics are powered by graphics APIs (e.g., OpenGL, Direct3D) and their associated shader compilers, which render high-quality images by compiling and optimizing user-written high-level shader programs into GPU machine code. Graphics rendering is extensively used in production scenarios like virtual reality (VR), gaming, autonomous driving, and robotics. Despite being developed by industrial manufacturers such as Intel, Nvidia, and AMD, shader compilers, like traditional software, may produce ill-rendered outputs. In turn, these errors may result in negative consequences, ranging from poor user experience in entertainment to accidents in driving assistance systems. This paper introduces FSHADER, a metamorphic testing (MT) framework designed specifically for shader compilers to uncover erroneous compilations and optimizations. FSHADER tests shader compilers by mutating input shader programs via four carefully designed metamorphic relations (MRs). In particular, FSHADER fuses two shader programs via an MR and checks the visual consistency between the image rendered from the fused shader program and the result of fusing the individually rendered images. Our study of 12 shader compilers covers five mainstream GPU vendors, including Intel, AMD, Nvidia, ARM, and Apple. We successfully uncover over 16K error-triggering inputs that generate incorrect rendering outputs. We manually localize and characterize the buggy optimizations, and developers have confirmed representative bugs.
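
The fusion relation can be expressed compactly: rendering a fused shader should match fusing the individually rendered images, up to numerical noise. In the sketch below, `render` and `fuse_shaders` are hypothetical helpers and a clamped additive blend is assumed purely for illustration; FSHADER's actual MRs and tolerances differ.

```python
# Minimal sketch of the shader-fusion metamorphic relation.
import numpy as np

def fuse_images(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    # Assumed fusion semantics: clamped additive blend of the two outputs.
    return np.clip(img_a.astype(np.int32) + img_b.astype(np.int32), 0, 255).astype(np.uint8)

def violates_fusion_relation(render, fuse_shaders, shader_a: str, shader_b: str,
                             tolerance: float = 2.0) -> bool:
    """Flag the compiler if the two sides of the relation diverge by more
    than a small mean per-pixel tolerance (allowing floating-point noise)."""
    fused_render = render(fuse_shaders(shader_a, shader_b))
    reference = fuse_images(render(shader_a), render(shader_b))
    diff = np.abs(fused_render.astype(np.int32) - reference.astype(np.int32))
    return float(np.mean(diff)) > tolerance
```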

Metamorphic Testing with Causal Graphs

Abstract: Metamorphic testing provides a means to generate succinct test oracles that can apply to large input spaces. For this, it depends on the formulation of metamorphic relations, which generally require extensive domain expertise and human input. To address this problem, we present a model-based testing approach that can automatically generate metamorphic relations and associated tests. Our approach is motivated by the observation that metamorphic testing is a fundamentally causal task. We show how it is possible to leverage lightweight graph-based modelling techniques from the field of causal inference to specify causal properties of the system under test. Through a series of controlled experiments, we find that the proposed approach is robust to misspecification and can test evasive causal relationships (i.e., those that are difficult to exercise and observe) when combined with an appropriate test generation strategy. We also apply the approach to two case studies from the Defects4J framework with known bugs that affect causal behaviour. The results of these case studies suggest that the approach is useful not only for catching bugs that affect causal structure, but also for alerting the user to inaccuracies in the specification.
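
One way a causal graph induces a metamorphic relation: if the specified graph contains no directed path from input X to output Y, then changing X (holding other inputs fixed) should leave Y unchanged. The sketch below illustrates that idea only; the `system_under_test` callable, the plain-dictionary graph encoding, and the exact-equality check are simplifying assumptions rather than the paper's tooling.

```python
# Minimal sketch of a "no causal path implies no effect" metamorphic check.

def has_path(graph: dict[str, set[str]], src: str, dst: str) -> bool:
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False

def check_no_effect_relation(system_under_test, graph, inputs: dict,
                             x: str, x_alt, y: str) -> bool:
    """Return True if the observed behavior is consistent with the graph."""
    if has_path(graph, x, y):
        return True  # the graph permits an effect; nothing to check here
    baseline = system_under_test(inputs)[y]
    intervened = system_under_test({**inputs, x: x_alt})[y]
    return intervened == baseline
```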

CCTest: Testing and Repairing Code Completion Systems

Abstract: Code completion, a highly valuable topic in the software development domain, has been increasingly promoted by recent advances in large language models (LLMs). To date, prominent LLM-based code completion frameworks such as GitHub Copilot and GPT are trained using deep learning over vast quantities of unstructured text and open-source code. As a paramount component and cornerstone of daily programming tasks, code completion has largely boosted professionals’ efficiency in building real-world software systems. In contrast to this flourishing market, we find that code completion systems often output suspicious results, and to date, an automated testing and enhancement framework for code completion systems is not available. This research proposes CCTEST, a framework to test and repair code completion systems in black-box settings. CCTEST features a set of novel mutation strategies, namely program structure-consistent (PSC) mutations, to generate mutated code completion inputs. It then detects inconsistent outputs, which represent possibly erroneous cases, among all the completed code cases. Moreover, CCTEST repairs the code completion outputs by selecting the output that most closely reflects the “average” appearance of all output cases as the final output of the code completion system. With around 18K test inputs, we detected 33,540 inputs that can trigger erroneous cases (with a true positive rate of 86%) from eight popular LLM-based code completion systems. With repairing, we show that the accuracy of code completion systems is notably increased by 40% and 67% with respect to BLEU score and Levenshtein edit similarity, respectively.
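
The repair step described above can be approximated as a medoid selection: among the completions produced for a set of structure-consistent mutants, pick the candidate with the smallest total edit distance to the others. The sketch below is an illustrative reconstruction of that idea, not CCTEST's exact scoring.

```python
# Minimal sketch of "repair by picking the most average-looking completion".

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def repair(completions: list[str]) -> str:
    """Select the completion with the smallest total edit distance to the
    other candidates, i.e. the medoid of the output set."""
    return min(completions,
               key=lambda c: sum(levenshtein(c, other) for other in completions))
```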

ICSE 2015

Metamorphic Model-Based Testing Applied on NASA DAT - An Experience Report

Abstract: Testing is necessary for all types of systems, but it becomes difficult when the tester cannot easily determine whether the system delivers the correct result or not. NASA's Data Access Toolkit (DAT) allows NASA analysts to query a large database of telemetry data. Since the user is unfamiliar with the data and several data transformations can occur, it is impossible to determine whether the system behaves correctly or not in full-scale production situations. Small-scale testing had already been conducted manually by other teams, and unit testing was conducted on individual functions. However, there was still a need for broad, full-scale acceptance testing. We describe how we addressed this testing problem by applying the idea of metamorphic testing [1]. Specifically, we base it on the equivalence of queries and use the system itself for testing. The approach is implemented using model-based testing in combination with a test data generation and test case outcome analysis strategy. We also discuss some of the issues that were detected using this approach.
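
The query-equivalence idea can be sketched simply: two formulations that must denote the same data (for example, filtering on the server versus retrieving everything and filtering on the client) should return identical result sets. The `run_query` callable below is a hypothetical interface to the system under test, not the DAT API.

```python
# Minimal sketch of a query-equivalence metamorphic relation.

def canonical(rows: list[dict]) -> list[tuple]:
    # Order-insensitive representation of a result set.
    return sorted(tuple(sorted(r.items())) for r in rows)

def check_equivalent_queries(run_query, query_a: str, query_b: str) -> bool:
    """Equivalent queries must yield the same rows, regardless of order."""
    return canonical(run_query(query_a)) == canonical(run_query(query_b))
```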