ICSE
ICSE 2024
EDEFuzz: A Web API Fuzzer for Excessive Data Exposures
APIs often transmit far more data to client applications than they need, and in the context of web applications, often do so over public channels. This issue, termed Excessive Data Exposure (EDE), was OWASP’s third most significant API vulnerability of 2019. However, there are few automated tools—either in research or industry—to effectively find and remediate such issues. This is unsurprising as the problem lacks an explicit test oracle: the vulnerability does not manifest through explicit abnormal behaviours (e.g., program crashes or memory access violations).
In this work, we develop a metamorphic relation to tackle that challenge and build the first fuzzing tool—that we call EDEFuzz—to systematically detect EDEs. EDEFuzz can significantly reduce false negatives that occur during manual inspection and ad-hoc text-matching techniques, the current most-used approaches.
We tested EDEFuzz against the sixty-nine applicable targets from the Alexa Top-200 and found 33,365 potential leaks—illustrating our tool’s broad applicability and scalability. In a more-tightly controlled experiment of eight popular websites in Australia, EDEFuzz achieved a high true positive rate of 98.65% with minimal configuration, illustrating our tool’s accuracy and efficiency.
ICSE 2023
MorphQ: Metamorphic Testing of the Qiskit Quantum Computing Platform
Abstract: As quantum computing is becoming increasingly popular, the underlying quantum computing platforms are growing both in ability and complexity. Unfortunately, testing these platforms is challenging due to the relatively small number of existing quantum programs and because of the oracle problem, i.e., a lack of specifications of the expected behavior of programs. This paper presents MorphQ, the first metamorphic testing approach for quantum computing platforms. Our two key contributions are (i) a program generator that creates a large and diverse set of valid (i.e., non-crashing) quantum programs, and (ii) a set of program transformations that exploit quantum-specific metamorphic relationships to alleviate the oracle problem. Evaluating the approach by testing the popular Qiskit platform shows that the approach creates over 8k program pairs within two days, many of which expose crashes. Inspecting the crashes, we find 13 bugs, nine of which have already been confirmed. MorphQ widens the slim portfolio of testing techniques of quantum computing platforms, helping to create a reliable software stack for this increasingly important field.
MTTM: Metamorphic Testing for Textual Content Moderation Softwar
Abstract: The exponential growth of social media platforms such as Twitter and Facebook has revolutionized textual communication and textual content publication in human society. However, they have been increasingly exploited to propagate toxic content, such as hate speech, malicious advertisement, and pornography, which can lead to highly negative impacts (e.g., harmful effects on teen mental health). Researchers and practitioners have been enthusiastically developing and extensively deploying textual content moderation software to address this problem. However, we find that malicious users can evade moderation by changing only a few words in the toxic content. Moreover, modern content moderation software’s performance against malicious inputs remains underexplored. To this end, we propose MTTM, a Metamorphic Testing framework for Textual content Moderation software. Specifically, we conduct a pilot study on 2, 000 text messages collected from real users and summarize eleven metamorphic relations across three perturbation levels: character, word, and sentence. MTTM employs these metamorphic relations on toxic textual contents to generate test cases, which are still toxic yet likely to evade moderation. In our evaluation, we employ MTTM to test three commercial textual content moderation software and two state-of-the-art moderation algorithms against three kinds of toxic content. The results show that MTTM achieves up to 83.9%, 51%, and 82.5% error finding rates (EFR) when testing commercial moderation software provided by Google, Baidu, and Huawei, respectively, and it obtains up to 91.2% EFR when testing the state-of-the-art algorithms from the academy. In addition, we leverage the test cases generated by MTTM to retrain the model we explored, which largely improves model robustness 0% ~ 5.9% EFR) while maintaining the accuracy on the original test set. A demo can be found in this link.
Metamorphic Shader Fusion for Testing Graphics Shader Compilers
Abstract: Computer graphics are powered by graphics APIs (e.g., OpenGL, Direct3D) and their associated shader compilers, which render high-quality images by compiling and optimizing user-written high-level shader programs into GPU machine code. Graphics rendering is extensively used in production scenarios like virtual reality (VR), gaming, autonomous driving, and robotics. Despite the development by industrial manufacturers such as Intel, Nvidia, and AMD, shader compilers - like traditional software - may produce ill-rendered outputs. In turn, these errors may result in negative results, from poor user experience in entertainment to accidents in driving assistance systems. This paper introduces FSHADER, a metamorphic testing (MT) framework designed specifically for shader compilers to uncover erroneous compilations and optimizations. FSHADER tests shader compilers by mutating input shader programs via four carefully-designed metamorphic relations (MRs). In particular, FSHADER fuses two shader programs via an MR and checks the visual consistency between the image rendered from the fused shader program with the output of fusing individually rendered images. Our study of 12 shader compilers covers five mainstream GPU vendors, including Intel, AMD, Nvidia, ARM, and Apple. We successfully uncover over 16K error-triggering inputs that generate incorrect rendering outputs. We manually locate and characterize buggy optimization places, and developers have confirmed representative bugs.
Metamorphic Testing with Causal Graphs
Abstract: Metamorphic testing provides a means by which to generate succinct test oracles that can apply to large input spaces. For this it depends on the formulation of metamorphic relations, which generally require extensive domain expertise and human input. To address this problem, we present a model-based testing approach that can automatically generate metamorphic relations and associated tests. Our approach is motivated by the observation that metamorphic testing is a fundamentally causal task. We show how it is possible to leverage lightweight graph-based modelling techniques from the field of causal inference to specify causal properties of the system-under-test. Through a series of controlled experiments, we find that the proposed approach is robust to misspecification and can test evasive causal relationships (i.e. those that are difficult to exercise and observe) when combined with an appropriate test generation strategy. We also apply the approach to two case studies from the Defects4J framework with known bugs that affect causal behaviour. The results of these case studies suggest that the approach is not only useful for catching bugs affecting causal structure, but also alerting the user to inaccuracies in the specification.
CCTest: Testing and Repairing Code Completion Systems
Abstract: Code completion, a highly valuable topic in the software development domain, has been increasingly promoted for use by recent advances in large language models (LLMs). To date, visible LLM-based code completion frameworks such as GitHub Copilot and GPT are trained using deep learning over vast quantities of unstructured text and open source code. As the paramount component and the cornerstone in daily programming tasks, code completion has largely boosted professionals’ efficiency in building real-world software systems. In contrast to this flourishing market, we find that code completion systems often output suspicious results, and to date, an automated testing and enhancement framework for code completion systems is not available. This research proposes CCTEST, a framework to test and repair code completion systems in black-box settings. CCTEST features a set of novel mutation strategies, namely program structure-consistent (PSC) mutations, to generate mutated code completion inputs. Then, it detects inconsistent outputs, representing possibly erroneous cases, from all the completed code cases. Moreover, CCTEST repairs the code completion outputs by selecting the output that mostly reflects the “average” appearance of all output cases, as the final output of the code completion systems. With around 18K test inputs, we detected 33,540 inputs that can trigger erroneous cases (with a true positive rate of 86%) from eight popular LLM-based code completion systems. With repairing, we show that the accuracy of code completion systems is notably increased by 40% and 67% with respect to BLEU score and Levenshtein edit similarity.
ICSE 2015
Metamorphic Model-Based Testing Applied on NASA DAT - An Experience Report
Abstract: Testing is necessary for all types of systems, but becomes difficult when the tester cannot easily determine whether the system delivers the correct result or not. NASA’s Data Access Toolkit allows NASA analysts to query a large database of telemetry data. Since the user is unfamiliar with the data and several data transformations can occur, it is impossible to determine whether the system behaves correctly or not in full scale production situations. Small scale testing was already conducted manually by other teams and unit testing was conducted on individual functions. However, there was still a need for full scale acceptance testing on a broad scale. We describe how we addressed this testing problem by applying the idea of metamorphic testing [1]. Specifically, we base it on equivalence of queries and by using the system itself for testing. The approach is implemented using a model-based testing approach in combination with a test data generation and test case outcome analysis strategy. We also discuss some of the issues that were detected using this approach.