ISSTA 2025
Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing
Abstract: Deep learning (DL) frameworks are essential to DL-based software systems, and bugs in frameworks can lead to serious failures, so effective testing is required. Researchers adopt DL models or single interfaces as test inputs and analyze their execution results to detect bugs. However, floating-point errors, inherent randomness, and the complexity of test inputs make it challenging to analyze execution results effectively; that is, existing methods lack suitable test oracles. Some researchers apply metamorphic testing to tackle this challenge. They design Metamorphic Relations (MRs) based on the input data and parameter settings of a single framework interface to generate equivalent test inputs, requiring consistent execution results between the original and generated inputs. Despite their promising effectiveness, these methods still face certain limitations. (1) Their MRs overlook structural complexity, limiting test input diversity; (2) their MRs focus on single interfaces, which limits generalization and necessitates additional adaptations; (3) the bugs they detect are tied to single interfaces, unlike those exposed by multi-interface combinations and execution states (e.g., resource usage), which are common in real applications. To address these limitations, we propose ModelMeta, a model-level metamorphic testing method for DL frameworks with four MRs focused on model structure and calculation logic. ModelMeta inserts external structures to generate new models with consistent outputs, increasing interface diversity and detecting bugs without requiring additional MRs. Besides, ModelMeta uses the QR-DQN strategy to guide model generation and detects bugs from the finer-grained perspectives of training loss, memory usage, and execution time. We evaluate the effectiveness of ModelMeta on three popular DL frameworks (i.e., MindSpore, PyTorch, and ONNX) with 17 DL models from 10 real-world tasks ranging from image classification to object detection. Results demonstrate that ModelMeta outperforms state-of-the-art baselines by covering 27 new combinations of multiple interfaces that existing methods fail to reach. Regarding bug detection, ModelMeta has identified 31 new bugs, of which 27 have been confirmed and 11 fixed. Among the 31 bugs are seven that existing methods cannot detect, i.e., five wrong-resource-usage bugs and two low-efficiency bugs. These results demonstrate the practicality of our method.
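To make the core idea concrete, here is a minimal sketch (our own illustration in PyTorch, not the authors' implementation) of a model-level metamorphic relation in the spirit of ModelMeta: an externally inserted, output-preserving structure exercises extra framework interfaces, while the wrapped model must keep producing the original outputs.

```python
# Hypothetical sketch of a model-level metamorphic relation: wrap a model
# with an inserted structure that is mathematically an identity, then
# require the wrapped model to produce the same outputs.
import torch
import torch.nn as nn

class IdentityBranch(nn.Module):
    """Inserted structure: x + (conv(x) * 0) exercises extra framework
    interfaces (Conv2d, mul, add) without changing the output."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv(x) * 0.0  # dead branch, output-preserving

class WrappedModel(nn.Module):
    def __init__(self, base: nn.Module, channels: int):
        super().__init__()
        self.branch = IdentityBranch(channels)
        self.base = base

    def forward(self, x):
        return self.base(self.branch(x))

base = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten(),
                     nn.Linear(8 * 32 * 32, 10))
mutant = WrappedModel(base, channels=3)

x = torch.randn(4, 3, 32, 32)
with torch.no_grad():
    ref, out = base(x), mutant(x)

# Oracle: outputs must agree up to floating-point tolerance; a larger
# deviation points at a bug in one of the exercised interfaces.
assert torch.allclose(ref, out, atol=1e-5), "metamorphic relation violated"
```

A full implementation would insert such structures at many positions under a generation strategy, and would also compare training loss, memory usage, and execution time, as the abstract describes.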
QTRAN: Extending Metamorphic-Oracle based Logical Bug Detection Techniques for Multiple-DBMS Dialect Support
Abstract: Metamorphic testing is a widely used method for detecting logical bugs in Database Management Systems (DBMSs), referred to herein as MOLT (Metamorphic-Oracle based Logical Bug Detection Technique). This technique constructs SQL statement pairs, consisting of an original query and a mutated query, and checks whether their execution results conform to predefined metamorphic relations in order to detect logical bugs. However, current MOLTs rely heavily on the grammar of a specific DBMS to generate valid SQL statement pairs, which makes it challenging to adapt them to DBMSs with different grammatical structures. As a result, only a few popular DBMSs, such as PostgreSQL, MySQL, and MariaDB, are supported by existing MOLTs, and extending them to other DBMSs requires extensive manual effort. Given that many DBMSs remain inadequately tested, there is a pressing need for a method that enables effortless extension of MOLTs across diverse DBMSs.
In this paper, we propose QTRAN, a novel LLM-powered approach that automatically extends existing MOLTs to various DBMSs. Our key insight is to use LLMs to translate the SQL statement pairs of existing MOLTs to target DBMSs for metamorphic testing. To address the challenges of LLMs' limited understanding of dialect differences and metamorphic mechanisms, we propose a two-phase approach comprising a transfer phase and a mutation phase. QTRAN tackles these challenges by drawing inspiration from a developer's process of creating a MOLT, which includes understanding the grammar of the target DBMS to generate original queries and employing a mutator for customized mutations. The transfer phase identifies potential dialects and leverages information from SQL documents to enhance query retrieval, enabling LLMs to translate original queries across different DBMSs accurately. In the mutation phase, we gather SQL statement pairs from existing MOLTs to fine-tune a pretrained model, tailoring it specifically for mutation tasks. We then employ the customized LLM to mutate the translated original queries, preserving the relations required for metamorphic testing.
We implement our approach as a tool and apply it to extend four state-of-the-art MOLTs to eight DBMSs: MySQL, MariaDB, TiDB, PostgreSQL, SQLite, MonetDB, DuckDB, and ClickHouse. The evaluation results show that over 99% of the SQL statement pairs transferred by QTRAN satisfy the metamorphic relations required for testing. Furthermore, we have detected 24 logical bugs among these DBMSs, with 16 confirmed as unique, previously unknown bugs. We believe that the generality of QTRAN can significantly enhance the reliability of DBMSs.
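For intuition, the sketch below shows the kind of SQL statement pair a MOLT checks, using a ternary-logic-partitioning-style mutation on SQLite. This is a hand-written toy, not QTRAN's pipeline; QTRAN's contribution is producing such pairs automatically for a target dialect.

```python
# Toy illustration of a MOLT oracle: the original query must return the
# same multiset of rows as the union of its three predicate partitions
# (p, NOT p, p IS NULL); a mismatch flags a logical bug in the DBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,), (5,)])

original = "SELECT a FROM t"
predicate = "a > 2"
mutated = (f"SELECT a FROM t WHERE {predicate} "
           f"UNION ALL SELECT a FROM t WHERE NOT ({predicate}) "
           f"UNION ALL SELECT a FROM t WHERE ({predicate}) IS NULL")

# Sort by repr so rows containing NULL (None) compare deterministically.
rows_orig = sorted(conn.execute(original).fetchall(), key=repr)
rows_mut = sorted(conn.execute(mutated).fetchall(), key=repr)

assert rows_orig == rows_mut, "logical bug candidate: partitions disagree"
```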
ISSTA 2023
Dependency-Aware Metamorphic Testing of Datalog Engines
Abstract: Datalog is a declarative query language with wide applicability, especially in program analysis. Queries are evaluated by Datalog engines, which are complex and thus prone to returning incorrect results. Such bugs, called query bugs, may compromise the soundness of upstream program analyzers, having potentially detrimental consequences in safety-critical settings.
To address this issue, we develop a metamorphic testing approach for detecting query bugs in Datalog engines. In comparison to existing work, our approach is based on rich precedence information capturing dependencies among relations in the program. This enables much more general and effective metamorphic transformations. We implement our approach in DLSmith, which detected 16 previously unknown query bugs in four Datalog engines.
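As a toy illustration of the dependency-aware idea (our own sketch, not DLSmith itself), the code below evaluates a tiny Datalog-style program with a naive fixpoint standing in for a real engine: because precedence information shows the query relation does not depend on a newly added relation, the transformed program must return the same query result.

```python
# Dependency-aware metamorphic transformation: adding a rule for a
# relation the query does NOT depend on must leave the query unchanged.
edge = {(1, 2), (2, 3), (3, 4)}

def transitive_closure(edges):
    """Naive fixpoint for path(a, c) :- edge(a, c); path(a, b), edge(b, c)."""
    path = set(edges)
    while True:
        new = {(a, c) for (a, b) in path for (b2, c) in edges if b == b2}
        if new <= path:
            return path
        path |= new

# Original program: the query relation `path` depends only on `edge`.
before = transitive_closure(edge)

# Transformation: add an unrelated rule tri(a) :- edge(a, a). Precedence
# information guarantees `path` does not depend on `tri`, so with a real
# engine both programs would be re-run and their `path` results compared.
tri = {a for (a, b) in edge if a == b}  # evaluated, but irrelevant to path
after = transitive_closure(edge)

assert before == after, "query bug candidate: independent rule changed result"
```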
ECSTATIC: Automatic Configuration-Aware Testing and Debugging of Static Analysis Tools
Abstract: Static analyses are powerful tools that complement dynamic approaches such as testing. To ensure generality, many static analysis tools are configurable; however, these configurations can make testing and debugging more difficult. To address this issue, we introduce a new tool, ECSTATIC, which leverages partial-order relations between analysis configuration options to automatically test and debug static analyzers, even without ground truth. ECSTATIC's results are reproducible by virtue of running within Docker containers, and ECSTATIC provides clear extension interfaces for users to add their own tools and input programs. We evaluated ECSTATIC on four popular dataflow analysis tools and found 74 bugs across the four tools. We also found that ECSTATIC's novel two-stage delta debugging reduced real-world programs by 50%, compared to 6% for a baseline.
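A minimal sketch of the underlying oracle, with hypothetical helper names of our own: if configuration c1 is documented as at least as sound as c2 under the partial order, the analysis run under c1 should report a superset of the warnings reported under c2.

```python
# Partial-order oracle for configurable static analyzers (sketch).
from typing import Callable, FrozenSet

def check_soundness_order(run_analysis: Callable[[str, str], FrozenSet[str]],
                          program: str, sounder_cfg: str, weaker_cfg: str) -> bool:
    """run_analysis(program, config) -> set of warning identifiers.
    Returns True when the documented partial-order relation holds."""
    sounder = run_analysis(program, sounder_cfg)
    weaker = run_analysis(program, weaker_cfg)
    # Violation: the supposedly sounder configuration missed a warning
    # that the weaker one reported -- a bug candidate in the analyzer.
    return weaker <= sounder

# Usage with a stub standing in for a real dataflow analysis tool:
fake_results = {("p.c", "k=2"): frozenset({"w1", "w2"}),
                ("p.c", "k=1"): frozenset({"w1"})}
assert check_soundness_order(lambda p, c: fake_results[(p, c)],
                             "p.c", sounder_cfg="k=2", weaker_cfg="k=1")
```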
ISSTA 2022
One Step Further: Evaluating Interpreters using Metamorphic Testing
Abstract: The black-box nature of Deep Neural Networks (DNNs) makes it difficult for people to understand why a network makes a specific decision, which restricts their application in critical tasks. Recently, many interpreters (interpretation methods) have been proposed to improve the transparency of DNNs by providing relevant features in the form of a saliency map. However, different interpreters may provide different interpretation results for the same classification case, which motivates us to evaluate the robustness of interpreters.
However, the biggest challenge in evaluating interpreters is the test oracle problem, i.e., it is hard to label ground-truth interpretation results. To fill this critical gap, we first use images with bounding boxes from an object detection system and images inserted with backdoor triggers as our original ground-truth dataset. Then, we apply metamorphic testing to extend the dataset with three operators: inserting an object, deleting an object, and feature-squeezing the image background. Our key intuition is that, since these three operations do not modify the primary detected objects, the interpretation results of a good interpreter should not change. Finally, we measure the quality of interpretation results quantitatively with the Intersection-over-Minimum (IoMin) score and evaluate interpreters based on the statistics of metamorphic-relation failures.
We evaluate seven popular interpreters on 877,324 metamorphic images in diverse scenes. The results show that our approach can quantitatively evaluate interpreters’ robustness, where Grad-CAM provides the most reliable interpretation results among the seven interpreters.
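Based on the abstract's description, the IoMin score can be sketched as intersection area divided by the smaller of the two region areas; the binarization threshold and mask construction below are our own assumptions, not the paper's exact procedure.

```python
# Intersection-over-Minimum (IoMin) between a binarized saliency map and
# a ground-truth object region (e.g., a bounding-box mask).
import numpy as np

def io_min(saliency: np.ndarray, truth: np.ndarray, threshold: float = 0.5) -> float:
    sal = saliency >= threshold          # binarize the saliency map
    gt = truth.astype(bool)              # ground-truth region mask
    inter = np.logical_and(sal, gt).sum()
    smaller = min(sal.sum(), gt.sum())
    return float(inter) / smaller if smaller > 0 else 0.0

# A saliency map concentrated inside the ground-truth box scores near 1.
sal = np.zeros((8, 8)); sal[2:5, 2:5] = 0.9
gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1
print(io_min(sal, gt))  # 9 / min(9, 16) = 1.0
```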
Metamorphic Relations via Relaxations: An Approach to Obtain Oracles for Action-Policy Testing
Abstract: Testing is a promising way to gain trust in a learned action policy π, in particular if π is a neural network. A “bug” in this context constitutes undesirable or fatal policy behavior, e.g., satisfying a failure condition. But how do we distinguish whether such behavior is due to bad policy decisions, or whether it is actually unavoidable under the given circumstances? This requires knowledge about optimal solutions, which defeats the scalability of testing. Related problems occur in software testing when the correct program output is not known.
Metamorphic testing addresses this issue through metamorphic relations, specifying how a given change to the input should affect the output, thus providing an oracle for the correct output. Yet, how do we obtain such metamorphic relations for action policies? Here, we show that the well explored concept of relaxations in the Artificial Intelligence community can serve this purpose. In particular, if state s′ is a relaxation of state s, i.e., s′ is easier to solve than s, and π fails on easier s′ but does not fail on harder s, then we know that π contains a bug manifested on s′.
We contribute the first exploration of this idea in the context of failure testing of neural network policies π learned by reinforcement learning in simulated environments. We design fuzzing strategies for test-case generation as well as metamorphic oracles leveraging simple, manually designed relaxations. In experiments on three single-agent games, our technology is able to effectively identify true bugs, i.e., avoidable failures of π, which has not been possible until now.
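A compact sketch of the metamorphic oracle this describes, with assumed, environment-specific hooks (rollout and relax are our placeholders, not the paper's API):

```python
# Relaxation-based oracle: if the policy avoids failure on state s but
# fails on a relaxation s' (known to be no harder than s), then the
# failure on s' is avoidable, i.e., a policy bug.
from typing import Any, Callable

def relaxation_oracle(rollout: Callable[[Any], bool],
                      relax: Callable[[Any], Any],
                      state: Any) -> bool:
    """rollout(s) -> True iff the policy reaches a failure state from s.
    relax(s) returns an easier state (e.g., with an obstacle removed).
    Returns True when a bug is detected."""
    easier = relax(state)
    fails_hard = rollout(state)
    fails_easy = rollout(easier)
    # Bug: the policy solves the harder state but fails the easier one,
    # so the failure on `easier` cannot be blamed on the environment.
    return fails_easy and not fails_hard

# Example with stub dynamics standing in for a simulated environment:
bug_found = relaxation_oracle(rollout=lambda s: s["fails"],
                              relax=lambda s: {**s, "fails": True},
                              state={"fails": False})
print(bug_found)  # True: failure appears only on the easier state
```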
ISSTA 2018
Identifying Implementation Bugs in Machine Learning Based Image Classifiers using Metamorphic Testing
Abstract: We have recently witnessed the tremendous success of Machine Learning (ML) in practical applications. Computer vision, speech recognition, and language translation have all reached near-human-level performance. We expect that, in the near future, most business applications will include some form of ML. However, testing such applications is extremely challenging and would be very expensive if we followed today's methodologies. In this work, we present an articulation of the challenges in testing ML-based applications. We then present our solution approach, based on the concept of Metamorphic Testing, which aims to identify implementation bugs in ML-based image classifiers. We have developed metamorphic relations for an application based on Support Vector Machines and for a Deep Learning based application. Empirical validation showed that our approach caught 71% of the implementation bugs in the ML applications.
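One classic metamorphic relation for such classifiers, sketched below as our own illustration rather than the paper's exact relation: consistently permuting feature order in both training and test data must leave an SVM's predictions unchanged, so any divergence signals an implementation bug rather than a modeling issue.

```python
# Feature-permutation MR for an SVM classifier (sketch using scikit-learn).
import numpy as np
from sklearn import datasets, svm

X, y = datasets.load_digits(return_X_y=True)
X_train, y_train, X_test = X[:1500], y[:1500], X[1500:]

clf = svm.SVC(random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

# Apply the same feature permutation to training and test data.
perm = np.random.default_rng(0).permutation(X.shape[1])
clf_p = svm.SVC(random_state=0).fit(X_train[:, perm], y_train)
pred_p = clf_p.predict(X_test[:, perm])

# The RBF kernel depends only on pairwise distances, which permutation
# preserves, so the predictions must match.
assert (pred == pred_p).all(), "metamorphic relation violated"
```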