Research

Today’s IT systems, and the interactions among them, are becoming increasingly complex. Power grid blackouts, airplane crashes, failures of medical devices, and runaway cruise controls are just a few examples of incidents due to component failures and unexpected interactions of subsystems under conditions that were not anticipated during system design and testing. The failure of one component may entail a cascade of failures in other components; several components may also fail independently. In such cases, determining the root cause(s) of a system-level failure and elucidating the exact scenario that led to the failure is today a complex and tedious task that requires significant expertise. In most cases, the exact cause of a failure remains unknown until a manual post-mortem analysis of logs and components is performed.

Medical devices present particular challenges in this respect. Adverse events (malfunctions that result in harm to a patient) must be reported to regulators. The regulators then work with device manufacturers to determine the cause and decide whether a recall is needed to correct the problem. The problem of fault ascription is exacerbated by the fact that a patient is typically treated with multiple devices, which can interfere with each other in various intended and unintended ways. Addressing this problem requires data logging and automated analysis of the logged data.

However, we currently lack a comprehensive formal framework for causality analysis. Existing approaches rely on a causal model being given, rather than discovering causality from a log. Moreover, they are defined on static structures and do not easily generalize to execution logs over time.

The goal of Causalysis is to advance the state of the art in determining the causes of a system failure from a log, so as to make the approach applicable to real-world safety-critical embedded systems. Its originality lies in the fact that existing approaches either require active probing of the system and rely on multiple executions, or are ad hoc in nature. Causalysis, by contrast, will establish a formal foundation for such an analysis based on a single execution trace.

Scientific progress

First year

Our current causality analysis approach takes as input a set of components constituting a system, together with their specifications, a system property P, and a vector of execution traces that violates P. Given a set of suspected components to be analyzed for causality, we remove the faulty behaviors of those components from the traces and prune the unreachable suffixes, thus building the unaffected prefixes. We then produce the counterfactuals, that is, the set of traces that are possible continuations of the unaffected prefixes and that respect the component specifications. The set of suspected components is considered a necessary cause if the counterfactuals satisfy P. This approach has some drawbacks: we do not know how to prolong the traces of faulty components, which impairs precision in some cases, and the approach only deals with black-box components. Moreover, we need the full traces to perform the analysis, which impacts scalability.
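The prefix-and-counterfactual check described above can be sketched as follows. This is an illustrative simplification, not the project's actual algorithm: traces are flat lists, `is_faulty_step` stands in for the component specifications used to detect faulty behaviors, and `extend_per_spec` stands in for the construction of spec-respecting continuations.

```python
def unaffected_prefix(trace, is_faulty_step):
    """Cut the trace just before the first faulty step; everything
    after that point may have been affected by the fault."""
    for i, step in enumerate(trace):
        if is_faulty_step(step):
            return trace[:i]
    return trace

def counterfactuals(prefix, extend_per_spec):
    """All continuations of the unaffected prefix that respect the
    component specifications (delegated here to extend_per_spec)."""
    return [prefix + suffix for suffix in extend_per_spec(prefix)]

def is_necessary_cause(trace, is_faulty_step, extend_per_spec, satisfies_P):
    """The suspected components are a necessary cause of the violation
    if every counterfactual continuation satisfies P."""
    prefix = unaffected_prefix(trace, is_faulty_step)
    return all(satisfies_P(t) for t in counterfactuals(prefix, extend_per_spec))
```

On a toy trace [1, 2, -1, 5] where negative values are faulty and P requires all values to be non-negative, the unaffected prefix is [1, 2]; if every spec-respecting continuation keeps values non-negative, the suspected component is reported as a necessary cause.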

During the first year of Causalysis, we have studied several ways of building a more powerful framework for causality analysis.

More precise analysis through separable components

The first is an enhancement of the precision of the analysis by using separable components. Currently, if a component is faulty in its unaffected prefix, the prefix is not continued during the construction of the counterfactuals, in the absence of any knowledge about the possible behaviors of the faulty component. We have introduced in [1] the concept of separable components. A separable component has the property that, given an input trace, its output trace is deterministic, even when the component is faulty. We have also revisited the concept of horizontal causality, which assesses the impact of a set of faulty components on the failure of another component. Using these two notions, we have achieved better accuracy in the analysis: we use the actual behavior of the separable components, and we assess whether a component's failure is due to itself or to the failure of other components, so as to reduce the sets of causes. We have demonstrated the utility of the extended analysis with a case study of a closed-loop patient-controlled analgesia system.
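The key property of a separable component, as defined above, can be illustrated with a small sketch (the component and its fault are invented for the example, not taken from [1]): since the output trace is a deterministic function of the input trace even under a fault, the analysis can re-run the faulty component on counterfactual inputs instead of abandoning the prefix.

```python
def faulty_scaler(input_trace):
    """An illustrative separable component: its output at each step is a
    pure function of the inputs so far. The fault is that it scales by 3
    instead of the specified 2 -- but it does so deterministically."""
    return [3 * x for x in input_trace]

def extend_with_separable(prefix_inputs, counterfactual_inputs, component):
    """Because the component is separable, its outputs can be computed on
    any counterfactual input continuation, fault included, so the prefix
    no longer has to be cut short."""
    return component(prefix_inputs + counterfactual_inputs)
```

This is what allows the extended analysis to use the actual behavior of separable components when building counterfactuals, rather than discarding every trace in which they are faulty.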

[1] S. Wang, Y. Geoffroy, G. Gössler, O. Sokolsky, and I. Lee. A hybrid approach to causality analysis. In Runtime Verification 2015, volume 9333 of LNCS, pages 250–265. Springer, 2015.

Grey-box components

The second is the use of grey-box components. By a grey-box component we understand here a black-box component whose output depends only on a bounded history of inputs, where the bound is known. For example, for a component that computes a moving average over the last 10 values, we only need to keep the last 10 values to compute its current output. Causality analysis may have to deal with large systems, or with systems running for a long period of time; the logs may therefore be very large and costly to process. Using grey-box components, in conjunction with a fault detection mechanism, we can keep only the data necessary to perform a causality analysis, without affecting the result of the analysis.
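The moving-average example from the text can be sketched directly: a bounded window is the only state the log needs to retain for such a component, since any older input cannot influence the current output.

```python
from collections import deque

class MovingAverage:
    """Grey-box component from the example above: the output depends only
    on the last n inputs, so only a window of n values must be logged."""

    def __init__(self, n=10):
        # deque(maxlen=n) silently drops the oldest value on overflow,
        # modeling the bounded history of the grey-box component.
        self.window = deque(maxlen=n)

    def step(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)
```

With the bound n known, a fault detector can truncate the stored input history to n entries per grey-box component without changing the outcome of the causality analysis.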

Robustness

We studied the concept of complete assumptions for systems whose components are specified by contracts (pairs of assumptions and guarantees). The assumptions of a set S of components are complete, with respect to a property P, if, as long as the assumptions are satisfied and the components in S are not faulty, the behavior of the components feeding data to the components in S does not impact P. This means that S makes the system robust with respect to faults of upstream components. In conjunction with grey-box components, this helps reduce the size of the logs: we do not need to log the upstream components as long as the components in S are not faulty and their assumptions are satisfied. Note that it is not always possible to build such assumptions for every set of components; nevertheless, it is generally possible to make the assumptions of certain sets of components complete. We developed an early algorithm to build complete assumptions when possible; the set of components for which to make the assumptions complete is a design choice.
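The log-reduction consequence of completeness can be stated as a simple runtime predicate. This is a sketch of the logging decision only (the representation of S as (faulty, assumption_holds) pairs is invented for the example, and the construction of complete assumptions themselves is not shown):

```python
def must_log_upstream(S_status):
    """Logging decision implied by complete assumptions: upstream
    components need to be logged only when some component in the set S
    is faulty or has its assumption violated. Otherwise, by completeness,
    upstream behavior cannot impact the property P.

    S_status: iterable of (faulty, assumption_holds) pairs, one per
    component in S (an illustrative encoding)."""
    return any(faulty or not assumption_holds
               for faulty, assumption_holds in S_status)
```

As long as `must_log_upstream` stays false, the logger can drop all data produced upstream of S, which compounds with the bounded-history savings of grey-box components.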

Second year

A game-based formalization of causality for systems composed of black- and white-box components

So far we have only considered the causality analysis of black-box components, for which a formal specification is provided while the actual implementation is not known. However, in practice, many components are designed starting from requirements in natural language, from which a model is built and then translated into code; sometimes the code is written directly. Components for which a model (or code), but no formal specification, is available are called white-box components.

Analyzing the causes of a failure exhibited by an execution trace of a single white-box system is termed fault localization and has been extensively studied in the literature. However, fault localization does not distinguish the responsibility of different components, as required, for instance, to determine the liability of component vendors. Little research has been done on causality analysis for a system of white-box components, or for systems combining black- and white-box components.

In the case where component behaviors are not fully deterministic (or appear not to be, due to partial observability), the causes of the failure may be determined as the component, or set of components, that might have prevented the failure with different choices, or that enforced the failure through its choices. Yoann Geoffroy is working on a formalization of this approach in a game-theoretic setting as part of his PhD thesis.
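A minimal exhaustive-search sketch of this game-based reading, for illustration only (the actual formalization in the thesis is not available to us; choice sets and the failure predicate are invented for the example): a component is a candidate cause if it had some choice that avoids the failure no matter how the other components choose.

```python
def could_have_prevented(own_choices, env_choices, fails):
    """Game-based causality candidate test (illustrative): the component
    is a candidate cause if some choice of its own avoids the failure
    for every possible choice of the other components (the 'environment').

    fails(c, e) -> bool: whether the combination of the component's
    choice c and the environment's choice e leads to the failure."""
    return any(all(not fails(c, e) for e in env_choices)
               for c in own_choices)
```

In game terms, this asks whether the component had a winning strategy against the property violation; symmetrically, a component that enforces the failure is one for which every choice leads to the failure regardless of the environment.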