1 Introduction

The context of this paper is a project (footnote 1) concerned with the adaptation of legacy software due to changed requirements and technical advances. Specifically, cheap and ubiquitous availability of multi-processor hardware provides a strong incentive to parallelize existing software. In the long term we aim to annotate existing sequential applications written in C with OpenMP directives [14].

We adopt a model-based approach as illustrated in Fig. 1. From given sequential C code a software model is extracted in a largely automatic fashion. The target is the modelling language ABS (Abstract Behavioral Specification) [7], an active object language [4] with a formal semantics [9]. Because ABS is formally defined and free from ambiguity, and was designed to be statically analyzable [17], software tools can be used to identify opportunities for parallelization and to generate suitable directives. In this paper we focus on the first stage: model extraction and model validation.

Fig. 1. Model-based parallelization

While abstraction of source code to a modelling language is a standard ingredient of many model checking tool chains (for example, [8]), here we pursue different goals: (1) we do not abstract away from behavior, but make non-deterministic behavior (a consequence of underspecification in C) explicit in the model; (2) non-deterministic execution sequences (footnote 2) and variable dependencies are precisely represented in a formal language and amenable to symbolic analysis; (3) the formal model with explicit non-determinism makes it possible to validate the model via automatically generated test cases and to give feedback to the author of the C code about possibly unintended ambiguity.

Our main contributions are: (1) a behavior-preserving, fully automatic translation of a large fragment of sequential C that explicitly renders all possible execution sequences in ABS, and (2) the application and adaptation of the ABS test case generator SYCO [3] to generate validation test cases. In Sect. 2 we define the C fragment that we currently support and introduce a running example. In Sect. 3 we show how we extract an outline of the model based on the declarations of global variables and functions; how we extend the function-modelling classes with the helper methods required to make non-determinism contained in C expressions within the function definition explicit in the model; and finally how we model the execution of the function call itself. In Sect. 4 we report on model validation experiments performed with our tool. Finally, we discuss related and future work in Sect. 5 and conclude in Sect. 6.

2 C-Fragment and Active Object Language

2.1 Input Language: C

The supported C fragment is closely related to MISRA-C [12], a C subset widely used in embedded systems. We do not (yet) cover all features of MISRA-C; this is not due to limitations in principle, but to the fact that our tool is a research prototype rather than a commercial product. More importantly, in contrast to MISRA-C we explicitly permit non-deterministic computations and programs with underspecified C semantics that may lead to different behavior. In fact, our goal is to make such behavior explicit, so that it can be analyzed and taken into account in the parallelization stage.

Figure 2 contains the subset of C we use as an input language to explain our model extraction process (footnote 3). A program is a list of declarations containing a function definition for main. In addition to the assignment operator \(=\), we restrict ourselves to the operator set \(\{~{+}, {-}, {*}, {==}, {!=}, {>}, {>=}, {<}, {<=}~\}\). The semantics of a program from this subset of C is that given by the C99 standard for the program. In particular, the unspecified evaluation order for side effects of assignments, as well as for the evaluation of arguments and subarguments to operators (footnote 4) and functions, is preserved. Following the standard, evaluation of all function arguments and the side effects caused by these is sequenced before the actual function call, while evaluation of arguments and side effects outside of the function call is indeterminately sequenced with respect to it (footnote 5).

Fig. 2. Syntax for a subset of C

Listing 1.1

Example 1 We consider an execution of the program in Listing 1.1. Execution of a C program always begins in the function main. First, a local variable is initialized with the value . Then the condition of the loop ( ) is evaluated. The C standard imposes no order on the evaluation of the arguments and of the operator \({>}\).

Therefore, both of the following executions conform to the standard:

  1. is called, setting to while returning the value , then is evaluated to . Finally, is evaluated to , (footnote 6) thus the condition is deemed false, the loop is exited and the program returns (the value of ).

  2. is evaluated to is called, setting to while returning the value . Finally, is evaluated to , thus the condition is deemed true and the loop entered. The expression statement is executed by evaluating the expression. It is ensured that the value and side effect of are evaluated before the function is called. Therefore is set to and is called, setting to (the value of ). Now the condition of the loop is checked again and will evaluate to regardless of evaluation order, thus exiting the loop. The program returns (the value of ).

Execution of the program is thus underspecified, due to implicit non-determinism (footnote 7).
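The effect described in Example 1 can be reproduced in miniature. The following C sketch is not Listing 1.1: the global `g` and the functions `bump`, `left_first` and `right_first` are hypothetical names chosen for illustration. Instead of writing `g > bump()` directly, it spells out the two operand orders that the standard permits as two separate functions, so that each outcome becomes deterministic and can be inspected.

```c
/* Hypothetical global that bump() modifies as a side effect. */
int g = 0;

/* Sets g to 1 and returns 0. */
int bump(void)
{
    g = 1;
    return 0;
}

/* Permitted order 1 for the expression g > bump():
   read g first, then call bump(). */
int left_first(void)
{
    g = 0;                /* reset so both variants start from the same state */
    int lhs = g;          /* reads 0 */
    int rhs = bump();     /* returns 0, g becomes 1 */
    return lhs > rhs;     /* 0 > 0 */
}

/* Permitted order 2 for the same expression:
   call bump() first, then read g. */
int right_first(void)
{
    g = 0;
    int rhs = bump();     /* returns 0, g becomes 1 */
    int lhs = g;          /* reads 1 */
    return lhs > rhs;     /* 1 > 0 */
}
```

A compiler is free to pick either order for the original expression, so a program that branches on `g > bump()` may take either branch; the two functions make that choice visible.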

2.2 Output Language: Active Objects

Languages such as Java or C feature low-level concurrency where a thread can be preempted at any time by another process running on the same processor and heap space. This leads to a myriad of possible interleavings, which cause complex data races that are hard to contain and to characterize. At the opposite end from preemptive scheduling is actor-based, distributed programming [16], where all methods are executed atomically and concurrency occurs only among distinct processors with disjoint heaps. In this scenario it is possible to specify behavior completely at the level of interfaces, typically in the form of behavioral invariants jointly maintained by an object’s methods. The drawback is that this restrictive form of concurrency forces one to model and to specify systems at a highly abstract level, essentially in the form of protocols. It precludes modeling of concurrent behavior that is closer to real programs, such as waiting for results computed asynchronously on the same processor and heap.

More recently, active object languages [4] have attempted to occupy a middle ground between preemption and full distribution. We focus on ABS [9], which is based on cooperative scheduling and has been used to model complex, industrial concurrent systems [2]. Cooperative scheduling implies that tasks cannot be preempted, but they may explicitly and voluntarily suspend their execution to allow a required result to be provided by another task: concurrent methods on the same processor and heap cooperate with each other to achieve a common goal.

The ABS language construct realizing this behavior has the form await f?, where f is a reference (called a future) to the result of a method that may not have completed. Its effect is that the current task suspends itself and only resumes once the value of f is available. However, besides the task computing f’s value there might be other tasks waiting for execution at this point. The order in which these waiting tasks are scheduled is not determined. Since they share the same memory, data races among them are possible.

Crucially, since the only ABS statement that can suspend execution is await, data races are localized in that they can only occur at await statements (or at the start of a method). Likewise, since all ABS methods run uninterruptedly either to completion or until they encounter an await statement, only the final state reached at the end of a method or before an await statement needs to be known when analyzing local data races. Hence, it suffices to reason about a very specific form of data race at a few, explicitly specified code locations.

Listing 1.2

Given a program from our C subset we extract an \(\text {ABS}_{lite}\) model from it. Figure 3 shows the syntax of \(\text {ABS}_{lite}\) (footnote 8). For a brief overview of the semantics of \(\text {ABS}_{lite}\), consider the model in Listing 1.2. The main block at the end is executed when the model is run. A new object of class is created with an initial value of 5 for the implicitly defined field . Then two asynchronous calls are made to the object : one call to add 2 to the field and one call to return the value of field . An asynchronous call immediately returns a future value, which can be polled through an await statement to see if the method call has returned. The await statement ensures that no further code in the main block is executed until both asynchronous calls have returned. In the meantime the active object has received the two asynchronous calls. It begins to execute one of these calls. Once that call has returned, it will execute the other. Depending on the order in which it executes these calls, the value returned by is either 5 or 7. The get expression returns the value of a future, blocking if necessary until the value is available. Here the await ensures that the return value from the call to is available. It is stored in the local variable . Through the explicit non-determinism (footnote 9) of active objects (realized by the two asynchronous calls) the value of is underspecified.
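The two outcomes of this model can be traced by hand. The C sketch below is only an analogue of the ABS model, not a rendering of Listing 1.2: `struct cell`, `cell_add`, `cell_get` and `observed_value` are hypothetical names. It assumes, as in cooperative scheduling, that each of the two queued tasks runs to completion without interruption, so the only freedom left is which task the object picks first.

```c
/* Hypothetical stand-in for the active object: a single integer field. */
struct cell {
    int value;
};

/* Task 1: add n to the field. */
void cell_add(struct cell *c, int n)
{
    c->value += n;
}

/* Task 2: return the current value of the field. */
int cell_get(const struct cell *c)
{
    return c->value;
}

/* Runs both queued tasks to completion in the chosen order and returns
   the value that the "get" task observed. The field starts at 5. */
int observed_value(int add_runs_first)
{
    struct cell c = { 5 };
    int seen;

    if (add_runs_first) {
        cell_add(&c, 2);
        seen = cell_get(&c);   /* observes 7 */
    } else {
        seen = cell_get(&c);   /* observes 5 */
        cell_add(&c, 2);
    }
    return seen;
}
```

In both schedules the field ends at 7; only the value seen by the reading task differs, which is exactly the underspecification the paragraph describes.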

Fig. 3. Syntax for \(\text {ABS}_{lite}\)

3 Model Extraction

An overview of the model extraction process is given in Fig. 4. Each function definition is modelled as a class, while each executing function call is modelled as an active object of that class. Evaluation of (sub)expressions and side effects takes place in asynchronous method calls to the same active object, while await statements at which forked asynchronous calls are joined model the sequencing rules of the C standard. If a function is called multiple times (whether recursively or iteratively), each of these calls is modelled by its own active object. As all functions have access to the global variables (footnote 10), a single active object, to which all other active objects have access, is used to model the state of all global variables. Blocking calls to the global object are used to access or modify the global variables. Additionally, blocking calls are used to pass control from one function call to a nested function call being executed, as the C standard ensures that subexpressions and side effects outside of a function call are indeterminately sequenced with respect to it and, therefore, cannot occur during execution of the function call.

Fig. 4. Overview of model extraction

3.1 Modelling Global Variable Declarations and Initial Call to Main

Given a program p we construct the model shown in Listing 1.3. The function \( extractFunctions \) is described in Sect. 3.3 and \( extractGlobalVars \) is defined to create a class , which contains all global variables with their initial values as fields, with getter and setter methods for these fields:

Listing 1.3

In the main block, we create an active object of class and pass this to an active object modelling the program entry. Whenever new active objects modelling function calls are created, we pass the object along, such that every modelled function call has access to the global variables. As an example, Listing 1.4 shows the extracted Global class from Listing 1.1.
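The idea of routing every global access through one shared object can be sketched in C itself. This is an illustration of the design only, not the extracted ABS class of Listing 1.4: the global `g`, `struct global`, `struct call_frame` and all function names are hypothetical. Each modelled function call holds a pointer to the single shared state and touches it only through the getter and setter.

```c
/* Hypothetical program state: one global variable named g. */
struct global {
    int g;
};

/* Getter and setter, the only way the frames below access g. */
int global_get_g(const struct global *gl)
{
    return gl->g;
}

void global_set_g(struct global *gl, int v)
{
    gl->g = v;
}

/* One modelled function call: a reference to the shared globals
   plus its own local state. */
struct call_frame {
    struct global *globals;
    int local;
};

/* A modelled function body: copies g into a local, increments g,
   and returns the value it read. */
int frame_increment(struct call_frame *f)
{
    f->local = global_get_g(f->globals);
    global_set_g(f->globals, f->local + 1);
    return f->local;
}
```

Because both frames in a program share the same `struct global`, an update made by one call is visible to the next, mirroring how the global object is passed along to every newly created active object.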

Listing 1.4

3.2 Modelling Unspecified Evaluation Order Within Expressions

Evaluating an expression in C can exhibit unspecified behavior because the C standards (as opposed to, e.g., the Java language specification) do not fix an evaluation order for subexpressions and side effects. To correctly model this unspecified behavior, we take advantage of the explicit non-determinism of active objects with respect to the execution order of asynchronous calls. Execution of a function call in C is modelled by an active object executing its method. Within this method multiple asynchronous calls can be made to other methods of this active object followed by an await statement, such that these other methods can be executed in a non-deterministic fashion.

Definition 1

A tuple \(( stmts , se , futVar ) \in ( VarDecl_a ^* \times VarId_a ^* \times VarId_a )\), where \( se \) contains only local variables of type declared in \( stmts \) (the side-effects of the evaluated expression) and \( futVar \) is a local variable of type declared in \( stmts \) (the value of the evaluated expression) is defined as an expression wrapper (footnote 11). The set of all expression wrappers is defined as \(\mathbb {EW}\).

We define the function \( convert \) in Fig. 5, which converts a C expression into an expression wrapper recursively, where are fresh unused identifiers, \(e_i \in Expr_c , z \in \mathbb {Z}, lv \in LocalId_c , \textit{gv}\in GlobalId_c , ( stmts _i, se _i, x_i) = convert (e_i), {\oplus } \in Operator \) and \(f \in FuncId_c \).

Fig. 5. The function \( convert : Expr_c \rightarrow \mathbb {EW}\)

Fig. 6. Families of required helper methods

As can be seen in the function \( convert \), asynchronous calls to various methods of the current active object are made. The active object classes generated from a C function are thus required to implement the subset of methods in Fig. 6 which are used in the converted expression wrappers of all expressions contained in the function definition.

Side effects are created only by assignments, while the side effects of an operator’s operands are gathered and passed upwards. A function call has no side effects in this sense (footnote 12), but rather introduces a sequence point between the evaluation of the function arguments, including any side effects produced therein, and the function call itself. For this reason the call to \( call \_f\_m\) contains the future values for all side effects of the function arguments, in addition to the arguments themselves. This allows an await statement to ensure that all side effects are completed before the actual call to the function is modelled by creating a new active object of the appropriate type and calling its method.
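The sequence point before a function call can be observed directly in C. In the following minimal sketch the global `x` and the functions `observe` and `call_with_assignment` are hypothetical. The argument expression `x = 3` has a side effect, and because argument evaluation is sequenced before the call, the callee is guaranteed to see the updated global.

```c
/* Hypothetical global written by an argument expression. */
int x = 0;

/* Encodes what the callee sees: the global x in the tens digit
   and the argument in the units digit. */
int observe(int arg)
{
    return x * 10 + arg;
}

/* The assignment x = 3 is evaluated as the argument; its value (3) is
   passed and its side effect is complete before observe() starts. */
int call_with_assignment(void)
{
    x = 0;
    return observe(x = 3);   /* callee sees x == 3 and arg == 3 */
}
```

This is the guarantee that the await on the side-effect futures reproduces in the model: the new active object for the callee is only created once every pending assignment from the arguments has completed.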

3.3 Modelling Function Definitions as Classes

The function \( extractFunctions \) called in Listing 1.3 extracts \(\text {ABS}_{lite}\) classes modelling C function definitions and is defined in Fig. 7, together with \( extractFunction \) and \( extractLocalVars \). Here \(( stmts ', se '_1 \cdots se '_n, x') = convert (e)\) and \( extractStmts \) (and helper functions \( extract \) and \( varDeclToAssign \)) are defined in Fig. 8.

Fig. 7. The functions \( extractFunctions \), \( extractFunction \) and \( extractLocalVars \)

Function parameters are modelled as class parameters (which are implicit fields), while local variables are modelled as explicit fields of the class. This allows access to them as required from the helper methods. For this reason a local variable declaration needs to be treated twice: once by creating a field to model this local variable and assigning it a witness term (Int lv = 0;) in \( extractLocalVars \) and once by modelling the initial value for the local variable by assignment (this.lv = x'.get;) in \( extract \).

Fig. 8. The functions \( extractStmts \), \( extract \) and \( varDeclToAssign \)

Treating loops introduces an additional wrinkle: while in C the condition of a while loop can contain side effects, in ABS this is not possible. For this reason the auxiliary statements in the expression wrapper required to calculate the value of the pure expression must be performed twice: once before the loop and once at the end of the loop body before re-evaluating the condition. We re-use the local variables declared in the auxiliary statements by replacing local variable declarations with assignments in \( varDeclToAssign \).
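The loop rewriting can be illustrated at the C level. The sketch below is an assumption-laden analogue rather than the output of \( extractStmts \): `take`, `loop_original` and `loop_rewritten` are hypothetical names. The first loop keeps the side effect in its condition; the second evaluates the condition into a variable once before the loop and again at the end of each iteration, reusing the same variable instead of redeclaring it.

```c
/* Hypothetical condition with a side effect: decrements *n and
   reports whether it is still positive. */
int take(int *n)
{
    *n -= 1;
    return *n > 0;
}

/* Original shape: the side effect lives in the loop condition. */
int loop_original(int n)
{
    int iterations = 0;
    while (take(&n)) {
        iterations += 1;
    }
    return iterations;
}

/* Rewritten shape: the loop tests a plain variable. */
int loop_rewritten(int n)
{
    int iterations = 0;
    int cond = take(&n);      /* first evaluation, before the loop */
    while (cond) {
        iterations += 1;
        cond = take(&n);      /* re-evaluation: assignment, not a new declaration */
    }
    return iterations;
}
```

Both functions perform the same number of condition evaluations in the same positions relative to the body, so they agree for every input.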

4 Experiments

We developed an Eclipse plugin, C2ABS, which extracts an ABS model from a given C program, following the translation approach described in the previous sections (footnote 13). To validate an extracted model we analyze it with SYCO (footnote 14), a systematic tester for ABS concurrent objects. The SYCO kernel includes state-of-the-art partial-order reduction techniques to avoid redundant computations during testing [3]. Two runs of an ABS program with the same main method are redundant relative to each other when no difference in the scheduling of tasks can lead to a data race. This property is undecidable in general, so SYCO safely under-approximates redundant computations.
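Why many schedules are redundant can be seen with a brute-force enumeration. The following C sketch is not SYCO's algorithm and does not perform partial-order reduction; it simply runs all orderings of three hypothetical tasks (`t_inc`, `t_double`, `t_set_y`) on a hypothetical two-field state and counts the distinct final states. Two of the tasks write the same field and therefore do not commute, while the third is independent of both.

```c
struct state {
    int x;
    int y;
};

/* Three hypothetical tasks; t_inc and t_double conflict on x,
   t_set_y touches only y. */
static void t_inc(struct state *s)    { s->x += 1; }
static void t_double(struct state *s) { s->x *= 2; }
static void t_set_y(struct state *s)  { s->y = 7; }

typedef void (*task_fn)(struct state *);

/* Runs every ordering of the three tasks from the initial state
   {x = 1, y = 0}. Returns the number of distinct final states and
   stores the number of schedules explored in *schedules. */
int count_distinct_final_states(int *schedules)
{
    static const task_fn tasks[3] = { t_inc, t_double, t_set_y };
    static const int perms[6][3] = {
        {0, 1, 2}, {0, 2, 1}, {1, 0, 2},
        {1, 2, 0}, {2, 0, 1}, {2, 1, 0}
    };
    struct state seen[6];
    int distinct = 0;

    for (int p = 0; p < 6; p++) {
        struct state s = { 1, 0 };
        for (int i = 0; i < 3; i++) {
            tasks[perms[p][i]](&s);
        }

        /* Record the final state only if it has not been seen before. */
        int known = 0;
        for (int k = 0; k < distinct; k++) {
            if (seen[k].x == s.x && seen[k].y == s.y) {
                known = 1;
            }
        }
        if (!known) {
            seen[distinct] = s;
            distinct += 1;
        }
    }

    *schedules = 6;
    return distinct;
}
```

Six schedules collapse to two final states, (x = 4, y = 7) and (x = 3, y = 7), because only the relative order of the two conflicting tasks matters. A reduction technique aims to explore one representative per such class instead of every schedule.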

Table 1 contains C programs that contain expressions with unspecified evaluation order. The programs two-unspec, Schrödinger and one-to-fib are based on an idea by Derek Jones (footnote 15), where the C standard allows two-unspec to return either 1 or 2, Schrödinger tests if two calls to two-unspec are equal and one-to-fib(n) returns a value between 1 and the n-th Fibonacci number. Too many false positives are often a problem with static code checkers, so no-reliance is a test case which does not rely on unspecified evaluation order, calculating the same result despite different execution paths. Finally, assign-chain returns , where returns the sum , to test unspecified evaluation order of side effects.

We compared the result of model extraction with C2ABS followed by analysis with SYCO to program analysis using Cerberus (footnote 16), a tool for developing a semantic model for a substantial fragment of C [11]. It takes an approach similar to ours by cross-compiling C into a Lisp dialect and performing analysis on that program. Table 1 contains the number of explored states during analysis and the total time spent for the SYCO web interface. The Cerberus web interface has a 45-second timeout and does not give exact run times. We also show the different possible results for the programs and the number of execution paths deemed different by the tools. In the case of SYCO, it shows only those executions that lead to a different configuration after partial order reduction [1].

Table 1. Model validation with SYCO compared to program analysis with Cerberus

While Cerberus times out after 45 seconds for one-to-fib(4), SYCO manages to completely validate the model extracted by C2ABS in less than a second. SYCO recognizes that there are only 4 different paths in the Schrödinger model, while Cerberus claims 98. Most interesting are the different results for assign-chain: here the difference seems to be that Cerberus assumes the order of the side effects is fixed (first assign z, then y, then x) and only allows the evaluation of f() to interleave. However, this does not match the C standard, which clearly states that the evaluation order of side effects is unspecified. Our model faithfully reflects this, allowing the side effects and function call to occur in any order, resulting in additional possible results.

In addition to the C programs where SYCO could fully analyze the extracted model, we considered programs where the extracted model caused SYCO to time out after 45 seconds when attempting to analyze all possible execution paths. The one-to-fib function for inputs greater than 4 is such a case, as well as a nested loop example with 10,000 inner iterations. Partial validation of these larger models was possible by enabling constraints in SYCO to only consider certain paths, and by using a simulation tool that creates an Erlang program from an ABS model and executes that (footnote 17). With these we can partially validate one-to-fib with inputs up to 19 in less than 10 seconds.

5 Related and Future Work

We discussed the Cerberus tool in the previous section. Apart from it, there is not much published work on model extraction. The SPIN model checker contains the model extractor Modex from C to ProMeLA [8]. Unfortunately, we did not manage to get it to work on our examples. MISRA-C is a well-known subset of the C language widely used in the development of safety-critical systems [13]. One of its rules checks whether the value of an expression is the same under any order of evaluation that the standard permits. It stipulates that no unspecified behavior is caused by the order of evaluation of subexpressions. There are several, mostly commercial, static code analyzers equipped with a MISRA-C compliance checker, for example , Astrée [6], Polyspace (footnote 18), Axivion Bauhaus Suite [15], and ECLAIR (footnote 19). All of these are based on abstract interpretation [5]. Also, some compilers like Green Hills, IAR, TASKING and TI are equipped with a MISRA-C compliance checker. In contrast to MISRA-C compliance checkers, we also want to analyze and detect non-compliant behavior, and we give detailed feedback to the developer about differing computations.

In the future we intend to add operators that introduce sequencing (in particular the ternary operator), as well as tracking of sequencing information to recognize undefined behavior, such as changing a value multiple times between sequence points. We will also extend the types C2ABS can deal with. ABS has a formally defined semantics [9], while a semantics for C is given by the K framework (footnote 20), allowing a formal proof of the correctness of the translation in the future. Common continuation region analysis [10] allows recognizing and optimizing asynchronous calls which can be performed in parallel. Parallelization potential found in the ABS model could then be transferred back to the C program.

6 Conclusion

We described how to extract an ABS model from a C program to make the implicit non-deterministic behavior explicit. There exist a number of tools built to analyze ABS models [17], because the language was designed to be analyzable. This will help us extend the ABS toolbox with tools built to localize parallelizable parts of the model and thus give feedback to the C developers. We implemented our model extraction approach and validated the models thus extracted using SYCO. In doing so, we have found differences in results between our modelling of the C standard and that chosen by developers of the related tool Cerberus. We feel confident that our results are correct. Our approach also seems to scale better. Additionally, we found areas where SYCO can be optimized and relayed this to the developers.