Lingua Diabolis | Dreams in CodeQL - Quest for the Perfect GOTO

Dreams in CodeQL - Quest for the Perfect GOTO

On 21. February Qualys Threat Research Unit (TRU) posted the details of two vulnerabilities of OpenSSH to the Full Disclosure and oss-security mailing lists. The starting point of the presented research was a typical error handling pattern in the OpenSSH code base. TRU's idea was that the pattern is easy to get wrong, so it's worth to look for variants that may result in undefined behavior potentially leading to exploitable vulnerabilities. The message included a CodeQL query with the following comment:

Warning: our rudimentary CodeQL query (below) might hurt the eyes of experienced CodeQL programmers; if you, dear reader, are able to write a query that runs faster or produces less false positives, please post it to the public oss-security mailing list!

While I won't call myself an "experienced CodeQL programmer" I really enjoy unusual bugs and solving code analysis problems, so I decided to take on this challenge. Before I got to this problem Jordy Zomer also posted a solution, so this post will be a good opportunity to learn from different approaches.

Understanding Qualys' Query

While I highly recommend to read the original post by Qualys, as a reminder here's the sample code they provided to demonstrate the potential issue:

/*1387*/ int
/*1388*/ sshkey_to_base64(const struct sshkey *key, char **b64p)
/*1389*/ {
/*1390*/         int r = SSH_ERR_INTERNAL_ERROR;
/*....*/
/*1398*/         if ((r = sshkey_putb(key, b)) != 0)
/*1399*/                 goto out;
/*1400*/         if ((uu = sshbuf_dtob64_string(b, 0)) == NULL) {
/*1401*/                 r = SSH_ERR_ALLOC_FAIL;
/*1402*/                 goto out;
/*1403*/         }
/*....*/
/*1409*/         r = 0;
/*1410*/  out:
/*....*/
/*1413*/         return r;
/*1414*/ }

In summary, the potential problem is that if the developer forgets to set the value of r at line 1401, a return value (indicating success) produced at line 1398 may be returned by the function despite the detected error condition.

The query posted by Qualys is the following:

from Function f, ReturnStmt r, LocalVariable v, IfStmt i, Stmt b
where f.getBlock().getLastStmtIn() = r and
      exists(VariableAccess a | a.getTarget() = v and not a.isModified() and a.getEnclosingStmt() = getRecChild(r)) and
      v.getType() instanceof IntType and
      i.getEnclosingFunction() = f and
      i.getThen() = b and
      (b instanceof GotoStmt or b instanceof BlockStmt and ((BlockStmt)b).getLastStmtIn() instanceof GotoStmt) and
      not exists(Assignment a | a.getEnclosingFunction() = f and a.getLValue().toString() = v.getName() and a.getEnclosingStmt() = getRecChild(i)) and
      not exists(Assignment a | a.getEnclosingFunction() = f and a.getLValue().toString() = v.getName() and a.getEnclosingStmt() = getRecChild(b)) and
          exists(Assignment a | a.getEnclosingFunction() = f and a.getLValue().toString() = v.getName() and
            a.getLocation().toString() != v.getDefinitionLocation().toString() and
            a.getBasicBlock().getANode() = getRecPredecessor(i.getBasicBlock().getANode()))
select f, b.getLocation().toString()

To find optimizations we first have to understand what this original code does, so I started out by commenting each line. The following subsections describe what I figured out, line-by-line.

Variable Declarations

The from clause simply defines the variables/nodes we're about to use, but single-letter variable names can be a bit unclear:

Function f - The function that contains all the code we want to analyze. We only care about "local" code structures, we won't analyze cross-function data.
ReturnStmt r - The return statement that is affected by our return variable.
LocalVariable v - The local variable that is to be returned.
IfStmt i - An if conditional that gotos to the return statement without setting v. I'll refer to such conditions as "faulty if"s throughout this writeup.
Stmt b - This is either the goto statement inside i or the body block of the i conditional that ends in a goto.

The above descriptions were of course deduced from the where clause, let's see what that says:

I. Filtering Returns

f.getBlock().getLastStmtIn() = r

This condition filters the return values that are last in the function's body code block. In theory this can lead to false negatives as functions can return from any line, but in practice error handling is usually written at the end of functions, so this looks like a reasonable optimization.

II. Connecting the Variable and the Return Statement

exists(VariableAccess a | a.getTarget() = v and not a.isModified() and a.getEnclosingStmt() = getRecChild(r))

This condition states that our return variable v must be accessed in some way in the return statement r. The getRecChild() is a custom recursive predicate that in this case represents all children of the return statement:

Stmt getRecChild(Stmt s) {
  result = s or
  result = getRecChild(s.getChildStmt())
}

III. Restricting Return Variable Type

v.getType() instanceof IntType

This simple condition restricts the return variable to integers. While other code bases that rely on the same error handling pattern may use different data types, this conditions should be easy enough to tweak.

IV. Tying the Conditional to the Function

i.getEnclosingFunction() = f

This simple statement ties the if statement to the function.

V. Tying GOTO to the Conditional

i.getThen() = b and
(b instanceof GotoStmt or b instanceof BlockStmt and ((BlockStmt)b).getLastStmtIn() instanceof GotoStmt)

The first condition states that b should be on the "then" branch of our i conditional. While CodeQL's cpp module has a dedicated GotoStmt class, b was declared with the more generic Stmt (statement) type, so the second conditional can match both:

Single goto statements in the if body (e.g. if(...) goto fail;)
A sequence of statements that end in a goto (e.g. if(...) { foo(); goto fail;})

VI. Excluding Conditions that Set the Return Variable

not exists(Assignment a | a.getEnclosingFunction() = f and a.getLValue().toString() = v.getName() and a.getEnclosingStmt() = getRecChild(i))

This condition uses the custom getRecChild() predicate to exclude if statements that assign a value to the v local variable in the conditional (condition or body blocks). Note that string-based comparisons shouldn't be necessary in a condition like this. Later we will see the shortcoming of similar constructs and a cleaner solution.

VII. Excluding Branches that Set the Return Variable

not exists(Assignment a | a.getEnclosingFunction() = f and 
                          a.getLValue().toString() = v.getName() and 
                          a.getEnclosingStmt() = getRecChild(b)) and

This one is very similar to the previous, except it checks if the return value is not assigned after the goto statement. Based on my tests this condition has no effect as getChildStmt() used in the getRecChild() predicate only operates within the statement passed to the predicate, so it won't traverse to the basic block(s) after the target label (which I assume the original intention might have been).

VIII. Return Value Reassignment

exists(Assignment a | 
    a.getEnclosingFunction() = f and 
    a.getLValue().toString() = v.getName() and
    a.getLocation().toString() != v.getDefinitionLocation().toString()
    a.getBasicBlock().getANode() = getRecPredecessor(i.getBasicBlock().getANode())
)

This one took me a while! Once again, we look for Assignments that set the value of our return variable v. We also use the custom recursive predicate getRecPredecessor() that collects all predecessors of i in the control-flow graph:

ControlFlowNode getRecPredecessor(ControlFlowNode n) {
  result = n or
  result = getRecPredecessor(n.getAPredecessor())
}

We check if any of the preceding basic blocks assign a value to v, and we make an exception with the variable's original definition. In short we are looking for functions that assign a new value to the return variable after its definition but before the faulty if condition is executed.

Summary of Qualys' Query

The analyzed conditions can be summarized with the following "visual" code pattern:

int X(/*...*/){ // We have a function X that (probably) returns int
  int v = CONST_FOO; // (III.) We have local variable in X that is definitely an int
  //...
  v = CONST_BAR; // (VIII.) Our local variable is reinitialized in X after its definition, 
                 // before the faulty conditional
  //...
  if (not_assigning_v()) { // We have an (faulty) if conditional in X (IV.) that doesn't set v (VI.)
    // ... 
    goto fail;  // The if body ends with a goto (V.)
  }
  // ...
  fail:
    // ...
    return BAZ(v); // The last statement is a return that uses v (I., II.)

}

Optimized Solution

As I was figuring out the nuts and bolts of the original query I also executed it on the OpenSSH 9.9.p1 code that is discussed by Qualys to see if I can reproduce their results. I also executed the query on SQLite (a very "tidy" code base) and OpenSSL (a pretty large project). While the original results were easily reproducible, analyzing OpenSSL showed that the query is inefficient on larger targets - I gave up after an hour spent in getRecPredecessor().

Based on these observations I set the following goals for myself:

Reproduce the original results as closely as possible. Qualys mentions that they identified no false negatives with manual analysis, which seems like an important property of the original query.
Optimize the query so it executes on OpenSSL in "reasonable" time.

My first idea was to translate the original CodeQL query to Semgrep. Semgrep excels at simple and fast AST matching, so it seemed like a good choice to overcome performance issues (Semgrep also has data-flow and taint tracking features, but I never had much luck with those). The original query also described the structure of the target code rather than more abstract data- or control-flow (even the recursive control-flow predicates were used to describe the ordering of statements), so there were no obvious blockers to the translation either.

But as I started to work on my Semgrep query I noticed a growing number of edge cases making the query highly convoluted with multiple levels of metavariable nesting. This was the point when I realized that we're actually dealing with a data-flow problem: we want to reason about the data held by our return variable from its initialization to the last return statement.

Data-Flow Analysis

In CodeQL such problems can be modeled with local data-flows:

Local data-flows only consider a single function, sparing significant resources compared to cross-function global flows.
Data-flows track value-preserving operations. This approach makes more sense than taint-tracking, where the analyzer also follows operations where the original data only affects some other value. It would be quite strange to perform e.g. arithmetic operations on a status code.

My query started out like this:

from DataFlow::Node sinkRet, DataFlow::Node sourceDef, ReturnStmt ret, LocalVariable v,
where
  // There is an untainted dataflow from a variable assignment to the return
  sourceDef.asExpr() = v.getAnAssignment() and
  sinkRet.asExpr() = ret.getAChild() and
  DataFlow::localFlow(sourceDef, sinkRet)

A useful property of getAnAssignment() is that it doesn't return the initializer statement of the variable, sparing some of the implementation present in VIII.

These conditions by themselves don't make much sense as they just specify that we are interested in data-flows from local variable assignments to a return expression. In this attempt I used these statements as a core to build my data-oriented query and added fragments of the original query to shape the results to match my goals.

Ground Work

Let's introduce other main actors of our little play in the where clause:

Function f - The containing function.
LabelStmt gotoLabel - The label of the goto target.
IfStmt i - The faulty if conditional.
Stmt goto - The if body containing the goto or the goto itself, leading to the return statement.

With these, we can immediately express some basic conditions that will limit the search space of the analyzer and tie other variables to the Function (note that we don't know what indexes CodeQL uses internally, but it just makes sense to discard a large chunk of nodes from the search early e.g. on type mismatch):

// Function return type
f.getType() instanceof IntType and
// Variable return type (matches Function return type)
v.getType() instanceof IntType and
// IfStmt is in Function
i.getEnclosingFunction() = f and
// v is in the same Function
v.getFunction() = f and 
// Only named nodes 
// This is required because some fallthrough paths are interpreted as anonymous labels by CodeQL. 
gotoLabel.isNamed()

I also reused some conditions directly from the original query:

  // The return statement includes v (II.)
  exists(VariableAccess a | a.getTarget() = v and 
                            not a.isModified() and 
                            a.getEnclosingStmt() = getRecChild(ret))

Tying Control- and Data-Flows

So far our data-flow is independent from any conditionals. I used gotoLabel as the "glue" between the data-flow and "faulty" conditionals. First, we tie the label of the goto to the return sink:

Stmt getRecSuccessor(BasicBlock s) {
  result = s or
  result = getRecSuccessor(s.getASuccessor())
}

// return statement is in the labels basic block 
getRecSuccessor(gotoLabel.getBasicBlock()) = ret.getBasicBlock() and

I defined a new recursive function that will discover all basic blocks reachable (based on the CFG) after the return label. My assumption here is that labels for a return won't have many basic blocks, so this predicate will be fast in real-world targets.

Then I tweaked condition V. from the original query so I can narrow the set of if conditionals to ones that jump to the return label:

  // The if statement executes our goto statement or block 
  i.getThen() = goto and
               ( (goto instanceof GotoStmt and ((GotoStmt)goto).getName()=gotoLabel.getName()) or 
                 (goto instanceof BlockStmt and ((BlockStmt)goto).getLastStmtIn() instanceof GotoStmt and 
                 ((GotoStmt)(((BlockStmt)goto).getLastStmtIn())).getName() = gotoLabel.getName())
               )

This condition looks complex, but if you erase the type casts in your head it's actually quite simple, following the same logic as V.

With these two conditions we can get if statements that jump to the return label that is part of a relevant data-flow. Next task was to filter conditionals that spoil our fun by clobbering the return value.

Filtering Well-Behaving Conditionals

Condition VI. - used to exclude ifs that modify the return value - caused an unexpected amount of head scratching. In the end I ended up with this condition:

  // v is not accessed in i 
  not exists(VariableAccess a | a.getEnclosingFunction() = f and a.getTarget() = v and 
                            a.getEnclosingStmt() = getRecChild(i)) and

The first thing to notice is that I use the VariableAccess instead of the original Assignment type for filtering if conditions.

When we want to capture a statement like a = b; both VariableAccess and Assignment classes seem like reasonable choices, and the relevance of this split was unclear to me initially. Then I encountered a false positive result similar to the following:

if (some_condition()){
  ++return_value;
  goto end;
}

The problem with this result is that the conditional changes the return value without being an actual assignment. This can be easily solved by using the VariableAccess class and its isModified() predicate to check if our return variable is touched in the body of the conditional. However, I went a step further and excluded any access to the variable in the conditional to prevent false positives like this:

r=some_helper();
if (r == STATUS_FAIL) 
  goto return_error;

This is clearly a deviation from the original logic, but as we will see in the Evaluation section, it doesn't seem to have negative effects.

Precedence?

The original query prescribed the return value to be (re)assigned before the control-flow hits a faulty if, but we haven't included this factor in the new query yet. As usual, I tested the query continuously to see what works and what doesn't. At this stage I noticed that results given for OpenSSH were really close to the true positives identified by Qualys, so I took a closer look to see what patterns result in noise. I found that some functions set their return value after the return label like this:

function foo(){
  // ...
  return_label:
    r = some_condition() ? 1:0;
    return r;
}

The lines after the label match our data-flow condition and since there are no constraints on precedence, any if's before the label will match too, yielding false positives. As a quick fix I added the following condition that makes sure that the return variable isn't modified after the return label:

  // Note: This doesn't handle if v is used as an output argument to a function
  not exists (VariableAccess a | a.getTarget() = v and a.isModified() and a.getEnclosingStmt().getBasicBlock() = getRecSuccessor(gotoLabel))

Surprisingly, with this condition I managed to get a result really close to my original target ("sometimes science is more art than science")! I'll go into details and ponder why this thing works in the Summary subsection. I haven't spent more time on this code as Jordy Zomer's solution gave me some new ideas. In the next main section I'll briefly analyze Jordy's solution, then present my final query.

Summary of the Optimized Solution

The logic of my optimization attempt can be summarized with the following pseudo-code:

/*    */ int X(/*...*/) {
/*D.F.*/   // ...
/*a/- */   v = CONST_FOO; // v is assigned here ...
/* |  */   // ...
/* |  */   if (not_using_v()) { // This is our query result
/* |  */     // v is not used here 
/* |  */     goto fail;
/* |  */   }
/* |  */   // ...
/*b|- */   v = CONST_FOO; // ... or here ?!
/* |  */   // ...
/* |  */   fail:
/* |  */     // ... No assignment to v here ... 
/*  ->*/     return BAZ(v);
/*    */ }

This illustration also helps explain how this work-in-progress piece of code could yield meaningful results. Our concern is that we identify an if that is unrelated to the result of the function, e.g. some logging macro where we don't want to report errors yet. But notice that v is not assigned after the fail label, yet our if does jump to that label. Since the if doesn't assign the return value either, v must have some value by the time the conditional is executing (otherwise we found an uninitialized variable use that is also a win). This is only possible if v is assigned before the if. The only quirk is that this assignment can be the initialization of the variable: since OpenSSH initializes return variables to safe error values, this means false positives. In other code bases, like OpenSSL, (where API schematics are not that consistent...) we can still stumble upon interesting results. E.g. some functions in OpenSSL don't return an error code but the number of errors encountered, so mistakenly keeping the 0 initialization value can cause unexpected behavior - but this is a story for a different post...

The query looks like this in whole:

Stmt getRecChild(Stmt s) {
  result = s or
  result = getRecChild(s.getChildStmt())
}

Stmt getRecSuccessor(BasicBlock s) {
  result = s or
  result = getRecSuccessor(s.getASuccessor())
}

from DataFlow::Node sinkRet, DataFlow::Node sourceDef, ReturnStmt ret, LocalVariable v, LabelStmt gotoLabel, Function f, IfStmt i, Stmt goto, Assignment ass
where 
  // Function return type
  f.getType() instanceof IntType and
  // Variable return type matches Function return type
  v.getType() = f.getType() and
  // IfStmt is in Function
  i.getEnclosingFunction() = f and
  // v is in the same Function
  v.getFunction() = f and 
  // Only named nodes (e.g. no default returns)
  gotoLabel.isNamed() and
  v.getAnAssignment() = ass and
  // The return statement includes v
  exists(VariableAccess a | a.getTarget() = v and not a.isModified() and a.getEnclosingStmt() = getRecChild(ret)) and
  // return statement is in the labels basic block 
  getRecSuccessor(gotoLabel.getBasicBlock()) = ret.getBasicBlock() and
  i.getThen() = goto and // The if statement executes our goto statement or block
       ((goto instanceof GotoStmt and ((GotoStmt)goto).getName()=gotoLabel.getName()) or 
        (goto instanceof BlockStmt and ((BlockStmt)goto).getLastStmtIn() instanceof GotoStmt and 
         ((GotoStmt)(((BlockStmt)goto).getLastStmtIn())).getName() = gotoLabel.getName())) and // goto is either a single GOTO statement
                                                                                               // or a block that ends with GOTO 
                                                                                               // and goto label matches the return label
  // v is not accessed in i
  not exists(VariableAccess a | a.getEnclosingFunction() = f and a.getTarget() = v and 
                            a.getEnclosingStmt() = getRecChild(i)) and    
  // v is not modified in the basic block of the label
  // Note: This doesn't handle if v is is used as an output argument to a function
  not exists (VariableAccess a | a.getTarget() = v and a.isModified() and a.getEnclosingStmt().getBasicBlock() = getRecSuccessor(gotoLabel)) and
  // There is an untainted dataflow from a variable assignment to the return
  // Note: This will miss ternary assignments in return if returned expressions don't include v  (`return ret==8?0:1;`) 
  sourceDef.asExpr() = ass and
  sinkRet.asExpr() = ret.getAChild() and
  DataFlow::localFlow(sourceDef, sinkRet)
select i, 
     "Function: "+f+" - Label: "+gotoLabel.getName()+" - Variable: "+v+"- Assignment: "+ass.getLocation().getFile().getBaseName()+"("+ass.getLocation().getStartLine()+":"+ass.getLocation().getStartColumn()+")"

As you can see, this is basically a mammoth query, and even the few predicates present could be replaced with transitive closures. Let's see Jordy's code already, who shows us how things should be done!

Jordy Zomer's Solution

Based on his code Jordy Zomer approached the problem from ground-up instead of trying to tweak the original query. The code is also well-structured, demonstrating the expertise Jordy has in this field and also helping in making sense of the query.

Fist we see a class definition that will help us find flows that return "OK":

class CallReturningOK extends FunctionCall {
    CallReturningOK() {
     exists(ReturnStmt ret | this.getTarget() = ret.getEnclosingFunction() and ret.getExpr().getValue().toInt() = 0)
    }
}

Note that the assumption here is that success status ("OK") is represented by the value 0, which is a restriction compared to the original query.

The next class is the gist of the code based on the StackVariableReachability superclass:

class GotoWrongRetvalConfiguration extends StackVariableReachability {
    GotoWrongRetvalConfiguration() { this = "GotoWrongRetvalConfiguration" }

    // Source is an assigment of an "OK" return value to an access of v
    // To not get FP's we get a false successor
    override predicate isSource(ControlFlowNode node, StackVariable v) {
        exists(AssignExpr ae, IfStmt ifst | ae.getRValue() instanceof CallReturningOK
        and v.getAnAccess() = ae.getLValue() and  ifst.getCondition().getAChild() = ae  and
        ifst.getCondition().getAFalseSuccessor() = node)

    }

    // Our intermediate sink is a `goto _` statement, but it should have a successor that's a return of `v`
    override predicate isSink(ControlFlowNode node, StackVariable v) {
        exists(ReturnStmt ret | ret.getExpr() = v.getAnAccess() and
        node instanceof GotoStmt and node.getASuccessor+() = ret.getExpr())
    }

    // We don't want `v` to be reassigned
    override predicate  isBarrier(ControlFlowNode node, StackVariable v) {
        exists(AssignExpr ae | ae.getLValue() = node and v.getAnAccess() = node)
    }
}

I didn't know StackVariableReachability before, probably because it is not really discussed outside the API docs. This class exposes predicates similar to ones available for global data-flow analysis, but reachability is analyzed on the control-flow graph. Since comments mostly explain what this class does, I'll just go over some less obvious details here.

We should take a look at isSink() first as it will be important to understand isSource() - the comment here describes the predicate well enough.

isSource() limits return value assignments to ones happening inside if condition - this is also stricter than the original query. To understand the "false successor" requirement it's easiest to use some pseudo-code:

/* BB0 */ if (v=foo()) 
/* BB1 */ {
/*     */   //...
/*     */   goto return_label;
/*     */ }
/* BB2 */ bar();
/* ... */ // ...
/* BBn */ return_label:
/*     */   return v;

Return value assignment happens in BB0. The true path is BB1, that also contains a goto that is a sink, so if we included true successors, we'd end up with a BB1->BB1 path. This would be a false positive, because v is assigned "in the context of" the goto (either in BB0 or earlier in BB1), most likely producing the expected return value.

The class also defines an isBarrier() predicate that excludes all control-flow paths that assign a value to the return variable v.

All there's left is to make use of these types in a query:

from ControlFlowNode source, ControlFlowNode sink, GotoWrongRetvalConfiguration conf, Variable v, Expr retval
where
// We want a call that can `return 0` to reach a goto that has a ret of `v` sucessor
conf.reaches(source, v, sink)
and
// We don't want `v` to be reassigned after the goto
not conf.isBarrier(sink.getASuccessor+(), v)
// this goes from our intermediate sink to retval
and sink.getASuccessor+() = retval
// Just making sure that it's returning v
and exists(ReturnStmt ret | ret.getExpr() = retval and retval = v.getAnAccess())
select retval.getEnclosingFunction(), source, sink, retval

Note that the little trick in isSource() hides the faulty if. To match results with ones of the original query I added an IfStmt to the final query with the following condition:

from IsStmt faultyIf // ...
where
// ...
and faultyIf = sink.getEnclosingStmt().getEnclosingBlock().getParent()
select faultyIf

Refined Solution

Having spent considerable amount of time compiling this writeup I started to recognize common features of the three queries and some of their inefficiencies too. I realized, that while my idea of using data-flow is not bad, it'd be a mistake to think about control-flow analysis as a "post-processing" step. Instead, I should identify data- and control-flow sub-problems, and solve each one with the appropriate tool. With this in mind the high-level pattern looks surprisingly simple:

+-------+
|       |
|def. v +----Control---+
|       |              |
+---+---+         +----v----+
    |             |faulty if|
    |             +----+----+
   Data                |
    |                  |
+---v-----+            |
|         |            |
|return v <---Control--+
|         |
+---------+

As mentioned earlier, local data-flow analysis is a perfect tool to track the undisturbed flow of v
The "faulty if" must be connected with control-flow to the source and sink of the data-flow
- This implies precedence
- Data-flow analysis ensures that the return value is not modified on the identified control path

The resulting query further demonstrates the idea:

from ControlFlowNode sourceInit, ControlFlowNode sinkIf, InitToFaultyIfConfiguration confInitIf, LocalVariable v, ReturnStmt ret,
DataFlow::Node sinkRet, DataFlow::Node sourceDef
where confInitIf.reaches(sourceInit, v, sinkIf) and // A local variable reaches a faulty if in the CFG
  isProperReturn(v, ret) and // the variable is used as part of the return
  sourceDef.asExpr() = v.getAnAssignment() and // we are looking for data-flows from the return variable ...
  sinkRet.asExpr() = ret.getAChild() and // ... to the retrun statement
  DataFlow::localFlow(sourceDef, sinkRet) // we are only interested in value-preserving, local data-flows
select sourceInit, sinkIf, sinkIf.getLocation().getFile().getBaseName()+":"+sinkIf.getLocation().getStartLine()+":"+sinkIf.getLocation().getStartColumn()

We have a control-flow tracking configuration, inspired by Jordy's use of StackVariableReachability. We also use DataFlow::localFlow() for data-flow analysis. The two flows are tied together by the local variable (v). StackVariableReachability tracks the same variable that is tracked by localFlow().

Here's the commented source of the control-flow tracking configuration:

class InitToFaultyIfConfiguration extends StackVariableReachability {
    InitToFaultyIfConfiguration() { this = "InitToFaultyIfConfiguration" }

    override predicate isSource(ControlFlowNode node, StackVariable v) {
        // We are interested in all variable accesses
        // Note: Initializers are not VariableAccess!
        exists(VariableAccess ae | v = ae.getTarget() and ae.isModified() and ae = node) 

    }

    override predicate isSink(ControlFlowNode node, StackVariable v) {
        exists(ReturnStmt ret, GotoStmt goto, IfStmt i | node = i and // Source is an IfStmt
               i.getEnclosingFunction() = v.getAnAssignment().getEnclosingStmt().getEnclosingFunction() and // IfStmt and StackVariable are in the same function
               isProperReturn(v,ret) and // Return statement accesses v
               notChildOfIf(v, i) and // return variable not part of this if statement
               ifToGoto(i, goto) and // if leads to goto
               goto.getASuccessor+()=ret // goto leads to relevant return
        ) 
    }

    override predicate  isBarrier(ControlFlowNode node, StackVariable v) {
        exists(VariableAccess ae | v = ae.getTarget() and ae.isModified() and ae = node)  
    }
}

The source predicate is pretty simple: we just take any node that assigns a new value to a stack variable.

The sink predicate is more tricky, because it establishes the control-flow link from "faulty if" to the return. Following the logic described at the "optimized" solution we also exclude ifs that access the stack variable in any way. Since the ret return statement here is not necessarily the same as we found in the where clause we use the custom isProperReturn() predicate again to make sure the jump is relevant.

The code in isBarrier() is redundant with data-flow constraints, but since this is a mandatory predicate to implement I left it in the code.

The custom predicates are defined as follows:

// variable is not mentioned inside if
predicate notChildOfIf(Variable v, IfStmt i){
    not exists (VariableAccess a | a.getTarget()=v  and i.getEnclosingStmt().getAChild*()=a.getEnclosingStmt())
}

// if leads to goto on its true path
predicate ifToGoto(IfStmt i, GotoStmt g){
    g.hasName() and
    i.getThen().getAChild*()=g
}

// Return statement accesses v
predicate isProperReturn(Variable v, ReturnStmt ret){
    exists(VariableAccess a | a.getTarget() = v and not a.isModified() and a.getEnclosingStmt() = ret.getChildStmt*())
}

While these predicates look simple enough now, most of my time was spent on debugging them! CodeQL's API terminology is not trivial, and I'll probably dedicate a whole other writeup about what I found out during development, but this post is long enough already, so let's jump to Evaluation (edit: I published a CodeQL Cheat Sheet to document tricky bits)!

Evaluation

Collected Data

I executed Qualys' original query ("Original"), the result of my optimization attempt ("Optimized"), Jordy Zomer's query ("Jordy"), and my final solution ("Refined") on the OpenSSH 9.9.p1 release. I took the 13 results of true positives reported by Qualys as a baseline, and counted false negatives and false positives. I also measured query execution times (as reported by the CodeQL query server) with freshly started query server instances to discount caching - that being said, take performance indicators with a grain of salt, as I only performed a small number of measurements.

Query	True Positives	"False" Positives	False Negatives	Run time (Wall-clock)
Original	13/13	37	0	60.0 s
Optimized	12/13	4	1	14.0 s
Jordy	5/13	0	8	16.6 s
Refined	12/13	18	1	6.4 s

To have a better idea about the complexities of the four queries I also timed the execution against OpenSSL. Again, this is not a rigorous benchmark, just a sample to give you an idea about the efficiency of the queries:

Query	Run time (Wall-Clock)	Results (with duplicates)
Original	> 3600 s (canceled after 5100 cycles in getRecPredecessor()	N/A
Optimized	175.9 s	71
Jordy	12.9 s	7
Refined	17.5 s	135

Conclusions

So what can we make of this data? Can we announce a winner? I think program analysis is more complicated than that.

For example, Jordy used perfectly reasonable, yet target-specific constraints that obviously limited true positives. But during practical use these constraints are easy to tweak to a new target, while they effectively limit the search space (e.g. by looking only for a single return value), increasing performance.

That being said, based on the data it's quite obvious that significant performance optimization was achieved while maintaining result quality. The true positive result missed by both of my queries is the process_ext_session_bind() function (ssh-agent.c:1742) and has the following return statement:

return r == 0 ? 1 : 0;

This is most likely not recognized because r is not used directly as a value of the return, but rather as part of a conditional "hidden" in the ternary expression. Extending the query to cover this case is an exercise for the reader :)

To finish things up, here are some of my high-level conclusions from this (not so) little exercise:

It's important to determine if a (sub)problem relates to control-flow, data-flow, or can even be solved by simple AST matching, so we can choose the most effective tools for our analysis.
Imperfect analysis can still yield interesting results, so it's worth to experiment.
It'd be great if we could use barriers in local data-flow analysis. If you have an idea for the implementation, please let me know!

I hope this case-study will help people climb CodeQL's learning curve - I surely did learn a lot in the process!

Code is on GitHub.

posted at 13:37 · Blog

Mar 06, 2025