1. Abstract

Problem

  • Concurrent programs are error-prone because of:
    • interleavings of threads
    • state explosion
    • computational complexity
  • Precise inter-thread pointer analysis is difficult because of:
    • thread interference (inter-thread data dependencies)
    • difficulty in exploring only the necessary interleaving space for specific properties

Proposed Method: Canary

  • A value-flow analysis framework for concurrent programs
    • Build an inter-thread VFG (value-flow graph)
    • Convert concurrency bug detection into a source-sink reachability problem

Result

  • Detected 18 previously unknown concurrency bugs in well-known open-source projects

2. Goal: Detect concurrency bugs correctly

bool cond;

// Thread 1
void thread1() {
  int **x = malloc(); // alloc_x
  int *a  = malloc(); // alloc_a
  *x = a;             // alloc_x -> alloc_a
  fork(thread2, x);

  if(cond) {
    int *c = *x; // c -> {alloc_a, alloc_b}
    print(*c);   // May-use alloc_b...?
  }
}
//

// Thread 2
void thread2(int **y) {
  int *b = malloc(); // alloc_b

  if (!cond) {
    *y = b;          // alloc_x -> alloc_b
    free(b);         // free alloc_b
  }
}
Q. In this code, does a use-after-free occur? A. No: if(cond) in Thread 1 and if (!cond) in Thread 2 cannot both hold, so free(b) and print(*c) never execute in the same run.
To answer this question correctly, we must know two things:
  • (Inter-thread) data dependencies
    • e.g., the dependence between b and c
  • Realizable paths (control flow)
    • e.g., whether the order 1. free(b), 2. print(*c) is realizable

Detection Target: src-sink reachability

Canary detects concurrency bugs via src-sink properties:
  • There is a data dependence between src and sink.
  • There is a realizable path from src to sink.
  • Concurrency bugs
    • Inter-thread use-after-free
    • NULL pointer dereference
    • Double free
    • Information leak
    • Race condition (order violation)
// Thread 1 (src)   // Thread 2 (sink)

// use-after-free
free(x);
                    use(x);

// race condition
                    if (i < N)
i++;
                    return arr[i];

3. Proposed Method: Canary

Workflow of Canary

  1. (Intra-thread) Data Dependence Analysis
  2. Interference Dependence Analysis
    • a.k.a. Inter-thread Data Dependence Analysis
  3. Source-Sink Checking
    • check whether the value-flow paths are realizable

Preliminaries

Memory Consistency Model

We assume that the shared memory system is sequentially consistent.

  • Sequentially Consistent(SC)
    • All memory operations appear to take effect in program order.
    • All memory operations are atomic (there is a single global order of operations).
  • This assumption matches programmers' natural expectations.
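As a minimal illustration (my own sketch, not from the paper), sequential consistency forbids the reader below from observing flag == 1 while missing the earlier write to data:

#include <assert.h>

int data = 0, flag = 0;   // shared variables

// Thread A
void writer(void) {
  data = 42;   // (1)
  flag = 1;    // (2) under SC, (1) takes effect before (2)
}

// Thread B
void reader(void) {
  if (flag == 1)           // if the effect of (2) is visible...
    assert(data == 42);    // ...then (1) must be visible too: never fails under SC
}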

Language, Abstract Domain

Language:

Program
\( P := F^+ \)
Function
\( F := func(v_1, v_2, ..., v_n) \)
Statement
\[ \begin{array}{lcl} S & := & v_1 = v_2 \mid v_1 = \&v_2 \\ & \mid & v_1 = *v_2 \mid *v_1 = v_2 \\ & \mid & v_1 = v_2 \text{ binop } v_3 \mid v_1 = \text{unop } v_2 \\ & \mid & \text{if } (v) \text{ then } S_1 \text{ else } S_2 \mid \text{return } (x_0, ..., x_n) \\ & \mid & (x_0, ..., x_n) = \text{call } f(v_1, ..., v_n) \mid S_1; S_2 \\ & \mid & \text{fork}(t, f) \mid \text{join}(t) \end{array} \]
binop
\( := + | - | \land | \lor | > | = | \neq | ... \)
unop
\( := - | \lnot | ... \)

Abstract Domain

Threads
\( t \in T \)
Labels
\( l \in L \)
Objects (Address-taken variables)
\( o \in O \)
Variables
\( v \in V \)
Features:
  • No loops
    • Loops and recursive functions are unrolled (twice).
  • Side effects on function parameters
    • Auxiliary variables are introduced for objects passed into a function by reference.
  • Partial SSA form
    • The direct flow of top-level variables \( v \) is explicit.
    • We focus on the indirect flow through address-taken variables \( o \).
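A small C illustration of the distinction (my own sketch): the flow of top-level variables is explicit, while flows through address-taken objects must be recovered with points-to information.

void partial_ssa_example(void) {
  int x = 1;      // x's address is taken below, so x is an address-taken object o
  int *p = &x;    // p is a top-level variable v: its direct def-use flow is explicit
  *p = 2;         // indirect store: which object is written depends on Pts(p)
  int y = *p;     // indirect load: the flow 2 -> y through o is what the VFG recovers
  (void)y;
}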

3.1 Data Dependence Analysis: Construct VFG

Our goal:
  • Resolve intra-thread data dependence edges
  • Compute Guard condition
  • 🗝 Points-to facts of address-taken variables

Steps

  1. Intra-procedural path-sensitive pointer analysis
  2. Procedural Transfer Function Computation
  3. VFG Construction

1. Intra-procedural path-sensitive pointer analysis

        \begin{algorithm}
        \caption{Intra-procedural Pointer Analysis}
        \begin{algorithmic}

        \PROCEDURE{PathSensitivePTA}{function F}
          \FORALL{instruction I in F}
            \IF{$ l, \phi : p = alloc $}
              \STATE $ p \rightarrowtail (\phi, alloc) $
              \COMMENT{$\rightarrowtail$: points-to, $\phi$: path-condition}
            \ELSEIF{$ l, \phi : p = q $}
              \STATE $ p \rightarrowtail (\gamma \land \phi, o) , \forall (\gamma, o) \in Pts(q) $
            \ELSEIF{$ l, \phi : *x = q $}
              \IF{$ Pts(x) $ is a singleton}
                \STATE $ IN_l \leftarrow (IN_l \backslash Pts(x)) $ 
                \COMMENT{strong update}
              \ENDIF
              \STATE $ OUT_l \leftarrow IN_l \cup \{ Pts(x) \rightarrowtail (\gamma \land \phi, o ) \}, \forall (\gamma, o) \in Pts(q)$
            \ELSEIF{$ l, \phi : p = *y $}
              \STATE Let $o \in Pts(y) \cap IN_l$
              \STATE $ p \rightarrowtail (\gamma \land \phi, o'), \forall (\gamma, o') \in Pts(o)$
            \ELSEIF{$ l, \phi : (x_0, ..., x_n) = call f(v_1, ..., v_n) $}
              \STATE $ x_i \rightarrowtail (\phi, Trans(F, Pts(v_i))), \forall i \in [1, n]$
              \STATE \COMMENT{$Trans$: Summary of F}
            \ELSEIF{$ l, \phi : fork(t, f)$}
              \STATE continue
            \ENDIF
          \ENDFOR
        \ENDPROCEDURE

        \end{algorithmic}
        \end{algorithm}
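A small C illustration of the strong-update case in the algorithm above (my own sketch, not from the paper):

#include <stdlib.h>

void strong_update_example(void) {
  int *p = malloc(sizeof(int));   // p -> alloc_p
  int *q = malloc(sizeof(int));   // q -> alloc_q
  int **x = &p;                   // Pts(x) = {p}: a singleton
  *x = q;                         // strong update: the old fact p -> alloc_p is
                                  // killed, leaving only p -> alloc_q
  // If Pts(x) contained several pointers, the analysis could not tell which one
  // is written, so the old facts would have to be kept (a weak update).
}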
    

Example: Thread1

// IN = {}
x = malloc();
// x -> (true, alloc_x)
a = malloc();
// a -> (true, alloc_a)
*x = a;
// OUT = {alloc_x -> (true, alloc_a)}
if (cond) {
  // IN = {alloc_x -> (true, alloc_a)}
  c = *x;
  // c -> (cond, alloc_a)
  print(*c);
}
void thread2(int **y)
// y -> (true, alloc_x)
b = malloc(); // alloc_b
if (!cond) {
  *y = b;
  free(b);
}
Q. What are the points-to relations in Thread2? A.
y -> (true, alloc_x)
b -> (true, alloc_b)
alloc_x -> (!cond, alloc_b)

2. Procedural Transfer Function Computation

Summarize $F$'s points-to side-effects.

Example: thread2()

// y -> o1 -> o2
void thread2(y) {
  b = malloc();   // alloc_b
  if (!cond) {
    *y = b;
    free(b);
  }
}
// y -> o1 -> alloc_b

Summary: $Trans(F, Pts(o1)) =$ {alloc_b}

3. VFG Construction

Basic rules for data dependencies

\[ \begin{array}{l|l|l} \text{Instruction} & \text{Value Flow} & \text{Guard Condition} \\ \hline l,\phi: p = alloc & alloc \rightarrow p & \phi \\ l,\phi : p = q & q \rightarrow p & \phi \\ l_1, \phi_1 : *x = q; \;\; l_2, \phi_2 : p = *y & q \rightarrow p \text{ if } \exists o.\ (\alpha, o) \in Pts(x) \land (\beta, o) \in Pts(y) & \phi_1 \land \phi_2 \land \alpha \land \beta \end{array} \]

Example: Value-Flow Graph

bool cond;

// Thread 1
void thread1() {
  int **x = malloc(); // alloc_x
  int *a  = malloc(); // alloc_a
  *x = a;             // alloc_x -> alloc_a
  fork(thread2, x);

  if(cond) {
    int *c = *x; // c -> {alloc_a, alloc_b}
    print(*c);   // May-use alloc_b...?
  }
}
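Applying the rules above to Thread 1 (my own enumeration; guards in parentheses), the intra-thread value-flow edges are:

  • \( alloc\_x \rightarrow x \) (true)
  • \( alloc\_a \rightarrow a \) (true)
  • \( a \rightarrow c \) (\( cond \)), via the store \( *x = a \) and the load \( c = *x \), since \( Pts(x) = \{(true, alloc\_x)\} \)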

3.2 Interference Dependence Analysis: Construct Inter-thread VFG

Our goal:
  • Resolve interference dependence edges
  • Compute Guard condition
  • 🗝 Alias relationship between different threads

Steps

  1. Escaped (Shared) Object Computation
  2. Dependence Edges Computation
  3. Dependence Guard Computation
  4. Update until we reach a fixed-point

1. Escaped Object Computation

Objects pointed to by escaped objects are also escaped.

  • $EspObj$: the set of escaped (shared) objects
  • $Pted$: the variables that point to an object, with guards
    • ex) $p \rightarrowtail o, Pts(p) = \{o\}, Pted(o) = \{p\} $
      \begin{algorithm}
      \caption{Escaped Object Computation}
      \begin{algorithmic}

        \PROCEDURE{EscapeAnalysis}{VFG}
          \STATE Initialize $EspObj$
          \COMMENT{objects passed to the fork calls}

          \FORALL{node $n$ in VFG}
            \IF{$n : *x = q$}
              \STATE Let $ x \rightarrowtail o$ and $ q \rightarrowtail o'$
              \IF{$o \in EspObj$}
                \STATE $EspObj \leftarrow EspObj \cup \{ o' \}$
              \ENDIF
            \ENDIF
          \ENDFOR

          \FORALL{escaped object $o$ in $EspObj$}
            \STATE Find the set of nodes $N$ that are reachable from $o$ in the VFG
            \STATE Let $\sigma$ be the aggregated guard of the edges along the traversal
            \FORALL{node $n$ in $N$}
              \STATE $Pted(o) \leftarrow Pted(o) \cup \{(\sigma, n)\}$
            \ENDFOR
          \ENDFOR

          \RETURN $EspObj$ and their $Pted$
        \ENDPROCEDURE

      \end{algorithmic}
      \end{algorithm}
    

Example: alloc_b in Thread2

void thread2(int **y)
// EspObj = {alloc_x}
b = malloc(); // alloc_b
if (!cond)
  // y -> alloc_x
  // b -> alloc_b
  *y = b;
  // EspObj += alloc_b

Results of EscapeAnalysis:

\begin{array}{l} EspObj = \{ alloc\_x, alloc\_b \} \\ Pted(alloc\_x) = \{ (true, x), (true, y) \} \\ Pted(alloc\_b) = \{ (!cond, b) \} \end{array}

2. Dependence Edges Computation

      \begin{algorithm}
      \caption{Dependence Edges Computation}
      \begin{algorithmic}

      \FORALL{store node $l_1, \phi_1 : *x = q$ in $t \in T$}
        \FORALL{load node $l_2, \phi_2 : p = *y$ in thread $t' \in T$, $t' \neq t$}
          \IF{$(\alpha, x), (\beta, y) \in Pted(o)$ for some $o \in EspObj$}
            \STATE $\Phi_{Guard} = GuardComputation()$
            \STATE $l_1: *x = q \longrightarrow_{\Phi_{Guard}} l_2: p = *y$ in VFG
          \ENDIF
        \ENDFOR
      \ENDFOR

      \end{algorithmic}
      \end{algorithm}
    

Example:

// Thread 1   // Thread 2
c = *x; <---- *y = b;
// Pted(alloc_x) = {x, y}

3. Dependence Guard Computation

\( \Phi_{Guard} = \Phi_{alias} \land \Phi_{ls} \)
  • $\Phi_{Guard}$: Dependence Guard Constraint
    • $\Phi_{alias}$: Alias Constraint
    • $\Phi_{ls}$: Load-Store Constraint
Alias Constraint
$\Phi_{alias} = \phi_1 \land \phi_2 \land \alpha \land \beta$ where:
  • $l_1, \phi_1 : *x = q \longrightarrow l_2, \phi_2 : p = *y$
  • $ x \rightarrowtail (\alpha, o)$
  • $ y \rightarrowtail (\beta, o)$
  • i.e., both path conditions ($\phi_1, \phi_2$) and both points-to guards ($\alpha, \beta$) must hold

Example

  • $cond: c = *x;$
  • $!cond: *y = b;$
  • $x \rightarrowtail (true, alloc\_x)$
  • $y \rightarrowtail (true, alloc\_x)$

$ \therefore \Phi_{alias} = cond \land !cond \land true \land true$

Load-Store Constraint
$\Phi_{ls} = O_s < O_l \land \bigwedge_{s' \in S(l) \setminus \{s\}} \left( O_{s'} < O_s \lor O_l < O_{s'} \right)$
  • $s$: the store at $l_1$
  • $l$: the load at $l_2$
  • $S(l)$: the set of stores that may reach the load $l$
  • $O_s < O_l$: the store happens before the load (a strict partial order)
  • $s'$: any other store in $S(l)$
  • $O_{s'} < O_s \lor O_l < O_{s'}$: no other store falls between $s$ and $l$
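As a worked instance of the formula (with hypothetical labels, not tied to the example program): for the store $s$, the load $l$, and a single other store $s'$ to the same object, the constraint unfolds to

$\Phi_{ls} = O_s < O_l \land (O_{s'} < O_s \lor O_l < O_{s'})$

i.e., $s$ executes before $l$, and $s'$ either precedes $s$ or follows $l$, so it cannot overwrite the stored value in between.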

Example

$ \therefore \Phi_{ls} = O_b < O_c $

Guard Constraint

\( \Phi_{Guard} = \Phi_{alias} \land \Phi_{ls} \)

Example

$ \therefore \Phi_{Guard} = cond \land !cond \land O_b < O_c $

4. Update until we reach a fixed-point

The new interference edges may introduce new $Pts$ relations (e.g., $c \rightarrowtail alloc\_b$).
We therefore re-run the following steps until a fixed point is reached:

  • Intra-thread data dependence
  • Interference dependence
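For the running example, combining the interference edge's guard from above (my own composition of the facts stated earlier), the second iteration yields

\( c \rightarrowtail (cond \land !cond \land O_b < O_c, \; alloc\_b) \)

which creates the candidate use-after-free pair free(b) $\rightarrow$ print(*c) that is checked in the next step.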

3.3 Source-Sink Checking: Check Realizable Value-Flow Path

Once the interference-aware VFG is built, some concurrency bugs can be reduced to source-sink reachability problems over the VFG.
  • ex) Inter-thread use-after-free bug
    • src: free statement
    • sink: use statement

In this section, we address two problems to achieve precision and efficiency:

  • How do we rule out value-flow paths that violate the program order?
  • How do we solve the constraints for value-flow paths efficiently?

Partially-ordered Value Flows Searching

To check whether a value-flow path is realizable, we collect the guard constraints $\Phi_{Guard}$ along the path. \begin{eqnarray} \\ && \pi = \langle v_1@l_1, v_2@l_2, ..., v_k@l_k \rangle \\ && src = v_1@l_1, sink = v_k@l_k \\ && (v_i@l_i, v_{i+1}@l_{i+1}) \text{ is either a data-dependence or an interference-dependence edge} \end{eqnarray}

Example

\begin{eqnarray} \\ && \pi = \langle free(b), b, c, print(*c) \rangle \\ && \Phi_{Guard}(\pi) = cond \land !cond \land O_b < O_c \end{eqnarray}

However, the ordering implied by $\pi$ may violate the program order.

Bad Example

\begin{eqnarray} \\ && \pi = \langle 2, 3, 4, 1 \rangle \\ && \Phi_{Guard}(\pi) = O_2 < O_3 \land O_3 < O_4 \land O_4 < O_1 \end{eqnarray}

This path $\pi$ is realizable...?

Partial Order Constraints: Intra-thread Program Order

$\Phi_{po}(\pi) = \bigwedge \{ O_{l_i} < O_{l_j} \mid l_i, l_j \in \pi \text{ are in the same thread and } l_i \text{ precedes } l_j \text{ in program order} \}$

  • This constraint encodes the intra-thread program order; statements from different threads are left unordered.

Example

\begin{eqnarray} \\ && \pi = \langle 2, 3, 4, 1 \rangle \\ && \Phi_{Guard}(\pi) = O_2 < O_3 \land O_3 < O_4 \land O_4 < O_1 \\ && \Phi_{po}(\pi) = O_1 < O_2 \land O_3 < O_4 \end{eqnarray}

This path $\pi$ is not realizable!!
$ \because O_2 < O_3 \land O_3 < O_4 \land O_4 < O_1 \land O_1 < O_2 = false$

Summary: All Constraints
$\Phi_{all}(\pi) = \Phi_{Guard}(\pi) \land \Phi_{po}(\pi) $
  • $\Phi_{Guard}(\pi)$: Guard constraint via value-flow paths
  • $\Phi_{po}(\pi)$: Intra-thread program order constraint

Example

\begin{eqnarray} \\ && \pi = \langle free(b), b, c, print(*c) \rangle \\ && \Phi_{Guard}(\pi) = cond \land !cond \land O_b < O_c \\ && \Phi_{po}(\pi) = O_c < O_{print} \land O_b < O_{free} \\ && \Phi_{all} = \Phi_{Guard} \land \Phi_{po} \end{eqnarray}

Constraint Solving using SMT Solver

Finally, we check src-sink reachability by solving $\Phi_{all}$ with an SMT solver (a small encoding sketch follows the example below).
To solve the constraints efficiently, Canary applies three optimizations:

  • Filter out apparent contradictions before invoking the solver
    • ex) $ cond \land !cond = false$
  • Exploit that the constraints of different src-sink paths are independent
    • We can solve them in parallel.
  • Use a cube-and-conquer parallel SMT solving strategy

Example

\begin{eqnarray} \\ && \pi = \langle free(b), b, c, print(*c) \rangle \\ && \Phi_{Guard}(\pi) = cond \land !cond \land O_b < O_c \\ && \Phi_{po}(\pi) = O_c < O_{print} \land O_b < O_{free} \\ && \Phi_{all} = \Phi_{Guard} \land \Phi_{po} \end{eqnarray} $Solve(\Phi_{all}) = false$
$\therefore$ This use-after-free path $\pi$ is not realizable.
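The encoding sketch promised above: the constraints for this example could be handed to Z3 roughly as follows (the paper uses Z3, but this encoding and the variable names cond, O_b, O_c, O_free, O_print are my own illustration):

#include <stdio.h>
#include <z3.h>

int main(void) {
  Z3_config cfg = Z3_mk_config();
  Z3_context ctx = Z3_mk_context(cfg);
  Z3_solver s = Z3_mk_solver(ctx);
  Z3_solver_inc_ref(ctx, s);

  Z3_sort bool_s = Z3_mk_bool_sort(ctx);
  Z3_sort int_s  = Z3_mk_int_sort(ctx);
  Z3_ast cond    = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "cond"), bool_s);
  Z3_ast O_b     = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "O_b"), int_s);
  Z3_ast O_c     = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "O_c"), int_s);
  Z3_ast O_free  = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "O_free"), int_s);
  Z3_ast O_print = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "O_print"), int_s);

  // Phi_Guard = cond && !cond && O_b < O_c
  Z3_ast guard[] = { cond, Z3_mk_not(ctx, cond), Z3_mk_lt(ctx, O_b, O_c) };
  // Phi_po = O_c < O_print && O_b < O_free  (intra-thread program order)
  Z3_ast po[]    = { Z3_mk_lt(ctx, O_c, O_print), Z3_mk_lt(ctx, O_b, O_free) };

  Z3_solver_assert(ctx, s, Z3_mk_and(ctx, 3, guard));
  Z3_solver_assert(ctx, s, Z3_mk_and(ctx, 2, po));

  // UNSAT means the use-after-free path is not realizable.
  printf("%s\n", Z3_solver_check(ctx, s) == Z3_L_FALSE ? "unrealizable" : "possibly realizable");

  Z3_solver_dec_ref(ctx, s);
  Z3_del_context(ctx);
  Z3_del_config(cfg);
  return 0;
}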

4. Implementation

  • Canary: implemented on top of the LLVM framework
    • Supports pthread and std::thread
    • Uses MHP (may-happen-in-parallel) analysis during interference dependence analysis
      • to prune value-flow paths
    • Uses Steensgaard's analysis to build the initial call graph
  • SMT Solver: Z3

5. Evaluation

  • Evaluation Points
    • Comparison in Value-flow Graph Construction
    • Comparison in Concurrency bug Detection
    • Detected Real Concurrency Bugs
  • Analyzed Projects
    • 20 Real-life open-source C/C++ projects
    • ex) Firefox, MariaDB, MySQL, ...
  • Experiment Environment
    • two 20-core Intel(R) Xeon(R) E5-2698 v4 CPUs @ 2.20GHz
    • 256GB physical memory
    • Ubuntu-16.04

Comparison in Value-flow Graph Construction

Time cost
  • Canary
    • finishes all projects
    • On average,
      • >15x faster than SABER
      • >180x faster than FSAM
  • SABER
    • Times out on 9 projects
  • FSAM
    • Times out on 15 projects
Memory cost
  • SABER uses 130 GB more memory than Canary on subjects larger than 100 KLoC.
  • FSAM uses 200 GB more memory than Canary on subjects larger than 50 KLoC.

Summary: Canary is much more efficient.
($\because$ FSAM and SABER perform an exhaustive, Andersen-style points-to analysis.)

Comparison in Concurrency bug Detection

We choose to check inter-thread use-after-free because it has become one of the most exploited classes of memory errors.

Scalability

We do not compare with FSAM and SABER because they suffer from scalability problems in VFG construction.
  • Time/Memory
    • Canary's cost grows almost linearly
    • MySQL (3 MLoC): 2.5 hours
    • Firefox (9 MLoC): 4.67 hours

Precision

  • Canary
    • Reports: 15 bugs
    • False positive rate: 26.67%
  • SABER
    • Reports: 9896 bugs
    • True positives among 100 sampled reports: 5 bugs
  • FSAM
    • Reports: 586 bugs
    • True positives among 100 sampled reports: 2 bugs

Summary: Canary is scalable and precise.
($\because$ Canary builds more precise dependence relations and takes feasible interleavings into account.)

Detected Real Concurrency Bugs

Canary has already detected 18 concurrency bugs across dozens of open-source projects.
  • Well-known software systems: Firefox, LevelDB, Transmission, ...
  • 14 bugs have been fixed in recent release versions.
  • aside) Confirming the concurrency bugs took the authors considerable team effort. (It is very difficult!)

7. Conclusion

  • Goal: Detect inter-thread value-flow bugs
  • Approach: Interference-aware value-flow analysis
    • Canary achieves both good precision and scalability.
  • Results: 18 previously-unknown concurrency bugs on famous projects
  • Future work
    • Enhancements to support more synchronization semantics (e.g., lock/unlock, signal/notify)
    • Relaxed memory models
    • Customized decision procedures to improve the efficiency of constraint solving

8. Relation to my research

Canary (recap)

  • Canary detects concurrency bugs that can be reduced to src-sink reachability.
    • e.g., Use-after-free, NULL pointer dereference, Double free, Race condition
// Thread 1   // Thread 2
free(x);
              use(x);

My research

  • By applying data-flow protection to concurrent programs, could we detect concurrency bugs that can be reduced to src-sink reachability?
    • Build an inter-thread VFG and compute the inter-thread data flows.
    • Protect the legal data flows.
      • Treat data flows that cause concurrency bugs as illegal.
      • Treat the other inter-thread data flows as legal.
      • To be considered: which data flows cause concurrency bugs.
  • Build a sound interference VFG
    • Canary performs partially unsound analysis, e.g., unrolling loops.
    • To be considered: from the DFI perspective, we want to build a sound VFG.