Introduction

Problem

  • Software written in C/C++ is hard to analyze because of:
    • unsafe type system
    • unrestricted use of pointers
    • (for C++) virtual dispatch

Solution

  • This work presents the novel program-analysis framework PhASAR.
    • PhASAR: a LLVM-based Static Analysis Framework
  • PhASAR make static analysis more accessible and easy to use.

Result

  • PhASAR provides whole-program analysis that scales well to real-world programs.

Data-Flow Analysis

Review: Basics of Data-Flow Analysis

Data-Flow Schema: common parts of data-flow analysis
  • Domain: The set of possible data-flow values
    • Data-flow value: Analysis Interests
  • Direction: Data-flow propagation direction
    • Forward or Backward
  • Transfer Function: Semantics of the statement
    • Input: Data-flow values before the statement
    • Output: Data-flow values after the statement
    • This function can be expressed as the following functions:
      • gen: Create new data-flow values
      • kill: Delete existing data-flow values
  • Meet: Operator that merge two (or more) data-flow values
    • This operator is used at:
      • join points
      • data-flow values update
  • Initialize: Initial data-flow values
Example: Reaching Definitions

Reaching Definitions is a data-flow analysis that statically determines which definitions may reach a given point in the code.

  • ex) At L.5, the following definitions may reach:
    • 3: x = 200;
    • 4: y = 300;
Reaching Definitions
Domain Assignment(Store) Instructions
Direction Forward
Transfer
Function
OUT = gen \( \cup \) (IN - kill)
Meet \( \cup \)
Initialize \( \phi \)

Example: Reaching Definitions

1
2
3
4
5
6
7
8
9
10
11
int x, y, z;
x = 100;
x = 200;
y = 300;
if (...) {
  y = 400;
  z = 500;
} else {
  y = 600;
  z = 700;
}
  • ex) Transfer function at L.3
    • kill: { x = 100 }
    • gen: { x = 200 }
Data-Flow Frameworks: conditions for convergence

To solve data-flow problems efficiently, we put conditions on transfer function \(f\).

  • Monotone Framework
    • Conditions: \(x \leq y \) implies \(f(x) \leq f(y)\)
    • Another expression: \(f(x \land y) \leq f(x) \land f(y)\)
  • Distributive Framework
    • Conditions: \(f(x \land y) = f(x) \land f(y)\)
    • Methods:
      • IFDS (Interprocedural Finite Distributive Subset)
      • IDE (Interprocedural Distributive Environments)
  • Pushdown Systems
    • Conditions: Same as Distributive Framework
    • Methods:
      • WPDS

Reaching Definitions can be solved by these frameworks.
We select monotone framework for reaching definitions.

Iterative Algorithm: solve the data-flow problem

Iterative Algorithm for a forward data-flow problem

  • node N: Unit for calculating the data-flow fact
    • Per Statement
    • Per Basic Block
    • ...
for (each node N) OUT[N] = Init();
while (change to any OUT occur) {
  for (each node N) {
    IN[N] = Meet all OUT[predecessor];
    OUT[N] = f(IN[N]);
  }
}
1
2
3
4
5
6
7
8
9
10
11
int x, y, z;
x = 100;
x = 200;
y = 300;
if (...) {
  y = 400;
  z = 500;
} else {
  y = 600;
  z = 700;
}
OUT[L.4]
3: x = 200
4: y = 300
Q1. OUT[L.7]
3: x = 200
6: y = 400
7: z = 500
Q2. OUT[L.10]
3:  x = 200
9:  y = 600
10: z = 700
Q3. OUT[L.11]
3:  x = 200
6:  y = 400
7:  z = 500
9:  y = 600
10: z = 700

PhASAR's Architecture

Goal of PhASAR

  • Make static analysis more accessible and easy to use.
  • Allow analysis designers to concentrate on the goal of a data-flow analysis.

To achieve

Precise data-flow analysis requires some information:

  • Control-flow Graph
    • for propagation of data-flow facts
  • Class Hierarchy
    • for precise type information
  • Points-to Information
    • for precise analysis
  • Algorithmic Frameworks
    • to solve data-flow problem

Architecture

  • Points-to Analysis
    • Andersen-style
    • Steensgard-style
  • Inter-procedural Control Flow Graph
  • Type Hierarchy Analysis
    • Construct the type hierarchy for composite types
    • Reconstruct the virtual-method tables
  • Data-Flow Frameworks
    • Monotone Framework
    • IFDS
    • IDE
    • WPDS (Weighted Pushdown Systems)
PhASAR's high-level architecture

How to use

  1. Choose data-flow framework
  2. Define your data-flow problem according to the framework
  3. PhASAR solve the problem automatically!!

Example: Encoding a Monotone Analysis

To solve data-flow problems using PhASAR, we need to define the followings:

  • Framework + Direction
  • Domain
  • Initialize
  • Transfer Function
  • Meet
  • Convergence Check

Let's encode reaching definitions using PhASAR!

If you are interested in implementation, see here.

Framework + Direction

  • Framework: Intra-procedural Monotone Framework
    • The simplest framework
  • Direction: Forward
1
2
3
class ReachingDefinitions : public IntraMonoProblem<Domain> {
  // implement some functions here
};

Domain

  • Data-Flow Value (Fact): Assignment(Store) Instruction
    • Collect definition statements.
  • Node N: Single Instruction
    • Compute data-flow values for each statement.
1
2
3
4
5
6
7
8
struct ReachingDefinitionsDomain : public LLVMAnalysisDomainDefault {
  // Data-flow fact
  using d_t = const llvm::StoreInst *;
  using mono_container_t = BitVectorSet<d_t>;
  
  // Node N
  using n_t = const llvm::Instruction *;
};

Initialize

  • Initialize: \( \phi \)
    • There are no definitions at the start.
1
2
3
4
map<n_t, mono_container_t>
initialSeeds() {
  return {};
}

Transfer Function

  • Transfer Function: \( gen \cup ( x - kill ) \)
    • gen: a new definition statement
    • kill: old statements that are assigned to the same variable as gen
1
2
3
4
5
6
7
8
9
10
11
12
13
14
mono_container_t normalFlow(n_t inst, mono_container_t in) {
  mono_container_t out = in;
  if (inst is StoreInst) {
    // kill
    for (def : out) {
      if (inst.targetOperand == def.targetOperand)
        out.erase(def);
    }
    
    // gen
    out.insert(inst);
  }
  return out;
}

Meet Operator

  • Meet Operator: \( \cup \)
1
2
3
mono_container_t merge(mono_container_t lhs, mono_container_t rhs) {
  return lhs.setUnion(rhs);
}

Convergence Check

  • Check whether data-flow values are changed
1
2
3
bool equal_to(mono_container_t lhs, mono_container_t rhs) {
  return lhs == rhs;
}

Analysis Result

Encoding Complete!!

Analyzing the example code, we obtain the following results.

1
2
3
4
5
6
7
8
9
10
11
int x, y, z;
x = 100;
x = 200;
y = 300;
if (...) {
  y = 400;
  z = 500;
} else {
  y = 600;
  z = 700;
}
OUT[L.4]
3: x = 200
4: y = 300
OUT[L.7]
3: x = 200
6: y = 400
7: z = 500
OUT[L.10]
3:  x = 200
9:  y = 600
10: z = 700
OUT[L.11]
3:  x = 200
6:  y = 400
7:  z = 500
9:  y = 600
10: z = 700

Evaluation

Scalability

Perform the following analyses on real programs:
  • Trivial Analysis
    • Framework: IFDS/IDE
    • Data-flow fact: Tautological fact
    • Baseline analysis for IFDS/IDE
  • Taint Analysis
    • Framework: IFDS/IDE
    • Data-flow fact: Tainted values
    • provide PhASAR's IFDS/IDE analysis is scalable
Program kLOC Stmts Ptrs Allocs Total Time[s]
(Trivial / Taint)
wc 132 63166 10644 396 2/13
ls 152 71712 13200 438 4/5
cat 130 62588 10584 391 2/3
cp 141 64097 11722 443 3/792
whoami 129 61860 10433 389 2/2
dd 137 65287 11150 408 2/40
fold 130 62201 10509 390 2/2
join 134 64196 11042 402 2/2
kill 130 62304 10527 394 2/2
uniq 131 62663 10650 396 2/2
MPT 3514 1351735 755567 176540 458/379
PhASAR 3554 1368297 763796 178486 1064/993

Results of experiments: PhASAR is capable of analyzing even a million-line program within minutes.