PhASAR[TACAS'19]
An Inter-procedural Static Analysis Framework for C/C++[TACAS'19]
Introduction
Problem
-
Software written in C/C++ is hard to analyze because of:
- unsafe type system
- unrestricted use of pointers
- (for C++) virtual dispatch
Solution
-
This work presents the novel program-analysis framework PhASAR.
- PhASAR: a LLVM-based Static Analysis Framework
- PhASAR make static analysis more accessible and easy to use.
Result
- PhASAR provides whole-program analysis that scales well to real-world programs.
Data-Flow Analysis
Review: Basics of Data-Flow Analysis
Data-Flow Schema: common parts of data-flow analysis
-
Domain: The set of possible data-flow values
- Data-flow value: Analysis Interests
-
Direction: Data-flow propagation direction
- Forward or Backward
-
Transfer Function: Semantics of the statement
- Input: Data-flow values before the statement
- Output: Data-flow values after the statement
-
This function can be expressed as the following functions:
- gen: Create new data-flow values
- kill: Delete existing data-flow values
-
Meet: Operator that merge two (or more) data-flow values
-
This operator is used at:
- join points
- data-flow values update
-
This operator is used at:
- Initialize: Initial data-flow values
Example: Reaching Definitions
Reaching Definitions is a data-flow analysis that statically determines which definitions may reach a given point in the code.
-
ex) At L.5, the following definitions may reach:
3: x = 200;
4: y = 300;
Reaching Definitions | |
---|---|
Domain | Assignment(Store) Instructions |
Direction | Forward |
Transfer Function |
OUT = gen \( \cup \) (IN - kill) |
Meet | \( \cup \) |
Initialize | \( \phi \) |
Example: Reaching Definitions
1
2
3
4
5
6
7
8
9
10
11
int x, y, z;
x = 100;
x = 200;
y = 300;
if (...) {
y = 400;
z = 500;
} else {
y = 600;
z = 700;
}
-
ex) Transfer function at L.3
- kill: { x = 100 }
- gen: { x = 200 }
Data-Flow Frameworks: conditions for convergence
To solve data-flow problems efficiently, we put conditions on transfer function \(f\).
-
Monotone Framework
- Conditions: \(x \leq y \) implies \(f(x) \leq f(y)\)
- Another expression: \(f(x \land y) \leq f(x) \land f(y)\)
-
Distributive Framework
- Conditions: \(f(x \land y) = f(x) \land f(y)\)
-
Methods:
- IFDS (Interprocedural Finite Distributive Subset)
- IDE (Interprocedural Distributive Environments)
-
Pushdown Systems
- Conditions: Same as Distributive Framework
-
Methods:
- WPDS
Reaching Definitions can be solved by these frameworks.
We select monotone framework for reaching definitions.
Iterative Algorithm: solve the data-flow problem
Iterative Algorithm for a forward data-flow problem
-
node N: Unit for calculating the data-flow fact
- Per Statement
- Per Basic Block
- ...
for (each node N) OUT[N] = Init();
while (change to any OUT occur) {
for (each node N) {
IN[N] = Meet all OUT[predecessor];
OUT[N] = f(IN[N]);
}
}
1
2
3
4
5
6
7
8
9
10
11
int x, y, z;
x = 100;
x = 200;
y = 300;
if (...) {
y = 400;
z = 500;
} else {
y = 600;
z = 700;
}
3: x = 200
4: y = 300
Q1. OUT[L.7]
3: x = 200
6: y = 400
7: z = 500
Q2. OUT[L.10]
3: x = 200
9: y = 600
10: z = 700
Q3. OUT[L.11]
3: x = 200
6: y = 400
7: z = 500
9: y = 600
10: z = 700
PhASAR's Architecture
Goal of PhASAR
- Make static analysis more accessible and easy to use.
- Allow analysis designers to concentrate on the goal of a data-flow analysis.
To achieve
Precise data-flow analysis requires some information:
-
Control-flow Graph
- for propagation of data-flow facts
-
Class Hierarchy
- for precise type information
-
Points-to Information
- for precise analysis
-
Algorithmic Frameworks
- to solve data-flow problem
Architecture
-
Points-to Analysis
- Andersen-style
- Steensgard-style
- Inter-procedural Control Flow Graph
-
Type Hierarchy Analysis
- Construct the type hierarchy for composite types
- Reconstruct the virtual-method tables
-
Data-Flow Frameworks
- Monotone Framework
- IFDS
- IDE
- WPDS (Weighted Pushdown Systems)
How to use
- Choose data-flow framework
- Define your data-flow problem according to the framework
- PhASAR solve the problem automatically!!
Example: Encoding a Monotone Analysis
To solve data-flow problems using PhASAR, we need to define the followings:
- Framework + Direction
- Domain
- Initialize
- Transfer Function
- Meet
- Convergence Check
Let's encode reaching definitions using PhASAR!
If you are interested in implementation, see here.
Framework + Direction
-
Framework: Intra-procedural Monotone Framework
- The simplest framework
- Direction: Forward
1
2
3
class ReachingDefinitions : public IntraMonoProblem<Domain> {
// implement some functions here
};
Domain
-
Data-Flow Value (Fact): Assignment(Store) Instruction
- Collect definition statements.
-
Node N: Single Instruction
- Compute data-flow values for each statement.
1
2
3
4
5
6
7
8
struct ReachingDefinitionsDomain : public LLVMAnalysisDomainDefault {
// Data-flow fact
using d_t = const llvm::StoreInst *;
using mono_container_t = BitVectorSet<d_t>;
// Node N
using n_t = const llvm::Instruction *;
};
Initialize
-
Initialize: \( \phi \)
- There are no definitions at the start.
1
2
3
4
map<n_t, mono_container_t>
initialSeeds() {
return {};
}
Transfer Function
-
Transfer Function: \( gen \cup ( x - kill ) \)
- gen: a new definition statement
- kill: old statements that are assigned to the same variable as gen
1
2
3
4
5
6
7
8
9
10
11
12
13
14
mono_container_t normalFlow(n_t inst, mono_container_t in) {
mono_container_t out = in;
if (inst is StoreInst) {
// kill
for (def : out) {
if (inst.targetOperand == def.targetOperand)
out.erase(def);
}
// gen
out.insert(inst);
}
return out;
}
Meet Operator
- Meet Operator: \( \cup \)
1
2
3
mono_container_t merge(mono_container_t lhs, mono_container_t rhs) {
return lhs.setUnion(rhs);
}
Convergence Check
- Check whether data-flow values are changed
1
2
3
bool equal_to(mono_container_t lhs, mono_container_t rhs) {
return lhs == rhs;
}
Analysis Result
Encoding Complete!!
Analyzing the example code, we obtain the following results.
1
2
3
4
5
6
7
8
9
10
11
int x, y, z;
x = 100;
x = 200;
y = 300;
if (...) {
y = 400;
z = 500;
} else {
y = 600;
z = 700;
}
3: x = 200
4: y = 300
3: x = 200
6: y = 400
7: z = 500
3: x = 200
9: y = 600
10: z = 700
3: x = 200
6: y = 400
7: z = 500
9: y = 600
10: z = 700
Evaluation
Scalability
Perform the following analyses on real programs:-
Trivial Analysis
- Framework: IFDS/IDE
- Data-flow fact: Tautological fact
- Baseline analysis for IFDS/IDE
-
Taint Analysis
- Framework: IFDS/IDE
- Data-flow fact: Tainted values
- provide PhASAR's IFDS/IDE analysis is scalable
Program | kLOC | Stmts | Ptrs | Allocs | Total Time[s] (Trivial / Taint) |
---|---|---|---|---|---|
wc | 132 | 63166 | 10644 | 396 | 2/13 |
ls | 152 | 71712 | 13200 | 438 | 4/5 |
cat | 130 | 62588 | 10584 | 391 | 2/3 |
cp | 141 | 64097 | 11722 | 443 | 3/792 |
whoami | 129 | 61860 | 10433 | 389 | 2/2 |
dd | 137 | 65287 | 11150 | 408 | 2/40 |
fold | 130 | 62201 | 10509 | 390 | 2/2 |
join | 134 | 64196 | 11042 | 402 | 2/2 |
kill | 130 | 62304 | 10527 | 394 | 2/2 |
uniq | 131 | 62663 | 10650 | 396 | 2/2 |
MPT | 3514 | 1351735 | 755567 | 176540 | 458/379 |
PhASAR | 3554 | 1368297 | 763796 | 178486 | 1064/993 |
Results of experiments: PhASAR is capable of analyzing even a million-line program within minutes.