Virtual Machines Ch.4
-- Dynamic Binary Optimization
Review + Overview
Review
Chapter 4, Dynamic Binary Optimization, covers the following topics:
- Typical program behaviour for optimization heuristics (4.1)
- Profiling techniques to collect profile information (4.2)
- Forming large translation/optimization blocks using profile (4.3)
- Overall view of dynamic optimization (4.4)
- Reordering optimization (4.5)
- ...
Overview of Sections 4.4 and 4.5
- 4.4 Optimization Framework
- Overall view of dynamic optimization
- 4.4.1 Approach
- Overall approach to dynamic optimization
- 4.4.2 Optimization and Compatibility
- Compatibility issues
- 4.4.3 Consistent Register Mapping
- Register mapping issues
- 4.5 Code Reordering
- Reordering optimization
- 4.5.1 Primitive Instruction Reordering
- Simple swap reordering
- 4.5.2 Implementing a Scheduling Algorithm
- Complete reordering algorithm
- 4.5.3 Superblocks Versus Traces
- Granularity of dynamic optimization
4.4 Optimization Framework
In this section, we discuss the overall issues common to all dynamic optimization methods.
First, we compare dynamic optimization with static optimization:
| | Static optimization (cf. Dragon Book) | Dynamic optimization (in this section) |
|---|---|---|
| Timing | At compilation (before execution) | At run time (during or after execution) |
| Analyzed program | High-level language, IR (binary) | Binary (bytecode) |
| Analysis granularity | Function (intra-procedural), whole program (inter-procedural) | Trace, superblock, (dynamic basic block) |
| Available information | High-level semantics (e.g., data structure declarations) | Profile information (e.g., execution frequency of paths) |
| Desired property | Slow but thorough optimization | Fast, low-overhead methods (because optimization is performed at run time) |
4.4.1 Approach
In this subsection, we discuss the overall approach to dynamic optimization.
Overall Approach
- Collect basic blocks from the code cache, using profile information
- Convert to an intermediate (SSA) form; place in a buffer
- Schedule and optimize
- Generate side-table information (for precise state recovery)
- Generate target code
- Add compensation code; place in the code cache
In brief, the overall approach is as follows:
- Form superblock
- Convert to SSA
- Optimize!
Optimization of Traces or Superblocks
- Trace: multiple entries, multiple exits
  - During optimization, compensation code is inserted at entry and exit points.
- Superblock: single entry, multiple exits
  - During optimization, compensation code is inserted at exit points.

Compensation code makes up for state changes that would otherwise be lost due to optimization. We discuss compensation code in Section 4.5.1, Primitive Instruction Reordering.
4.4.2 Optimization and Compatibility
As discussed in Section 3.2, we must maintain compatibility even after optimization.
Review: Compatibility for process VM
We must maintain an isomorphic relationship between guest VM and host platform.
To achieve this, we focus on all control-transfer points between guest and host:
- system call
- return from system call
- trap: the main concern for optimizations
- interrupt
And at these points, we must form the precise state:
- same register state
- same memory state
Trap compatibility
A process VM is trap compatible if:
- any trap that occurs during emulation also occurs during native execution, and
- any trap that occurs during native execution also occurs during emulation.

Note: except for page faults (see Chapter 7).

Briefly: trap timing in guest == trap timing in host.
Trap compatibility must be maintained before and after optimization.
Remove redundant instructions

Source
r4 = r6 + 1
r1 = r2 + r3 // redundant but may trap
r1 = r4 + r5
r6 = r1 * r7

Target
R4 = R6 + 1
// R1 = R2 + R3 // removed; no trap occurs
R1 = R4 + R5
R6 = R1 * R7

In this example, the optimization violates trap compatibility because:
- In the source, r1 = r2 + r3 may trap.
- In the target, no trap occurs, because r1 = r2 + r3 has been removed.
The converse condition (no trap occurs in the target that would not occur in the source) can still be maintained.
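The effect above can be reproduced in a small simulation. The following Python sketch (not from the book) assumes a toy machine whose add traps on signed 16-bit overflow; removing the "redundant but trapping" add makes the trap disappear, which violates trap compatibility even though the final register values agree.

```python
class Trap(Exception):
    pass

LIMIT = 1 << 15  # assumed machine: add traps on signed 16-bit overflow

def add(regs, dst, a, b):
    v = regs[a] + regs[b]
    if v >= LIMIT:
        raise Trap(f"overflow writing {dst}")
    regs[dst] = v

def run(program, regs):
    """Execute a list of instructions; report whether a trap occurred."""
    try:
        for ins in program:
            ins(regs)
        return "no trap"
    except Trap as t:
        return f"trap: {t}"

init = {"r2": 30000, "r3": 30000, "r4": 1, "r5": 2, "r7": 4}

source = [
    lambda r: add(r, "r1", "r2", "r3"),  # redundant result, but overflows
    lambda r: add(r, "r1", "r4", "r5"),
    lambda r: add(r, "r6", "r1", "r7"),
]
optimized = [                            # dead assignment removed
    lambda r: add(r, "r1", "r4", "r5"),
    lambda r: add(r, "r6", "r1", "r7"),
]

print(run(source, dict(init)))     # trap: overflow writing r1
print(run(optimized, dict(init)))  # no trap
```

An observer of traps can distinguish the two versions, which is exactly what trap compatibility forbids.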
Handling of divide-by-zero

Source
r1 = 0
r3 = r2 / r1 // no trap

Target
R1 = 0
R3 = R2 / R1 // divide-by-zero trap

In this example, the runtime can maintain trap compatibility because:
- In the source, r3 = r2 / r1 does not trap.
- In the target, a divide-by-zero trap occurs, but the runtime can intercept and ignore it.
Memory and Register Compatibility at Trap Instructions
Memory and register state compatibility are maintained if:
- the runtime can reconstruct the same memory state as the state on guest VM.
- the runtime can reconstruct the same register state as the state on guest VM.
Briefly, memory/register state in guest == memory/register state in host.
Memory and register state compatibility must be maintained even after optimization.
Code reordering optimization may violate register state compatibility.
1: r1 = r2 + r3
2: r9 = r1 + r5 // trap occurs
3: r6 = r1 * r7 // time consuming
4: r3 = r6 + 1
1: R1 = R2 + R3
3: R6 = R1 * R7 // reordered
2: R9 = R1 + R5 // trap occurs
4: R3 = R6 + 1
It might be beneficial to move r6 = r1 * r7 earlier because the multiply takes several cycles.
But reordering instructions 2 and 3 violates register state compatibility: when r9 = r1 + r5 traps,
- in the source, r6 has not yet been updated;
- in the target, R6 has already been updated.
1 : R1 = R2 + R3
3': S1 = R1 * R7 // S1 holds the multiply result
2 : R9 = R1 + R5 // trap occurs
3 : R6 = S1 // R6 is updated in the original order
4 : R3 = S1 + 1

To maintain register state compatibility, the saved register S1 holds the multiply result until instruction 3 commits it to R6.
This optimized code updates registers in the same order as the original source.
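The same scheme can be simulated to check that the trap handler sees identical architected state. In this Python sketch (illustrative; a trap is modeled simply by stopping before a chosen instruction), the hoisted multiply writes a scratch register S1, so r6 is unchanged in both versions when instruction 2 traps.

```python
def snapshot_at_trap(program, regs, trap_at):
    """Run (name, dst, fn) triples; return the register state visible
    when the instruction named `trap_at` traps (before it writes)."""
    for name, dst, fn in program:
        if name == trap_at:
            return dict(regs)
        regs[dst] = fn(regs)
    return dict(regs)

init = {"r1": 0, "r2": 2, "r3": 3, "r5": 5, "r6": 0, "r7": 7}

source = [
    ("1", "r1", lambda r: r["r2"] + r["r3"]),
    ("2", "r9", lambda r: r["r1"] + r["r5"]),   # may trap
    ("3", "r6", lambda r: r["r1"] * r["r7"]),   # slow multiply
    ("4", "r3", lambda r: r["r6"] + 1),
]
scheduled = [
    ("1",  "r1", lambda r: r["r2"] + r["r3"]),
    ("3'", "S1", lambda r: r["r1"] * r["r7"]),  # hoisted into scratch S1
    ("2",  "r9", lambda r: r["r1"] + r["r5"]),  # may trap
    ("3",  "r6", lambda r: r["S1"]),            # commit in source order
    ("4",  "r3", lambda r: r["S1"] + 1),
]

src_state = snapshot_at_trap(source, dict(init), "2")
opt_state = snapshot_at_trap(scheduled, dict(init), "2")
print(src_state["r6"], opt_state["r6"])  # 0 0: r6 not yet updated in either
```

S1 is a scratch register invisible to the guest, so its early update does not affect the architected state the handler observes.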
Summary
After optimization, we must maintain compatibility:
- Trap compatibility
- Trap timing in guest == Trap timing in host
- Memory/Register compatibility
- Memory/Register state in guest == Memory/Register state in host
Answer whether this optimization maintains compatibility or not.
1 : mem[r1] = r2
2 : r1 = r2 * r3
3 : ...
2': S1 = r2 * r3
1 : mem[r1] = r2
2 : r1 = S1
3 : ...
Yes. The update of r1 is deferred: the saved register S1 holds the new value until the store completes, so if mem[r1] = r2 traps, r1 still has its original value, just as in the source.
1 : r1 = r2 + r3
2 : if false br exit
3 : r4 = mem[r1]
4 : ...
1 : r1 = r2 + r3
// if false br exit
3 : r4 = mem[r1];
4 : ...
Yes. Branch instructions do not trap, and since the branch condition is false the branch is never taken, so removing it changes neither trap behavior nor control flow.
4.4.3 Consistent Register Mapping
Register mapping was discussed in Section 2.5, 2.8.1, 3.3.1.
In optimization, consistent register mapping is important for performance.
There are three cases for register mapping:
- Source registers << target registers
  - Each source register is permanently mapped to a corresponding target register.
  - Method: fixed register mapping
- Source registers ≈ target registers
- Source registers > target registers
  - Not all source registers can be permanently mapped to target registers.
  - Two types of mapping methods:
    - Inconsistent register mapping
    - Consistent register mapping
Inconsistent Register Mapping
Inconsistent register mapping has the following properties:
- Source registers are mapped on a per-translation-block basis.
  - Different translation blocks may use different register mappings.
  - Ex) While r1 is used for R1 in block A, r2 is used for R1 in block B.
- Extra register copies are required for consistency.
Inconsistent Register Mapping
Source
R1 = 100
br C
R1 = 200
if cond br C
C:
R5 = R4 + R1
...
Target with inconsistent register mapping
// R1 <-> r1
r1 = 100
br C
// R1 <-> r2
r2 = 200
if cond, br C
// R1 <-> r3
C:
r5 = r4 + r3
...
This target code has a problem with the mapping of R1:
- r1 is used in superblock A,
- r2 is used in superblock B,
- r3 is used in superblock C.
Target with extra register copy
// R1 <-> r1
r1 = 100;
br C'
C': r3 = r1 // R1 <-> r3
br C
// R1 <-> r2
r2 = 200
if cond, br C''
C'': r3 = r2 // R1 <-> r3
if cond, br C
// R1 <-> r3
C:
r5 = r4 + r3
...
If translated code jumps back to the interpreter, the runtime must also copy the target register values into the register context block.
Consistent Register Mapping
Consistent Register Mapping has the following properties:
- Some source registers are mapped to the same target registers across all translation blocks.
  - Ex) R1 is mapped to r1 in every translation block.
- Other source registers are mapped to a memory region.
  - Ex) R10 is mapped to the register context block.
  - Memory accesses may occur when these registers are used.
Simple Fixed Register Mapping
Source
R1 = 100
R10 = 200
br F
R1 = 300
R10 = 400
br F
F:
R9 = R10 + R1
...
Target with consistent register mapping
// R1 <-> r1
// R10 <-> reg[10]
r1 = 100
reg[10] = 200
br F
// R1 <-> r1
// R10 <-> reg[10]
r1 = 300
reg[10] = 400
br F
// R9 <-> reg[9]
F:
reg[9] = reg[10] + r1
...
This example shows that memory accesses (reg[9], reg[10]) occur for the registers mapped to the context block.
In Section 4.6.3, consistent but not fixed methods are discussed.
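A fixed mapping like the one above can be sketched as a tiny code generator. In this Python sketch, the register names and the choice of which source registers live in target registers are illustrative assumptions, not from the book.

```python
TARGET_REGS = {"R1": "r1", "R2": "r2", "R3": "r3"}  # fixed, consistent mapping

def loc(src_reg):
    """Target location of a source register: a target register if it has
    one, otherwise a slot in the memory-resident register context block."""
    return TARGET_REGS.get(src_reg, f"reg[{src_reg}]")

def emit_assign(dst, expr):
    """Emit target code for `dst = expr` under the fixed mapping."""
    return f"{loc(dst)} = {expr}"

# R1 stays in a target register; R10 lives in the context block.
print(emit_assign("R1", "100"))                           # r1 = 100
print(emit_assign("R10", "200"))                          # reg[R10] = 200
print(emit_assign("R9", f"{loc('R10')} + {loc('R1')}"))   # reg[R9] = reg[R10] + r1
```

Because the mapping is the same in every block, no copy code is needed at block boundaries; the price is the memory traffic for spilled registers such as R10.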
4.5 Code Reordering
Code reordering is an important optimization because:
- In simple pipelined microarchitectures, instructions are executed strictly in program order.
  - Code reordering improves pipeline performance.
- The source binary is not scheduled for the target processor.
  - VLIW processors, for example, assume that the compiler reorders instructions to gather independent operations.
- In dynamic out-of-order superscalar processors:
  - Code reordering can still help hide functional-unit latencies (especially cache memory latencies).

Because of its importance, we consider code reordering before other optimizations.
We first discuss several important issues about code reordering.
4.5.1 Primitive Instruction Reordering
In this subsection, we consider reordering pairs of adjacent instructions (swapping one with another).
We proceed in the following steps:
- Categorize the instructions into four categories:
  - reg: register updates
  - mem: memory updates
  - br: branch instructions
  - join: join points
- Discuss pairwise reordering (4 × 4 = 16 cases):
  - br ⇆ another
  - join ⇆ another
  - remaining pairs
Four instruction categories
- reg: register updates
  - Description: instructions that produce a register result
  - Property: can be "undone" (using saved registers, or by saving values to memory); register access is fast.
  - Examples:
    mov rax, [rbx] // rax = mem[rbx]
    add r1, r2, r3 // r1 = r2 + r3
- mem: memory updates
  - Description: instructions that place a value in memory
  - Property: cannot easily be "undone" (saving memory state has a high cost in time and space); memory access is slow.
  - Examples:
    mov [rax], rbx // mem[rax] = rbx
    stw r1, 4(r2) // mem[r2 + 4] = r1
- br: branch instructions
  - Property: do not trap
  - Example:
    bz label // if ZF == 0 then br label
- join: join points
  - Description: points where a jump or branch target enters the code sequence
  - Property: joins are not instructions, but they can affect reordering
  - Example:
    br label
    ...
    label: // join point
    ...
In addition, accesses to volatile memory locations are categorized as mem.
- A volatile memory location may be accessed by an entity other than the process:
  - shared memory used by another process or thread
  - memory-mapped I/O used by the I/O system
- Property: such accesses often cannot be reordered:
  - We cannot move an access to shared memory before acquiring a lock (mutex).
  - An access to memory-mapped I/O may trigger I/O device functions.
- Category: mem (because the state cannot be recovered easily)

In some VM applications, the emulation system must optimize conservatively because it cannot tell whether a memory access is volatile.
Example: categorizing a code sequence
R1 = mem[R6] // reg (mem if mem[R6] is volatile)
R2 = mem[R6 + 4] // reg (mem if mem[R6 + 4] is volatile)
R3 = R1 + 1 // reg
R4 = R1 << 2 // reg
br exit if R7 == 0 // br
R7 = R7 + 1 // reg
mem[R6] = R3 // mem
exit: // join
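The categorization above can be expressed as a small classifier. This Python sketch assumes a toy instruction syntax (dst = expr, mem[...] on the left-hand side for stores, br ... for branches, label: for join points); it is not a general parser.

```python
def categorize(ins):
    """Classify one toy instruction as reg / mem / br / join."""
    ins = ins.split("//")[0].strip()   # drop trailing comments
    if ins.endswith(":"):
        return "join"                  # a label: a point, not an instruction
    if ins.startswith("br "):
        return "br"
    lhs = ins.split("=")[0].strip()
    if lhs.startswith("mem["):
        return "mem"                   # memory update: hard to undo
    return "reg"                       # register update: can be undone

block = [
    "R1 = mem[R6]",
    "R3 = R1 + 1",
    "br exit if R7 == 0",
    "mem[R6] = R3",
    "exit:",
]
print([categorize(i) for i in block])
# ['reg', 'reg', 'br', 'mem', 'join']
```

A real system would additionally mark loads and stores to known-volatile locations as mem, as discussed above.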
We will now discuss the reordering of two adjacent instructions.
Two questions determine whether a reordering is allowed:
- Are memory and register states recoverable at trap instructions?
  - If so, the reordering is allowed.
  - If not, the reordering is not allowed.
- Does the reordering change control flow?
  - If so, the reordering is not allowed.
  - If not, the reordering is allowed.
br ⇆ another
Move reg or mem below br
We can move a reg or mem instruction below a br because:
- The moved instruction is still executed under both branch-taken and not-taken (with compensation code on the taken path).
- br does not trap.
- Control flow does not change.
Move reg below br
Before reordering
1 : R3 = R1 + 1
2 : R4 = R1 << 2 // reg
3 : br exit if R7 == 0 // br
4 : R7 = R7 + 1
100 : exit:
101 : ...
After reordering
1 : R3 = R1 + 1
3 : br 2' if R7 == 0 // br
2 : R4 = R1 << 2 // reg
4 : R7 = R7 + 1
2': R4 = R1 << 2
: br exit
100 : exit:
101 : ...
Compensation code ensures that the state is the same as original code.
Move reg above br
We can move a reg instruction above a br because:
- The moved instruction writes a saved register, so the original value of its destination register is retained until the original position.
- Control flow does not change.
Move reg above br
1 : R2 = R1 << 2
2 : br exit if R8 == 0 // br
3 : R6 = R7 * R2 // reg
4 : mem[R6] = R3
5 : R6 = R2 + 2
1 : R2 = R1 << 2
3': T1 = R7 * R2 // saved register
2 : br exit if R8 == 0 // br
3 : R6 = T1 // reg
4 : mem[R6] = R3
5 : R6 = R2 + 2
There are two reasons why we use the saved register T1:
- R6 must be live at label exit.
- R6 must be live until the original position (between instructions 2 and 4).

Here "live" means that the original value is still held.

If R6 is not used at those points, we can omit the explicit register copy.
The important point is that the correct value of R6 is always available: R6 holds the old value, and T1 holds the new value.
1 : R2 = R1 << 2
3': T1 = R7 * R2 // saved register
2 : br exit if R8 == 0 // br
// 3: R6 = T1
4 : mem[T1] = R3
5 : R6 = R2 + 2
Move mem above br
We cannot move a mem instruction above a br because:
- The moved instruction would be executed under both branch-taken and not-taken, while originally it is executed only under branch-not-taken.
- mem cannot be "undone" (or undoing requires extra instructions).
Move mem above br
Before reordering
1 : br exit if R8 == 0 // br
2 : mem[R1] = R2 // mem
3 : ...
4 : exit:
5 : ...
After reordering
2 : mem[R1] = R2 // mem
1 : br exit if R8 == 0 // br
3 : ...
4 : exit:
// cannot recover mem[R1]
5 : ...
This example shows that mem[R1] cannot be recovered on the taken path.
Memory compatible but slow reordering
2': T1 = mem[R1] // extra load
2 : mem[R1] = R2 // mem
1 : br 2'' if R8 == 0 // br
3 : ...
2'': mem[R1] = T1 // recover
: br exit
4 : exit:
5 : ...
This reordering looks workable, but it adds an extra memory access.
For this reason, we assume from here on that mem cannot be recovered ("undone").
Move br above/below join
We cannot move a br above or below a join because:
- It changes control flow.
Move br above join
1 : ...
2 : br label
3 : label: // join
4 : br exit if R1 == 0 // br
1 : ...
2 : br label
4 : br exit if R1 == 0 // br
3 : label: // join
This example shows that moving br around join changes control flow.
Swap br and br
We cannot swap two br instructions because:
- It changes control flow.
1 : ...
2 : br label1 // br
3 : br label2 // br
1 : ...
3 : br label2 // br
2 : br label1 // br
Obviously control flow is changed.
Answer whether this optimization is allowed and why.
1 : if cond br label
2 : r1 = mem[r2]
2 : r1 = mem[r2]
1 : if cond br label
No, not as written: if the branch is taken, r1 must keep its original value. The reordering is allowed with a saved register:
2': t1 = mem[r2]
1 : if cond br label
2 : r1 = t1
1 : mem[R1] = R2
2 : br label
3 : label:
4 : ...
2 : br label
1 : mem[R1] = R2
3 : label:
4 : ...
No, not as written: on the taken path the store would be skipped. The reordering is allowed with compensation code:
2 : br label'
1 : mem[R1] = R2
5 : label':
6 : mem[R1] = R2
7 : br label
join ⇆ another
Move reg/mem above join
We can move a reg or mem instruction above a join because:
- The moved instruction is executed on both the fall-through and join-from-outside paths (by inserting compensation code).
- Control flow does not change.
Move reg above join
1 : ...
2 : br label
: ...
3 : label: // join
4 : R7 = mem[R6] // reg
5 : ...
1 : ...
2 : br 4'
4': R7 = mem[R6] // compensation
: br label
: ...
4 : R7 = mem[R6] // moved above
3 : label: // join
5 : ...
This reordering alone is pointless, but it enables further optimizations.
Move reg or mem below join
We cannot move a reg or mem instruction below a join because:
- The moved instruction would be executed on both the fall-through and join-from-outside paths, while originally it is executed only on the fall-through path.
Move reg below join
1 : R1 = 100
2 : br label
3 : R1 = 200 // reg
4 : label: // join
5 : ...
1 : R1 = 100
2 : br label
4 : label: // join
3 : R1 = 200 // reg
5 : ...
This example shows that along the path 2 → 4 → 3, the value of R1 changes (it becomes 200 instead of 100).
Swap join points
We can swap two join points because:
- Swapping them has no effect: a join is a point, not an instruction.
Swap join points
1 : R1 = 100
2 : br label1
3 : R2 = 200
4 : br label2
5 : label1: // join
6 : label2: // join
7 : R3 = R1 + R2
1 : R1 = 100
2 : br label1
3 : R2 = 200
4 : br label2
5 : label2: // join
6 : label1: // join
7 : R3 = R1 + R2
Clearly, this reordering has no effect.
Answer whether this optimization is allowed and why.
1 : ...
2 : br label
3 : label: // join
4 : mem[R1] = R2 // mem
5 : ...
1 : ...
2 : br label'
4': label'
: mem[R1] = R2
: br label
4 : mem[R1] = R2 // mem
3 : label: // join
5 : ...
Yes. Compensation code ensures that mem[R1] = R2 is executed on both the fall-through and join-from-outside paths.
: ...
1 : mem[R1] = 100
2 : br label
: ...
4 : mem[R1] = 200 // mem
5 : label: // join
: ...
1 : mem[R1] = 100
2 : br label
: ...
5 : label: // join
4 : mem[R1] = 200 // mem
No. The moved store mem[R1] = 200 now also executes on the join-from-outside path, changing the value of mem[R1].
Straight-line reordering (reg/mem ⇆ reg/mem)
Move reg above reg/mem
We can move a reg instruction above a reg or mem instruction because:
- reg can be "undone" using saved registers.
- Control flow does not change.
Move reg above mem
1 : R1 = R1 * 3
2 : mem[R6] = R1 // mem
3 : R7 = R7 << 3 // reg
4 : R9 = R7 + R2
1 : R1 = R1 * 3
3': T1 = R7 << 3 // saved reg
2 : mem[R6] = R1 // mem
3 : R7 = T1 // reg
4 : R9 = T1 + R2
If mem[R6] = R1 traps, R7 still holds the same value as in the source (the new value is in T1).
Move mem above reg/mem
We cannot move mem above reg/mem because:
- mem cannot be "undone".
Move mem above reg
1 : R1 = R2 + R3 // reg
2 : mem[R4] = 100 // mem
3 : ...
2 : mem[R4] = 100 // mem
1 : R1 = R2 + R3 // reg
3 : ...
We assumed above that mem cannot be recovered ("undone").
If 1: R1 = R2 + R3 traps, we cannot recover the original value of mem[R4].
Answer whether this optimization is allowed and why.
1 : R1 = R2 + R3 // reg
2 : R4 = R1 * R5 // reg
2': T1 = R1 * R5 // saved register
1 : R1 = R2 + R3 // reg
2 : R4 = T1 // reg
No. Instruction 2 uses the value of R1 produced by instruction 1, so T1 = R1 * R5 cannot be computed before R1 = R2 + R3 executes (a true data dependence).
Summary
| Second (moved up) \ First (moved down) | reg | mem | br | join |
|---|---|---|---|---|
| reg | Extend live range of reg instruction | Extend live range of reg instruction | Extend live range of reg instruction | Add compensation code at entrance |
| mem | Not allowed | Not allowed | Not allowed | Add compensation code at entrance |
| br | Add compensation code at branch exit | Add compensation code at branch exit | Not allowed (changes control flow) | Not allowed (changes control flow) |
| join | Not allowed (can only be done in rare cases) | Not allowed (can only be done in rare cases) | Not allowed (changes control flow) | No effect |
The possible reorderings are:
- Move reg above any instruction (using saved registers).
  - Register state remains recoverable.
- Move reg or mem below br (by adding compensation code).
- Move reg or mem above join (by adding compensation code).
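The summary table can be encoded directly as a lookup. In this Python sketch, the key is (category of the instruction moved down, category of the instruction moved up) and the value is the scheduler's required action; the wording abbreviates the table above.

```python
# Encodes the 4x4 reordering table: RULES[(first_moved_down, second_moved_up)].
RULES = {
    ("reg", "reg"):  "extend live range",
    ("mem", "reg"):  "extend live range",
    ("br",  "reg"):  "extend live range",
    ("join", "reg"): "compensation code at entrance",
    ("reg", "mem"):  "not allowed",
    ("mem", "mem"):  "not allowed",
    ("br",  "mem"):  "not allowed",
    ("join", "mem"): "compensation code at entrance",
    ("reg", "br"):   "compensation code at branch exit",
    ("mem", "br"):   "compensation code at branch exit",
    ("br",  "br"):   "not allowed",
    ("join", "br"):  "not allowed",
    ("reg", "join"):  "not allowed",
    ("mem", "join"):  "not allowed",
    ("br",  "join"):  "not allowed",
    ("join", "join"): "no effect",
}

def can_reorder(first_moved_down, second_moved_up):
    """Action required to swap an adjacent pair, by category."""
    return RULES[(first_moved_down, second_moved_up)]

print(can_reorder("br", "reg"))   # moving reg above br: extend live range
print(can_reorder("reg", "mem"))  # moving mem above reg: not allowed
```

A scheduler built on this table would consult it for every candidate swap and emit the named fix-up (saved registers or compensation code) when the swap is legal.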
4.5.2 Implementing a Scheduling Algorithm
In this subsection, we consider a complete code-scheduling algorithm.
The algorithm has the following properties:
- Optimization granularity: superblock
- Register states are always recoverable.
  - Register live ranges are extended to keep backup values.
- Handling of precise traps:
  - Recover the original register state at the checkpoint.
  - Interpret the original source code from the checkpoint.
We will use the translations from IA-32 to PowerPC as an example.
a : add eax, ebx ; eax += ebx
b : bz L1 ; br L1 if ZF == 0
c : mov ebx, [eax + 4] ; ebx = mem[eax + 4]
d : mul ebx, 10 ; ebx *= 10
e : add ebx, 1 ; ebx += 1
f : add ecx, 1 ; ecx += 1
g : bz L2 ; br L2 if ZF == 0
h : add ebx, eax ; ebx += eax
i : br L3 ; br L3
Step 1: Translate to Single-Assignment Form
The original source code is translated to target code and placed in the code cache.
a : add eax, ebx ; eax += ebx
b : bz L1 ; br L1 if ZF == 0
c : mov ebx, [eax + 4] ; ebx = mem[eax + 4]
d : mul ebx, 10 ; ebx *= 10
e : add ebx, 1 ; ebx += 1
f : add ecx, 1 ; ecx += 1
g : bz L2 ; br L2 if ZF == 0
h : add ebx, eax ; ebx += eax
i : br L3 ; br L3
a : add r1, r1, r2
b : bz CR0, L1
c : lwz r2, 4(r1)
d : mul r2, r2, 10
e : add r2, r2, 1
f : add r3, r3, 1
g : bz CR0, L2
h : add r2, r1, r2
i : b L3
The translated target code is then:
- converted to SSA form, and
- placed in a scheduling buffer.
a : add r1, r1, r2
b : bz CR0, L1
c : lwz r2, 4(r1)
d : mul r2, r2, 10
e : add r2, r2, 1
f : add r3, r3, 1
g : bz CR0, L2
h : add r2, r1, r2
i : b L3
a : t5 = r1 + r2, set CR0
b : bz CR0, L1
c : t6 = mem[t5 + 4]
d : t7 = t6 * 10
e : t8 = t7 + 1
f : t9 = r3 + 1
g : bz CR0, L2
h : t10 = t8 + t5
i : b L3
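Step 1's renaming can be sketched for straight-line code as follows. This Python sketch uses an illustrative tuple encoding (destination, operand list, operation) and ignores branches, which is enough for a single-entry superblock.

```python
def to_ssa(block, counter=5):
    """Rename straight-line code: each destination gets a fresh t-name,
    and each use refers to the newest name for that source register."""
    current = {}   # source register -> current SSA name
    out = []
    for dst, srcs, op in block:
        renamed = [current.get(s, s) for s in srcs]
        name = f"t{counter}"
        counter += 1
        out.append((name, renamed, op))
        current[dst] = name
    return out, current

block = [
    ("r1", ["r1", "r2"], "+"),    # a: add r1, r1, r2
    ("r2", ["r1"],       "load"), # c: lwz r2, 4(r1)
    ("r2", ["r2"],       "*10"),  # d: mul r2, r2, 10
]
ssa, mapping = to_ssa(block)
print(ssa)      # [('t5', ['r1', 'r2'], '+'), ('t6', ['t5'], 'load'), ('t7', ['t6'], '*10')]
print(mapping)  # {'r1': 't5', 'r2': 't7'}
```

The final `current` dictionary is exactly the information the RegMAP of Step 2 records at each point.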
Step 2: Form Register Map
Register Map (RegMAP) maintains a mapping from the source registers to the single-assignment registers.
a : add eax, ebx
b : bz L1
c : mov ebx, [eax + 4]
d : mul ebx, 10
e : add ebx, 1
f : add ecx, 1
g : bz L2
h : add ebx, eax
i : br L3
a : t5 = r1 + r2, set CR0
b : bz CR0, L1
c : t6 = mem[t5 + 4]
d : t7 = t6 * 10
e : t8 = t7 + 1
f : t9 = r3 + 1
g : bz CR0, L2
h : t10 = t8 + t5
i : b L3
eax ebx ecx edx
a: t5 r2 r3 r4
b: t5 r2 r3 r4
c: t5 t6 r3 r4
d: t5 t7 r3 r4
e: t5 t8 r3 r4
f: t5 t8 t9 r4
g: t5 t8 t9 r4
h: t5 t10 t9 r4
i: t5 t10 t9 r4
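Step 2 can be sketched by replaying the block and recording, after each instruction, which SSA name holds each source register. The tuple encoding (label, destination source register or None, SSA name) is an illustrative assumption.

```python
def build_regmap(ssa_block, initial):
    """After each instruction, record which SSA name holds each source
    register; branches (dst is None) update nothing."""
    rows = []
    regmap = dict(initial)
    for label, dst, ssa_name in ssa_block:
        if dst is not None:
            regmap[dst] = ssa_name
        rows.append((label, dict(regmap)))
    return rows

initial = {"eax": "r1", "ebx": "r2", "ecx": "r3", "edx": "r4"}
block = [
    ("a", "eax", "t5"),   # t5 = r1 + r2
    ("b", None,  None),   # bz CR0, L1
    ("c", "ebx", "t6"),   # t6 = mem[t5 + 4]
    ("d", "ebx", "t7"),   # t7 = t6 * 10
    ("e", "ebx", "t8"),   # t8 = t7 + 1
    ("f", "ecx", "t9"),   # t9 = r3 + 1
]
for label, m in build_regmap(block, initial):
    print(label, m["eax"], m["ebx"], m["ecx"], m["edx"])
# a t5 r2 r3 r4 ... f t5 t8 t9 r4, matching the table above
```

A snapshot of this dictionary at a checkpoint is all that is needed to restore the source register state there.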
Step 3: Reorder Code
a : t5 = r1 + r2, set CR0
b : bz CR0, L1
c : t6 = mem[t5 + 4]
d : t7 = t6 * 10
e : t8 = t7 + 1
f : t9 = r3 + 1
g : bz CR0, L2
h : t10 = t8 + t5
i : b L3
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
i : b L3
L2: t8 = t7 + 1
eax ebx ecx edx
a: t5 r2 r3 r4
c: t5 t6 r3 r4
b: t5 r2 r3 r4
d: t5 t7 r3 r4
f: t5 t8 t9 r4
g: t5 t8 t9 r4
e: t5 t8 r3 r4
h: t5 t10 t9 r4
i: t5 t10 t9 r4
The following instructions are reordered:
- Load instruction c is moved up because of its latency.
- Add instruction e is moved down because its operand t7 takes time to compute.
Step 4: Determine Checkpoints
Checkpoints are places where all instructions in the original code have been completed up to that point.
At checkpoints, Guest state == Host state.
Intuitively, checkpoints are similar to save points in game.
We can restart execution from checkpoints just as we can restart the game from save points.
To find checkpoints, we consider which instructions have been committed (executed) in the original source order.
The latest committed instruction in that order is the checkpoint.
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
i : b L3
eax ebx ecx edx
a: t5 r2 r3 r4
c: t5 t6 r3 r4
b: t5 r2 r3 r4
d: t5 t7 r3 r4
f: t5 t8 t9 r4
g: t5 t8 t9 r4
e: t5 t8 r3 r4
h: t5 t10 t9 r4
i: t5 t10 t9 r4
Newly committed source instructions as each scheduled instruction completes: a commits a; b commits b, c; d commits d; e commits e, f, g; h commits h; i commits i.

Checkpoint in effect at each scheduled instruction (a, c, b, d, f, g, e, h, i, respectively):
@ a a c d d d g h

NOTE: @ means the initial register mapping.
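Step 4 can be sketched as a small function: walk the scheduled order, track the set of committed source instructions, and report the latest instruction of the longest committed prefix of the source order (or @ when nothing has committed yet). On the example above it reproduces the checkpoint sequence.

```python
def checkpoints(source_order, scheduled_order):
    """For each scheduled instruction, return the checkpoint in effect
    when it executes: the last instruction of the longest prefix of the
    source order that has already committed ('@' if none has)."""
    committed = set()
    result = []
    for ins in scheduled_order:
        k = 0
        while k < len(source_order) and source_order[k] in committed:
            k += 1
        result.append(source_order[k - 1] if k else "@")
        committed.add(ins)
    return result

source = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
scheduled = ["a", "c", "b", "d", "f", "g", "e", "h", "i"]
print(checkpoints(source, scheduled))
# ['@', 'a', 'a', 'c', 'd', 'd', 'd', 'g', 'h']
```

This is the information the side table must record so that the trap handler can pick the right RegMAP snapshot.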
Step 5: Assign Registers
Register assignment is done as follows:
- Calculate register live ranges.
- Assign target registers to SSA variables using the live ranges.

For precise state recovery, we need to extend live ranges:
- Checkpoints are save points; we must be able to go back to the checkpoint.
- To do so, we must keep the register values that were live at the checkpoint.
- In other words, we must extend register live ranges.
For instructions c and b, we need to extend register live ranges at checkpoint a.
(x marks extended live ranges.)
r1 r2 r3 r4 t5 t6
| | | | |
x | | | |
x | | |
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
@
a
a
eax ebx ecx edx
a: t5 r2 r3 r4
For instruction d, we need to extend register live ranges at checkpoint c.
r1 r2 r3 r4 t5 t6 t7
| | | | |
x | | | |
x | | | |
| | | | |
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
@
a
a
c
eax ebx ecx edx
c: t5 t6 r3 r4
For instruction e, we need to extend register live ranges at checkpoint d.
r1 r2 r3 r4 t5 t6 t7 t8 t9
| | | | |
x | | | |
x | | | |
| | | | |
| | | | |
x | | |
x | | | |
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
@
a
a
c
d
d
d
eax ebx ecx edx
d: t5 t7 r3 r4
For instruction h, we need to extend register live ranges at checkpoint g.
r1 r2 r3 r4 t5 t6 t7 t8 t9 t10
| | | | |
x | | | |
x | | | |
| | | | |
| | | | |
x | | | |
x | | | | |
| | | | |
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
@
a
a
c
d
d
d
g
eax ebx ecx edx
g: t5 t8 t9 r4
Answer register live ranges
For instruction i, we need to extend register live ranges at checkpoint h.
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
i : b L3
@
a
a
c
d
d
d
g
h
eax ebx ecx edx
h: t5 t10 t9 r4
r1 r2 r3 r4 t5 t6 t7 t8 t9 t10
| | | | |
x | | | |
x | | | |
| | | | |
| | | | |
x | | | |
x | | | | |
| | | | |
| | | |
After calculation of register live ranges, we assign target registers to SSA variables.
Here, we need to note the following:
- Each target register is assigned to SSA variables whose live ranges do not overlap:
  - r1 is assigned r1 (before a) and t5 (after a).
  - r2 is assigned r2, t7 (before e), t8 (after e), and t10.
  - r3 is assigned r3.
  - r4 is assigned r4.
  - r5 is assigned t6 and t9.
r1 r2 r3 r4 t5 t6 t7 t8 t9 t10
| | | | |
x | | | |
x | | | |
| | | | |
| | | | |
x | | | |
x | | | | |
| | | | |
| | | |
a : r1 = r1 + r2,set CR0
c : r5 = mem[r1 + 4]
b : bz CR0, L1
d : r2 = r5 * 10
f : r5 = r3 + 1,set CR0
g : bz CR0, L2
e : r2 = r2 + 1
h : r2 = r2 + r1
i : b L3
eax ebx ecx edx
r1 r2 r3 r4
r1 r5 r3 r4
r1 r2 r3 r4
r1 r2 r3 r4
r1 r2 r5 r4
r1 r2 r5 r4
r1 r2 r3 r4
r1 r2 r5 r4
r1 r2 r5 r4
Step 6: Add Compensation Code + Register Copy
Compensation code and register copies for consistent register mapping are added:
- Compensation code ensures that moved instructions are executed under both branch-taken and not-taken paths.
- Register copies ensure a consistent register mapping across all superblocks (or basic blocks).
  - We need the register copy r3 = r5 because ecx is mapped to r3 at block entry.
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
i : b L3
L2: t8 = t7 + 1
a : r1 = r1 + r2,set CR0
c : r5 = mem[r1 + 4]
b : bz CR0, L1
d : r2 = r5 * 10
f : r5 = r3 + 1,set CR0
g : bz CR0, L2'
e : r2 = r2 + 1
h : r2 = r2 + r1
i : b L3
r3 = r5
L2': r2 = r2 + 1
r3 = r5
Finally, we form target code from intermediate form.
a: add r1, r1, r2
c: lwz r5, 4(r1)
b: beq CR0, L1
d: muli r2, r5, 10
f: addic. r5, r3, 1
g: beq CR0, L2'
e: addi r2, r2, 1
h: add r2, r2, r1
i: b L3
mr r3, r5
L2': addi r2, r2, 1
mr r3, r5
b L2
Precise State Recovery
We now check whether this algorithm guarantees compatibility by examining precise state recovery at a trap.
If a trap occurs:
- Find the trapping superblock and the corresponding source basic blocks, using the reverse translation side table (Section 3.6.3).
- Find the checkpoint for the trapping instruction.
- Reconstruct the RegMAP at the checkpoint.
- Interpret the source instructions following the checkpoint.
Instruction d
traps
a : add eax, ebx
b : bz L1
c : mov ebx,[eax + 4]
d : mul ebx, 10
a: add r1, r1, r2
c: lwz r5, 4(r1)
b: beq CR0, L1
d: muli r2, r5, 10
@
a
a
c
eax ebx ecx edx
a: r1 r2 r3 r4
c: r1 r5 r3 r4
b: r1 r2 r3 r4
d: r1 r2 r3 r4
In this example, RegMAP at c
is used for recovery.
Instruction c
traps when b
is taken.
a : add eax, ebx
b : bz L1
c : mov ebx,[eax + 4]
a: add r1, r1, r2
c: lwz r5, 4(r1)
b: beq CR0, L1
@
a
a
eax ebx ecx edx
a: r1 r2 r3 r4
c: r1 r5 r3 r4
b: r1 r2 r3 r4
In target code execution, c traps, and the register state at checkpoint a is reconstructed.
In source code interpretation, c is then not executed (the branch at b is taken first).

There are other methods for precise state recovery:
- Use only binary translation
  - The runtime keeps both the translated code before optimization and the optimized code.
  - If a trap occurs, it reconstructs the RegMAP and runs the unoptimized translated code.
- Insert repair code
  - If a trap occurs, repair code is executed.

Source
1: r1 = r2 + r3
2: r9 = r1 + r5 // reg
3: r6 = r1 * r7 // trap occurs

Target
1: R1 = R2 + R3
3: R6 = R1 * R7 // trap occurs
2: R9 = R1 + R5 // reg

Repair code
// if R6 = R1 * R7 traps:
2': R9 = R1 + R5
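The repair-code method can be sketched as a side table consulted by the runtime's trap handler. The table contents, the toy expression evaluator, and the instruction labels are illustrative assumptions.

```python
repair_table = {
    "3": ["r9 = r1 + r5"],   # if optimized instruction 3 traps, redo bypassed 2
}

def on_trap(trapping_label, regs):
    """Apply the repair actions for the trapping instruction, producing
    the register state the source-order trap handler expects."""
    for action in repair_table.get(trapping_label, []):
        dst, expr = action.split(" = ")
        regs[dst] = eval(expr, {}, dict(regs))  # toy evaluator for the sketch
    return regs

regs = {"r1": 5, "r5": 2, "r7": 3}
print(on_trap("3", regs)["r9"])  # 7
```

Compared with checkpoint-based recovery, repair code trades side-table space and handler work for the freedom to keep the aggressive schedule.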
Condition Code Handling
For handling condition codes, an efficient method is lazy evaluation.
We must also take condition codes into account when reordering: for precise state recovery, we must be able to recover the condition codes.
Condition-code recovery requires the following setup:
- Determine condition-code checkpoints.
  - We only need to consider condition-code-setting instructions.
- Calculate register live ranges.
- Assign target registers.
Step 5a: Assign Register with Condition Codes
For instruction b, we need to be able to re-execute r1 + r2, set CR0 from checkpoint a.
To do so, we extend the live ranges of r1 and r2 to just before a (denoted @).
(y marks live ranges extended for condition-code recovery.)
r1 r2 r3 r4 t5 t6
| | | | |
y x | | | |
y x | | | |
| | | |
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
Checkpoint
@
@
@
@
If d traps:
- Reconstruct the RegMAP at c.
- Reconstruct the condition codes at @ (by re-executing r1 + r2, set CR0).
- Interpret the source code from the instruction after checkpoint c.
4.5.3 Superblocks Versus Traces
Most dynamic translation systems use superblocks rather than traces.
In this subsection, we discuss why it is so.
| | Superblock | Trace |
|---|---|---|
| Entry | Single | Multiple |
| Exit | Multiple | Multiple |
| Join point location | At top | In the middle |
| Code duplication | Tail duplication | None |
Instruction Caching
Superblocks have disadvantages for instruction caching:
- Superblocks contain replicated code created by tail duplication, which
  - increases memory size, and
  - decreases instruction cache efficiency.

To mitigate this, we convert only heavily used regions of code to superblocks.
Branch Predictors
Superblocks have both advantages and disadvantages for branch predictors:
- Cons: more entries are needed in the branch predictor.
  - Ex) There are three branches in the example.
- Pros: path information is embedded in the superblock.
  - Ex) A-D-E-G goes to the loop head; F-G exits the loop.
Optimizations
Superblocks have advantages for optimization:
- Join points in traces impose restrictions.
  - Ex) We cannot move reg/mem below a join.
Superblocks/Traces Formation Process
Join points in traces also complicate the formation process:
- During trace formation, we must check for and keep track of join points.
  - We must check the branch target of C when discovering it.
  - We must remember that G is a join point.
- During superblock formation, we do not have to check join points.
  - We do not have to check the branch target of C.
Appendix
Static Single Assignment (SSA) Form
SSA is a property of an intermediate representation (IR).
SSA form has the following properties:
- Each variable is assigned exactly once.
- Each variable is defined before it is used.
Benefits
Use-def chains are easily identified.

Example code sequence in C:
x = 100; // def x (later overwritten)
x = 200; // def x
y = x; // use x
x1 = 100; // def x1
x2 = 200; // def x2
y1 = x2; // use x2
Static vs. Dynamic
- Static single assignment form
  - One variable per assignment statement (statically)
- Dynamic single assignment form
  - One variable per executed instruction (dynamically)
1 : for (i = 0; i < 100; i++) {
2 : x = i;
3 : y += x;
4 : }
1 : for (i = 0; i < 100; i++) {
2 : x1 = i;
3 : y1 += x1;
4 : }
1 : for (i = 0; i < 100; i++) {
2 : x_i = i;
3 : y_i += x_i;
4 : }
cf. Wikipedia, "Static single-assignment form"