Review + Overview

Review

Chapter 4, Dynamic Binary Optimization, covers the following topics:

  • Typical program behaviour for optimization heuristics (4.1)
  • Profiling techniques to collect profile information (4.2)
  • Forming large translation/optimization blocks using profiles (4.3)
  • Overall view of dynamic optimization (4.4)
  • Reordering optimization (4.5)
  • ...

Overview of Sections 4.4 and 4.5

4.4 Optimization Framework: overall view of dynamic optimization
  4.4.1 Approach: overall approach to dynamic optimization
  4.4.2 Optimization and Compatibility: compatibility issues
  4.4.3 Consistent Register Mapping: register mapping issues
4.5 Code Reordering: reordering optimization
  4.5.1 Primitive Instruction Reordering: simple swap reordering
  4.5.2 Implementing a Scheduling Algorithm: complete reordering algorithm
  4.5.3 Superblocks Versus Traces: granularity of dynamic optimization

4.4 Optimization Framework

In this section, we discuss the overall issues common to all dynamic optimization methods.

First, we compare dynamic optimization with static optimization:

                        Static optimization                Dynamic optimization
                        (cf. Dragon Book)                  (in this section)
Timing                  At compilation                     At run time
                        (before execution)                 (during or after execution)
Analyzed program        High-level language, IR,           Binary (bytecode)
                        (binary)
Analysis granularity    Function (intra-procedural),       Trace, superblock,
                        whole program (inter-procedural)   (dynamic basic block)
Available information   High-level semantics               Profile information
                        (e.g., data structure              (e.g., frequently executed
                        declarations)                      paths)
Desired property        Slow but thorough optimization     Fast, low-overhead methods
                                                           (because optimization is
                                                           performed at run time)

4.4.1 Approach

In this subsection, we discuss the overall approach to dynamic optimization.

Overall Approach

  1. Collect basic blocks from code cache using profile information
  2. Convert to intermediate (SSA) form; place in buffer
  3. Schedule and optimize
    • Generate side table information (for precise state recovery)
  4. Generate target code
  5. Add compensation code; place in code cache

In brief, the overall approach is:

  1. Form superblock
  2. Convert to SSA
  3. Optimize!

Optimization of Traces or Superblocks

  • Trace: multiple entries, multiple exits
    • During optimization, compensation code is inserted at both entry and exit points.
  • Superblock: single entry, multiple exits
    • During optimization, compensation code is inserted only at exit points.

Compensation code makes up for instructions that are moved or removed by optimization, so that the correct state is still seen at entry and exit points. We discuss compensation code further in Section 4.5.1, Primitive Instruction Reordering.

4.4.2 Optimization and Compatibility

As discussed in Section 3.2, we must maintain compatibility even after optimization.

Review: Compatibility for process VM

We must maintain an isomorphic relationship between the guest VM and the host platform.
To achieve this, we focus on all control transfer points between guest and host:

  • system call
  • return from system call
  • trap (the main concern for optimization)
  • interrupt

At these points, we must be able to form the precise state:

  • same register state
  • same memory state

Trap compatibility

A process VM is trap compatible if:

  • any trap that occurs during emulation would also occur during native execution.
  • any trap that would occur during native execution also occurs during emulation.

Note: page faults are an exception (see Chapter 7).

Briefly: trap timing in the guest == trap timing in the host.

Trap compatibility must be maintained before and after optimization.

Remove redundant instructions
Source
r4 = r6 + 1
r1 = r2 + r3  // redundant but may trap
r1 = r4 + r5
r6 = r1 + r7
Target
R4 = R6 + 1
// R1 = R2 + R3 // removed; no trap occurs
R1 = R4 + R5
R6 = R1 + R7

In this example, the optimization violates trap compatibility because:

  • In the source, r1 = r2 + r3 may trap.
  • In the target, no trap occurs because r1 = r2 + r3 has been removed.

The converse condition for trap compatibility can still be maintained, as the next example shows.

Handling of divide-by-zero
Source (divide-by-zero does not trap)
r1 = 0
r3 = r2 / r1  // no trap
Target (divide-by-zero traps)
R1 = 0
R3 = R2 / R1  // divide-by-zero trap

In this example, the runtime can maintain trap compatibility because:

  • In the source, r3 = r2 / r1 does not trap.
  • In the target, a divide-by-zero trap occurs, but the runtime can simply ignore it and resume.
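As a concrete illustration of this filtering, here is a minimal C sketch (my own illustration, not the book's mechanism) of the decision a runtime might make when a trap arrives from translated code: it consults a hypothetical side table that records whether the corresponding source instruction could trap at all, and either delivers the trap to the guest or suppresses it. All names and addresses are assumptions for the example.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical side-table entry: for each translated instruction, remember
 * the source instruction it came from and whether that source instruction
 * can trap on the source ISA. */
typedef struct {
    unsigned target_pc;       /* address of the translated instruction      */
    unsigned source_pc;       /* address of the corresponding source instr. */
    bool     source_can_trap;
} SideEntry;

/* Tiny example table: the translated divide at 0x2008 corresponds to a
 * source divide at 0x1004 that never traps on the source ISA. */
static const SideEntry side_table[] = {
    { 0x2000, 0x1000, false },   /* R1 = 0       */
    { 0x2008, 0x1004, false },   /* R3 = R2 / R1 */
};

/* Called by the runtime's trap handler.  Returns true if the trap must be
 * reported to the guest, false if it is an artifact of emulation and can
 * simply be ignored. */
static bool deliver_trap_to_guest(unsigned target_pc)
{
    for (unsigned i = 0; i < sizeof side_table / sizeof side_table[0]; i++) {
        if (side_table[i].target_pc == target_pc)
            return side_table[i].source_can_trap;
    }
    return true;                 /* unknown instruction: be conservative */
}

int main(void)
{
    printf("trap at 0x2008 -> %s\n",
           deliver_trap_to_guest(0x2008) ? "deliver to guest" : "ignore");
    return 0;
}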

Memory and Register Compatibility at Trap Instructions

Memory and register state compatibility are maintained if:

  • the runtime can reconstruct the same memory state as the state on guest VM.
  • the runtime can reconstruct the same register state as the state on guest VM.

Briefly, memory/register state in guest == memory/register state in host.

Memory and register state compatibility must be maintained even after optimization.

Code reordering optimization may violate register state compatibility.
Source
1: r1 = r2 + r3
2: r9 = r1 + r5 // trap occurs
3: r6 = r1 * r7 // time consuming
4: r3 = r6 + 1
Target (not allowed)
1: R1 = R2 + R3
3: R6 = R1 * R7 // reordered
2: R9 = R1 + R5 // trap occurs
4: R3 = R6 + 1

It might be beneficial to move r6 = r1 * r7 earlier because the multiply takes several cycles.
But reordering lines 2 and 3 violates register state compatibility, because when r9 = r1 + r5 traps:

  • in the source, r6 has not been updated yet.
  • in the target, R6 has already been updated.
Target with saved register
1 : R1 = R2 + R3
3': S1 = R1 * R7  // S1 holds the computation result
2 : R9 = R1 + R5  // trap occurs
3 : R6 = S1       // R6 is updated in the original order
4 : R3 = S1 + 1

To maintain register state compatibility, the saved register S1 holds the multiply result.
This optimized code updates registers in the same order as the original source.

Summary

After optimization, we must maintain compatibility:

Trap compatibility
Trap timing in guest == Trap timing in host
Memory/Register compatibility
Memory/Register state in guest == Memory/Register state in host
Exercise: answer whether each of the following optimizations maintains compatibility.
Before optimization
1 : mem[r1] = r2
2 : r1 = r2 * r3
3 : ...
After optimization
2': S1 = r2 * r3
1 : mem[r1] = r2
2 : r1 = S1
3 : ...
Yes. The multiply result goes to the saved register S1, so r1 is not updated early; if a trap occurs, the original register state can still be recovered (reg moved above mem).
Before optimization
1 : r1 = r2 + r3
2 : if false br exit
3 : r4 = mem[r1]
4 : ...
After optimization
1 : r1 = r2 + r3
// if false br exit
3 : r4 = mem[r1];
4 : ...
Yes. The branch is never taken (its condition is false), so removing it does not change control flow, and branch instructions do not trap.

4.4.3 Consistent Register Mapping

Register mapping was discussed in Sections 2.5, 2.8.1, and 3.3.1.
In optimization, consistent register mapping is important for performance.

There are three cases for register mapping:

  • source registers << target registers
    • Each source register is permanently mapped to a corresponding target register.
    • Method: fixed register mapping
  • source registers ≈ target registers
  • source registers > target registers
    • Not all source registers can be permanently mapped to target registers.
    • Two types of mapping methods:
      1. Inconsistent Register Mapping
      2. Consistent Register Mapping

Inconsistent Register Mapping

Inconsistent Register Mapping has the following properties:

  • Source registers are mapped on a per-translation-block basis.
    • Different translation blocks may use different register mappings.
    • Ex) While R1 is mapped to r1 in block A, it is mapped to r2 in block B.
  • Extra register copies are required for consistency between blocks.
Inconsistent Register Mapping

Source

Block A
R1 = 100
br C
Block B
R1 = 200
if cond br C
Block C
C:
  R5 = R4 + R1
  ...

Target with inconsistent register mapping

Block A
// R1 <-> r1
r1 = 100
br C
Block B
// R1 <-> r2
r2 = 200
if cond, br C
Block C
// R1 <-> r3
C:
  r5 = r4 + r3
  ...

This target code has a problem with the mapping of R1 because:

  • r1 holds R1 in block A,
  • r2 holds R1 in block B,
  • r3 is expected to hold R1 in block C,

so block C reads r3, which neither block A nor block B has written.

Target with extra register copy

Block A
// R1 <-> r1
r1 = 100;
br C'
C': r3 = r1 // R1 <-> r3
    br C
Block B
// R1 <-> r2
r2 = 200
if cond, br C''
C'': r3 = r2 // R1 <-> r3
     if cond, br C
Block C
// R1 <-> r3
C:
  r5 = r4 + r3
  ...

If translated code jumps back to the interpreter, the runtime must also copy the target register values into the register context block.
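The following is a minimal C sketch (an assumption of mine, not the book's code) of what that copy-out might look like: the block's current mapping says which target register holds each source register, and before control leaves the translated code the values are written back into a register context block that the interpreter reads. All names are hypothetical.

#include <stdint.h>
#include <stdio.h>

#define NUM_SRC_REGS 4          /* source registers R1..R4 (illustrative) */
#define NUM_TGT_REGS 8          /* target registers r1..r8 (illustrative) */

/* Register context block: the interpreter's view of the source registers. */
typedef struct {
    uint32_t reg[NUM_SRC_REGS];
} ContextBlock;

/* Per-block mapping: for each source register, which target register holds
 * it at the exit of this translation block (here, block B of the example,
 * where R1 lives in r2).  Values are indices into tgt_regs[]. */
static const int block_B_exit_map[NUM_SRC_REGS] = { 2 /* R1 -> r2 */, 3, 4, 5 };

static void copy_out(const int map[NUM_SRC_REGS],
                     const uint32_t tgt_regs[NUM_TGT_REGS],
                     ContextBlock *ctx)
{
    for (int s = 0; s < NUM_SRC_REGS; s++)
        ctx->reg[s] = tgt_regs[map[s]];
}

int main(void)
{
    /* tgt_regs[n] models target register rn; r2 currently holds 200. */
    uint32_t tgt_regs[NUM_TGT_REGS] = { 0, 0, 200, 7, 9, 3, 0, 0 };
    ContextBlock ctx = { { 0 } };

    copy_out(block_B_exit_map, tgt_regs, &ctx);   /* before jumping to the interpreter */
    printf("R1 in context block = %u\n", ctx.reg[0]);   /* prints 200 */
    return 0;
}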

Consistent Register Mapping

Consistent Register Mapping has the following properties:

  • Some source registers are mapped to the same target registers in all translation blocks.
    • Ex) R1 is mapped to r1 in all translation blocks.
  • The other source registers are mapped to a memory region.
    • Ex) R10 is kept in the register context block.
  • Memory accesses may therefore occur.
Simple Fixed Register Mapping

Source

R1 = 100
R10 = 200
br F
R1 = 300
R10 = 400
br F
F:
  R9 = R10 + R1
  ...

Target with consistent register mapping

// R1 <-> r1
// R10 <-> reg[10]
r1 = 100
reg[10] = 200
br F
// R1 <-> r1
// R10 <-> reg[10]
r1 = 300
reg[10] = 400
br F
// R9 <-> reg[9]
F:
  reg[9] = reg[10] + r1
  ...

This example shows that memory accesses (reg[9], reg[10]) occur.

Consistent but not fixed mapping methods are discussed in Section 4.6.3.

4.5 Code Reordering

Code reordering is an important optimization because:

  • In simple pipelined microarchitectures, instructions are executed strictly in program order.
    • Code reordering improves pipeline performance.
  • The source binary is generally not scheduled for the target processor.
    • VLIW processors, for example, rely on the compiler to reorder instructions and group independent operations.
  • Even in dynamic out-of-order superscalar processors:
    • Code reordering affects how well functional unit latencies (especially cache memory latencies) are tolerated.

Because of code reordering's importance, we consider it before looking at other optimizations.
We first discuss several important issues about code reordering.

4.5.1 Primitive Instruction Reordering

In this subsection, we consider reordering pairs of adjacent instructions (swapping one with another).
We proceed in the following steps:

  1. Categorize the instructions into four categories.
    • reg: Register updates
    • mem: Memory updates
    • br: Branch instructions
    • join: Join points
  2. Discuss pairwise reordering (4 × 4 = 16 cases).
    • br ⇆ another
    • join ⇆ another
    • remaining pairs

Four instruction categories

  • reg: Register updates
    • Description: instructions that produce a register result
    • Property: can be "undone" (using saved registers, saving into memory)
      • Register access is fast.
    • Examples:
      • mov rax, [rbx] // rax = mem[rbx]
      • add r1, r2, r3 // r1 = r2 + r3
  • mem: Memory updates
    • Description: instructions that place a value in a memory
    • Property: cannot be "undone" easily (because backing up memory values costs both time and space)
      • Memory access is slow.
    • Examples:
      • mov [rax], rbx // mem[rax] = rbx
      • stw r1, 4(r2) // mem[r2 + 4] = r1
  • br: Branch instructions
    • Property: do not trap
    • Example:
      bz label // if ZF == 0 then br label
  • join: Join points
    • Description: the points where a jump or branch target enters the code sequence
    • Property: joins are not instructions, but they affect reordering
    • Example:
        br label
        ...
      label: // join points
        ...

In addition, volatile memory accesses are categorized as mem.

  • Access to volatile memory locations
    • A volatile memory location may be accessed by an entity other than the process itself.
      • Memory shared with another process or thread
      • Memory-mapped I/O accessed by the I/O system
    • Property: often cannot be reordered at all:
      • An access to shared memory cannot be moved above the lock (mutex) acquisition that protects it.
      • An access to memory-mapped I/O may trigger some I/O device function.
    • Category: mem (because the state cannot be recovered easily)

In some VM applications, the emulation system must optimize conservatively because it does not know whether a given memory access is volatile.

Four instruction categories
      R1 = mem[R6]        // reg (mem if mem[R6] is volatile)
      R2 = mem[R6 + 4]    // reg (mem if mem[R6 + 4] is volatile)
      R3 = R1 + 1         // reg
      R4 = R1 << 2        // reg
      br exit if R7 == 0  // br
      R7 = R7 + 1         // reg
      mem[R6] = R3        // mem
exit:                     // join

We will now discuss the reordering of two adjacent instructions.
There are two keys to understanding which reorderings are allowed:

  • Are memory and register states recoverable at trap instructions?
    • If so, the reordering is allowed.
    • If not, the reordering is not allowed.
  • Does the reordering change control flow?
    • If so, the reordering is not allowed.
    • If not, the reordering is allowed.

br ⇆ another

Move reg or mem below br

We can move reg or mem below br because:

  • The instruction originally executes on both the taken and not-taken paths; compensation code on the taken path makes up for the move.
  • br does not trap.
  • Control flow is not changed.
Move reg below br

Before reordering

1 : R3 = R1 + 1
2 : R4 = R1 << 2        // reg
3 : br exit if R7 == 0  // br
4 : R7 = R7 + 1
100 : exit:
101 :   ...

After reordering

1 : R3 = R1 + 1
3 : br 2' if R7 == 0    // br
2 : R4 = R1 << 2        // reg
4 : R7 = R7 + 1
Compensation Code
2': R4 = R1 << 2
  : br exit
100 : exit:
101 :   ...

Compensation code ensures that the state on the taken path is the same as in the original code.

Move reg above br

We can move reg above br because:

  • The moved instruction writes its result to a saved register, so the original value of its destination register is preserved until the original position.
  • Control flow is not changed.
Move reg above br
Before reordering
1 : R2 = R1 << 2
2 : br exit if R8 == 0  // br
3 : R6 = R7 * R2        // reg
4 : mem[R6] = R3
5 : R6 = R2 + 2
After reordering
1 : R2 = R1 << 2
3': T1 = R7 * R2        // saved register
2 : br exit if R8 == 0  // br
3 : R6 = T1             // reg
4 : mem[R6] = R3
5 : R6 = R2 + 2

There are two reasons why we insert the saved register T1:

  • The old value of R6 must remain live (available) at the label exit.
  • The old value of R6 must remain live until the original position of instruction 3 (between instructions 2 and 4).
    • Live means that the original value is still held.

If the new value is never needed under the name R6 (here it is only used by instruction 4, and R6 is redefined at instruction 5), we can omit the explicit register copy and use T1 directly.
The important point is that the correct values stay available: R6 holds the old value and T1 holds the new value.

Omit register copy
1 : R2 = R1 << 2
3': T1 = R7 * R2        // saved register
2 : br exit if R8 == 0  // br
// 3: R6 = T1
4 : mem[T1] = R3
5 : R6 = R2 + 2
Move mem above br

We cannot move mem above br because:

  • The moved instruction is executed both under branch-taken and not-taken.
    • Originally, the instruction is executed only under branch-not-taken.
  • mem cannot be "undone". (or need extra instructions)
Move mem above br

Before reordering

1 : br exit if R8 == 0  // br
2 : mem[R1] = R2        // mem
3 : ...
4 : exit:
5 :   ...

After reordering

2 : mem[R1] = R2        // mem
1 : br exit if R8 == 0  // br
3 : ...
4 : exit:
// cannot recover mem[R1]
5 :   ...

This example shows that the original value of mem[R1] cannot be recovered on the taken path.

Memory compatible but slow reordering

2': T1 = mem[R1]        // extra load
2 : mem[R1] = R2        // mem
1 : br 2'' if R8 == 0   // br
3 : ...
2'':   mem[R1] = T1  // recover
   :   br exit
4 : exit:
5 :   ...

This reordering preserves memory compatibility, but it adds extra memory accesses.
For this reason, from here on we assume that mem updates cannot be recovered ("undone").

Move br above/below join

We cannot move br above/below join because:

  • It changes control flow.
Move br above join
Before reordering
1 : ...
2 : br label
3 : label:                // join
4 :   br exit if R1 == 0  // br
After reordering
1 : ...
2 : br label
4 :   br exit if R1 == 0  // br
3 : label:                // join

This example shows that moving br around join changes control flow.

Swap br and br

We cannot swap br and br because:

  • It changes control flow.
Before optimization
1 : ...
2 : br label1  // br
3 : br label2  // br
After optimization
1 : ...
3 : br label2  // br
2 : br label1  // br

Obviously control flow is changed.

Answer whether this optimization is allowed and why.
Before reordering
1 : if cond br label
2 : r1 = mem[r2]
After reordering
2 : r1 = mem[r2]
1 : if cond br label
No, not as written: if the branch is taken, r1 must keep its old value. A saved register is needed.
2': t1 = mem[r2]
1 : if cond br label
2 : r1 = t1
Before reordering
1 : mem[R1] = R2
2 : br label
3 : label:
4 :   ...
After reordering
2 : br label
1 : mem[R1] = R2
3 : label:
4 :   ...
No, not as written: the store would never execute, because the unconditional branch skips it. Compensation code is needed.
2 : br label'
1 : mem[R1] = R2
Compensation code
5 : label':
6 :   mem[R1] = R2
7 :   br label

join ⇆ another

Move reg/mem above join

We can move reg or mem above join because:

  • With compensation code at the entrance, the moved instruction is executed both on the fall-through path and on the join-from-outside path.
  • Control flow is not changed.
Move reg above join
Before reordering
1 : ...
2 : br label
  :   ...
3 : label:          // join
4 :   R7 = mem[R6]  // reg
5 :   ...
After reordering
1 : ...
2 : br 4'
4': R7 = mem[R6]   // compensation
  : br label
  :   ...
4 :   R7 = mem[R6]  // moved above
3 : label:          // join
5 :   ...

This reordering by itself accomplishes nothing, but it enables further optimizations.

Move reg or mem below join

We cannot move reg or mem below join because:

  • After the move, the instruction executes on both the fall-through path and the join-from-outside path.
    • Originally, it executes only on the fall-through path.
Move reg below join
Before reordering
1 : R1 = 100
2 : br label
3 :   R1 = 200      // reg
4 : label:          // join
5 :   ...
After reordering
1 : R1 = 100
2 : br label
4 : label:          // join
3 :   R1 = 200      // reg
5 :   ...

This example shows that, after the move, the join-from-outside path (2 → 4 → 3) also executes R1 = 200 and changes the value of R1.

Swap join points

We can swap join and join because:

  • Swapping two join points has no effect.
    • A join is a point, not an instruction.
Swap join points
Before reordering
1 : R1 = 100
2 : br label1
3 : R2 = 200
4 : br label2
5 : label1:     // join
6 : label2:     // join
7 :   R3 = R1 + R2
After reordering
1 : R1 = 100
2 : br label1
3 : R2 = 200
4 : br label2
5 : label2:     // join
6 : label1:     // join
7 :   R3 = R1 + R2

This reordering clearly has no effect.

Answer whether this optimization is allowed and why.
Before reordering
1 : ...
2 : br label
3 : label:           // join
4 :   mem[R1] = R2   // mem
5 :   ...
After reordering
1 : ...
2 : br label'
4': label'
  :   mem[R1] = R2
  :   br label
4 :   mem[R1] = R2   // mem
3 : label:           // join
5 :   ...
Yes. Compensation code ensures that mem[R1] = R2 is executed on both the fall-through path and the join-from-outside path.
Before reordering
  : ...
1 : mem[R1] = 100
2 : br label
  :   ...
4 :   mem[R1] = 200  // mem
5 : label:           // join
After reordering
  : ...
1 : mem[R1] = 100
2 : br label
  :   ...
5 : label:           // join
4 :   mem[R1] = 200  // mem
No. On the join-from-outside path, mem[R1] would now be overwritten with 200, whereas originally it kept the value 100.

Straight-line reordering (reg/mem ⇆ reg/mem)

Move reg above reg/mem

We can move reg above reg/mem because:

  • reg can be "undone" using a saved register.
  • Control flow is not changed.
Move reg above mem
Before reordering
1 : R1 = R1 * 3
2 : mem[R6] = R1  // mem
3 : R7 = R7 << 3  // reg
4 : R9 = R7 + R2
After reordering
1 : R1 = R1 * 3
3': T1 = R7 << 3  // saved reg
2 : mem[R6] = R1  // mem
3 : R7 = T1       // reg
4 : R9 = T1 + R2

If mem[R6] = R1 traps, R7 still holds the same value as in the source.

Move mem above reg/mem

We cannot move mem above reg/mem because:

  • mem cannot be "undone".
Move mem above reg
Before reordering
1 : R1 = R2 + R3    // reg
2 : mem[R4] = 100   // mem
3 : ...
After reordering
2 : mem[R4] = 100   // mem
1 : R1 = R2 + R3    // reg
3 : ...

As assumed above, mem updates cannot be recovered ("undone").
If 1: R1 = R2 + R3 traps, we cannot recover the original value of mem[R4].

Answer whether this optimization is allowed and why.
Before reordering
1 : R1 = R2 + R3  // reg
2 : R4 = R1 * R5  // reg
After reordering
2': T1 = R1 * R5  // saved register
1 : R1 = R2 + R3  // reg
2 : R4 = T1       // reg
No. Instruction 2 uses the value of R1 produced by instruction 1; the moved instruction 2' would read the old R1, violating the data dependence.

Summary

The table summarizes all pairwise cases. Rows give the second instruction of the pair (the one moved up); columns give the first instruction (the one moved down).

                reg                      mem                      br                       join
reg             Extend live range of     Extend live range of     Extend live range of     Add compensation code
                the reg instruction      the reg instruction      the reg instruction      at the entrance
mem             Not allowed              Not allowed              Not allowed              Add compensation code
                                                                                           at the entrance
br              Add compensation code    Add compensation code    Not allowed              Not allowed
                at the branch exit       at the branch exit       (changes control flow)   (changes control flow)
join            Not allowed (possible    Not allowed (possible    Not allowed              No effect
                only in rare cases)      only in rare cases)      (changes control flow)

The possible reorderings are therefore:

  • Move reg above any instruction (using saved registers).
    • The register state remains recoverable.
  • Move reg or mem below br (by adding compensation code on the taken path).
  • Move reg or mem above join (by adding compensation code at the entrance).

(A table-driven encoding of these rules is sketched below.)
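As a concrete form of that encoding, here is a small C sketch of my own (not from the book) that stores the summary table as data: given the category of the instruction moved up and of the instruction moved down, it reports whether the swap is allowed and what fix-up it requires.

#include <stdio.h>

/* Instruction categories from Section 4.5.1. */
typedef enum { REG, MEM, BR, JOIN, NCAT } Cat;

/* rule[moved_up][moved_down]: what the swap of an adjacent pair requires.
 * Rows: the second instruction of the pair (moved up).
 * Columns: the first instruction of the pair (moved down). */
static const char *rule[NCAT][NCAT] = {
    /* up REG  */ { "extend live range", "extend live range", "extend live range",
                    "compensation code at entrance" },
    /* up MEM  */ { "not allowed", "not allowed", "not allowed",
                    "compensation code at entrance" },
    /* up BR   */ { "compensation code at branch exit",
                    "compensation code at branch exit",
                    "not allowed (control flow)",
                    "not allowed (control flow)" },
    /* up JOIN */ { "not allowed (rare cases only)",
                    "not allowed (rare cases only)",
                    "not allowed (control flow)",
                    "no effect" },
};

static const char *cat_name[NCAT] = { "reg", "mem", "br", "join" };

int main(void)
{
    /* Example queries: move a reg instruction above a br,
     * and move a mem instruction above a reg instruction. */
    Cat pairs[][2] = { { REG, BR }, { MEM, REG } };   /* { moved_up, moved_down } */

    for (unsigned i = 0; i < sizeof pairs / sizeof pairs[0]; i++) {
        Cat up = pairs[i][0], down = pairs[i][1];
        printf("move %s above %s: %s\n", cat_name[up], cat_name[down], rule[up][down]);
    }
    return 0;
}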

4.5.2 Implementing a Scheduling Algorithm

In this subsection, we consider a complete code-scheduling algorithm.
This algorithm has the following properties:

  • Optimization granularity: superblock
  • Register state is always recoverable.
    • Register live ranges are extended so that the old values needed for recovery stay available.
  • Handling of precise traps:
    1. Recover the original register state at the checkpoint.
    2. Interpret the original source code forward from the checkpoint.

We use translation from IA-32 to PowerPC as an example.

Original Source Code
a : add eax, ebx        ; eax += ebx
b : bz  L1              ; br L1 if ZF == 0
c : mov ebx, [eax + 4]  ; ebx = mem[eax + 4]
d : mul ebx, 10         ; ebx *= 10
e : add ebx, 1          ; ebx += 1
f : add ecx, 1          ; ecx += 1
g : bz  L2              ; br L2 if ZF == 0
h : add ebx, eax        ; ebx += eax
i : br  L3              ; br L3

Step 1: Translate to Single-Assignment Form

The original source code is translated to target code and placed in the code cache.

Original Source Code
a : add eax, ebx        ; eax += ebx
b : bz  L1              ; br L1 if ZF == 0
c : mov ebx, [eax + 4]  ; ebx = mem[eax + 4]
d : mul ebx, 10         ; ebx *= 10
e : add ebx, 1          ; ebx += 1
f : add ecx, 1          ; ecx += 1
g : bz  L2              ; br L2 if ZF == 0
h : add ebx, eax        ; ebx += eax
i : br  L3              ; br L3
Translated Target Code
a : add r1, r1, r2
b : bz CR0, L1
c : lwz r2, 4(r1)
d : mul r2, r2, 10
e : add r2, r2, 1
f : add r3, r3, 1
g : bz CR0, L2
h : add r2, r1, r2
i : b L3

The translated target code is then:

  • converted into SSA form, and
  • placed in a scheduling buffer.
Translated Target Code
a : add r1, r1, r2
b : bz CR0, L1
c : lwz r2, 4(r1)
d : mul r2, r2, 10
e : add r2, r2, 1
f : add r3, r3, 1
g : bz CR0, L2
h : add r2, r1, r2
i : b L3
SSA-Formed Target Code
a : t5 = r1 + r2, set CR0
b : bz CR0, L1
c : t6 = mem[t5 + 4]
d : t7 = t6 * 10
e : t8 = t7 + 1
f : t9 = r3 + 1
g : bz CR0, L2
h : t10 = t8 + t5
i : b L3

Step 2: Form Register Map

Register Map (RegMAP) maintains a mapping from the source registers to the single-assignment registers.

Original Source Code
 
a : add eax, ebx
b : bz  L1           
c : mov ebx, [eax + 4]
d : mul ebx, 10
e : add ebx, 1
f : add ecx, 1
g : bz  L2
h : add ebx, eax
i : br  L3
SSA-Formed Target Code
 
a : t5 = r1 + r2, set CR0
b : bz CR0, L1
c : t6 = mem[t5 + 4]
d : t7 = t6 * 10
e : t8 = t7 + 1
f : t9 = r3 + 1
g : bz CR0, L2
h : t10 = t8 + t5
i : b L3
Register Map (RegMAP)
     eax   ebx   ecx   edx
a :  t5    r2    r3    r4
b :  t5    r2    r3    r4
c :  t5    t6    r3    r4
d :  t5    t7    r3    r4
e :  t5    t8    r3    r4
f :  t5    t8    t9    r4
g :  t5    t8    t9    r4
h :  t5    t10   t9    r4
i :  t5    t10   t9    r4
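A minimal C sketch (my own illustration, with hypothetical names) of how these rows can be produced: walk the instructions, allocate a fresh SSA name for each register destination, and snapshot the source-register-to-name map after every instruction. With the destinations of the running example it reproduces the table above.

#include <stdio.h>

/* Source registers of the running example (IA-32). */
enum { EAX, EBX, ECX, EDX, NSRC };
static const char *src_name[NSRC] = { "eax", "ebx", "ecx", "edx" };

/* For the RegMAP we only need to know which source register, if any,
 * each instruction writes.  -1 means "no register destination". */
typedef struct { char label; int dest; } Inst;

static const Inst prog[] = {
    { 'a', EAX }, { 'b', -1 }, { 'c', EBX }, { 'd', EBX }, { 'e', EBX },
    { 'f', ECX }, { 'g', -1 }, { 'h', EBX }, { 'i', -1 },
};

int main(void)
{
    /* Current mapping: start with the consistent mapping eax..edx -> r1..r4. */
    char map[NSRC][8] = { "r1", "r2", "r3", "r4" };
    int next_temp = 5;                      /* fresh SSA names start at t5 */

    printf("     %-5s %-5s %-5s %-5s\n",
           src_name[EAX], src_name[EBX], src_name[ECX], src_name[EDX]);
    for (unsigned i = 0; i < sizeof prog / sizeof prog[0]; i++) {
        if (prog[i].dest >= 0)              /* destination gets a fresh SSA name */
            snprintf(map[prog[i].dest], sizeof map[0], "t%d", next_temp++);
        printf("%c :  %-5s %-5s %-5s %-5s\n", prog[i].label,
               map[EAX], map[EBX], map[ECX], map[EDX]);
    }
    return 0;
}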

Step 3: Reorder Code

Before Scheduling
 
a : t5 = r1 + r2, set CR0
b : bz CR0, L1
c : t6 = mem[t5 + 4]
d : t7 = t6 * 10
e : t8 = t7 + 1
f : t9 = r3 + 1
g : bz CR0, L2
h : t10 = t8 + t5
i : b L3
After Scheduling
 
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
i : b L3
Compensation Code
L2: t8 = t7 + 1
Register Map (RegMAP)
     eax   ebx   ecx   edx
a :  t5    r2    r3    r4
c :  t5    t6    r3    r4
b :  t5    r2    r3    r4
d :  t5    t7    r3    r4
f :  t5    t8    t9    r4
g :  t5    t8    t9    r4
e :  t5    t8    r3    r4
h :  t5    t10   t9    r4
i :  t5    t10   t9    r4

The following instructions are reordered:

  • Load instruction c is moved up because of its memory latency.
  • Add instruction e is moved down because its operand t7 (the multiply result) takes several cycles to become available.
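One common way to make such decisions automatically is list scheduling: among the instructions whose operands are ready, repeatedly pick the most urgent one (for example, the one with the longest latency), subject to the reordering rules of Section 4.5.1. The C sketch below is a generic illustration under these assumptions, not the book's exact algorithm; only the first five instructions are modeled, and the latencies and extra ordering edges are my own choices.

#include <stdio.h>

#define N 5

/* A tiny dependence graph: an instruction may be scheduled only after all of
 * its predecessors.  Predecessor edges encode true data dependences plus any
 * ordering constraints needed to respect the rules of Section 4.5.1. */
typedef struct {
    const char *name;
    int latency;                /* cycles until the result is available */
    int npred;
    int pred[3];                /* indices of predecessor instructions  */
} Inst;

static const Inst inst[N] = {
    { "a: t5 = r1 + r2, set CR0", 1, 0, {0} },
    { "c: t6 = mem[t5 + 4]",      3, 1, {0} },      /* load: long latency   */
    { "b: bz CR0, L1",            1, 1, {0} },      /* needs CR0 from a     */
    { "d: t7 = t6 * 10",          3, 2, {1, 2} },   /* uses t6; kept after b
                                                       here as a scheduling choice */
    { "f: t9 = r3 + 1",           1, 0, {0} },
};

int main(void)
{
    int scheduled[N] = { 0 };

    for (int step = 0; step < N; step++) {
        int best = -1;
        for (int i = 0; i < N; i++) {               /* find ready instructions */
            int ready = !scheduled[i];
            for (int p = 0; p < inst[i].npred; p++)
                if (!scheduled[inst[i].pred[p]]) ready = 0;
            /* among ready instructions prefer the longest latency,
             * breaking ties by original program order */
            if (ready && (best < 0 || inst[i].latency > inst[best].latency))
                best = i;
        }
        if (best < 0) break;                        /* nothing ready (cannot happen here) */
        scheduled[best] = 1;
        printf("%s\n", inst[best].name);            /* prints a, c, b, d, f */
    }
    return 0;
}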

Step 4: Determine Checkpoints

Checkpoints are places up to which all instructions of the original code have been completed.
At a checkpoint, guest state == host state.

Intuitively, checkpoints are similar to save points in a game.
We can restart execution from a checkpoint just as we can restart a game from a save point.

To find checkpoints, we track which source instructions have been committed (executed), in original source order.
The latest instruction of the fully committed source prefix is the checkpoint.

After Scheduling (with Commit, Checkpoint, and RegMAP per instruction)

Scheduled Code               Commit    Checkpoint   RegMAP (eax ebx ecx edx)
a : t5 = r1 + r2, set CR0    a         @            t5   r2    r3   r4
c : t6 = mem[t5 + 4]         -         a            t5   t6    r3   r4
b : bz CR0, L1               b,c       a            t5   r2    r3   r4
d : t7 = t6 * 10             d         c            t5   t7    r3   r4
f : t9 = r3 + 1              -         d            t5   t8    t9   r4
g : bz CR0, L2               -         d            t5   t8    t9   r4
e : t8 = t7 + 1              e,f,g     d            t5   t8    r3   r4
h : t10 = t8 + t5            h         g            t5   t10   t9   r4
i : b L3                     i         h            t5   t10   t9   r4

NOTE: @ means the initial register mapping.
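A small C sketch (my own illustration) of how the checkpoint column can be computed: walk the scheduled order and, before each instruction executes, take the longest fully executed prefix of the source order; its last instruction is the checkpoint.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *source    = "abcdefghi";    /* original source order            */
    const char *scheduled = "acbdfgehi";    /* order after scheduling (Step 3)  */
    int n = (int)strlen(source);
    int done[26] = { 0 };                   /* which source instructions have executed */

    printf("inst  checkpoint\n");
    for (int i = 0; i < n; i++) {
        /* Checkpoint before executing scheduled[i]: the longest prefix of the
         * source order whose instructions have all executed already. */
        int k = 0;
        while (k < n && done[source[k] - 'a']) k++;
        char checkpoint = (k == 0) ? '@' : source[k - 1];

        printf("  %c       %c\n", scheduled[i], checkpoint);
        done[scheduled[i] - 'a'] = 1;       /* commit this instruction */
    }
    return 0;                               /* prints @ a a c d d d g h */
}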

Step 5: Assign Registers

Registers are assigned as follows:

  1. Calculate register live ranges.
  2. Assign target registers to SSA variables using the live ranges.

For precise state recovery, we need to extend live ranges:

  • Checkpoints are save points.
  • We need to be able to go back to the checkpoint.
  • To do so, we need to keep the register states at the checkpoint.
  • In other words, we need to extend register live ranges.

For instructions c and b, we need to extend register live ranges at a (their checkpoint).
(x marks an extended live range.)

Register Live Ranges
r1 r2 r3 r4 t5 t6
|  |  |  |  |
   x  |  |  |  |
   x  |  |  |
Scheduled Code
 
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
Checkpoint
 
@
a
a
RegMAP
   eax   ebx   ecx   edx
a: t5    r2    r3    r4

For instruction d, we need to extend register live ranges at c.

Register Live Ranges
r1 r2 r3 r4 t5 t6 t7
|  |  |  |  |
   x  |  |  |  |
   x  |  |  |  |
      |  |  |  |  |
Scheduled Code
 
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
Checkpoint
 
@
a
a
c
RegMAP
   eax   ebx   ecx   edx
c: t5    t6    r3    r4

For instruction e, we need to extend register live ranges at d.

Register Live Ranges
r1 r2 r3 r4 t5 t6 t7 t8 t9
|  |  |  |  |
   x  |  |  |  |
   x  |  |  |  |
      |  |  |  |  |
      |  |  |     |     |
      x  |  |     |
      x  |  |     |  |
Scheduled Code
 
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
Checkpoint
 
@
a
a
c
d
d
d
RegMap
   eax   ebx   ecx   edx
d: t5    t7    r3    r4

For instruction h, we need to extend register live ranges at g.

Register Live Ranges
r1 r2 r3 r4 t5 t6 t7 t8 t9 t10
|  |  |  |  |
   x  |  |  |  |
   x  |  |  |  |
      |  |  |  |  |
      |  |  |     |     |
      x  |  |     |     |
      x  |  |     |  |  |
         |  |        |  |  |
Scheduled Code
 
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
Checkpoint
 
@
a
a
c
d
d
d
g
RegMAP
   eax   ebx   ecx   edx
g: t5    t8    t9    r4
Exercise: work out the register live ranges for instruction i (the answer follows).

For instruction i, we need to extend register live ranges at h.

Scheduled Code
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
i : b L3
Checkpoint
@
a
a
c
d
d
d
g
h
RegMAP
   eax   ebx   ecx   edx
h: t5    t10   t9    r4
Register Live Ranges
r1 r2 r3 r4 t5 t6 t7 t8 t9 t10
|  |  |  |  |
   x  |  |  |  |
   x  |  |  |  |
      |  |  |  |  |
      |  |  |     |     |
      x  |  |     |     |
      x  |  |     |  |  |
         |  |        |  |  |
         |  |           |  |

After calculating the register live ranges, we assign target registers to SSA variables.
Note the following:

  • A target register can be assigned to several SSA variables as long as their live ranges do not overlap.
    • r1 holds r1 (before a) and t5 (after a).
    • r2 holds r2, t7 (before e), t8 (after e), and t10.
    • r3 holds r3.
    • r4 holds r4.
    • r5 holds t6 and t9.
Register Live Ranges
r1 r2 r3 r4 t5 t6 t7 t8 t9 t10
|  |  |  |  |
   x  |  |  |  |
   x  |  |  |  |
      |  |  |  |  |
      |  |  |     |     |
      x  |  |     |     |
      x  |  |     |  |  |
         |  |        |  |  |
         |  |           |  |
After Assignment
 
a : r1 = r1 + r2,set CR0
c : r5 = mem[r1 + 4]
b : bz CR0, L1
d : r2 = r5 * 10
f : r5 = r3 + 1,set CR0
g : bz CR0, L2
e : r2 = r2 + 1
h : r2 = r2 + r1
i : b L3
Register Map (RegMAP)
     eax  ebx  ecx  edx
a :  r1   r2   r3   r4
c :  r1   r5   r3   r4
b :  r1   r2   r3   r4
d :  r1   r2   r3   r4
f :  r1   r2   r5   r4
g :  r1   r2   r5   r4
e :  r1   r2   r3   r4
h :  r1   r2   r5   r4
i :  r1   r2   r5   r4
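The assignment itself can be done with a simple interval-based scan: sort the values by the start of their (extended) live ranges and give each one the lowest-numbered target register whose previous occupant has already ended. The C sketch below is a generic illustration with made-up values and intervals, not the exact ranges of the chapter's example.

#include <stdio.h>

#define NVALS 5
#define NREGS 3

/* Live ranges as half-open intervals [start, end) over instruction slots,
 * already sorted by start.  The values and intervals are made up. */
typedef struct { const char *name; int start, end; int reg; } Value;

static Value vals[NVALS] = {
    { "v1", 0, 8, -1 },
    { "v2", 1, 3, -1 },
    { "v3", 3, 6, -1 },
    { "v4", 4, 8, -1 },
    { "v5", 6, 7, -1 },
};

int main(void)
{
    int free_at[NREGS] = { 0 };     /* first slot at which each register is free again */

    for (int i = 0; i < NVALS; i++) {
        for (int r = 0; r < NREGS; r++) {
            if (free_at[r] <= vals[i].start) {      /* ranges do not overlap */
                vals[i].reg = r;
                free_at[r] = vals[i].end;
                break;
            }
        }
        if (vals[i].reg < 0)
            printf("%s: keep in the register context block (no register free)\n",
                   vals[i].name);
        else
            printf("%s -> r%d\n", vals[i].name, vals[i].reg + 1);
    }
    return 0;
}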

Step 6: Add Compensation Code + Register Copy

Compensation code and register copies for consistent register mapping are added:

  • Compensation code ensures that instructions moved below a branch are still executed on the taken path.
  • Register copies keep the register mapping consistent across all superblocks (and basic blocks).
    • We need the copy r3 = r5 because, under the consistent mapping, ecx must be in r3 when control leaves this superblock.
Step 3
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
f : t9 = r3 + 1
g : bz CR0, L2
e : t8 = t7 + 1
h : t10 = t8 + t5
i : b L3
Compensation Code
L2: t8 = t7 + 1
Compensation Code Added
a : r1 = r1 + r2, set CR0
c : r5 = mem[r1 + 4]
b : bz CR0, L1
d : r2 = r5 * 10
f : r5 = r3 + 1, set CR0
g : bz CR0, L2'
e : r2 = r2 + 1
h : r2 = r2 + r1
    r3 = r5     // copy back: ecx -> r3 (consistent mapping)
i : b L3
Compensation Code
L2': r2 = r2 + 1
     r3 = r5
     b L2

Finally, we generate target code from the intermediate form.

PowerPC Code
a: add.   r1, r1, r2
c: lwz    r5, 4(r1)
b: beq    CR0, L1
d: muli   r2, r5, 10
f: addic. r5, r3, 1
g: beq    CR0, L2'
e: addi   r2, r2, 1
h: add    r2, r2, r1
   mr     r3, r5
i: b      L3
Compensation Code
L2': addi r2, r2, 1
     mr   r3, r5
     b    L2

Precise State Recovery

We will check whether this algorithm guarantees compatibility.
To do so, we examine precise state recovery at a trap.

If a trap occurs:

  1. Find the trapping superblock and the corresponding source basic blocks
    • using the reverse translation side table (Section 3.6.3).
  2. Find the checkpoint for the trapping instruction.
  3. Reconstruct the RegMAP at the checkpoint.
  4. Interpret the source instructions that follow the checkpoint.
Instruction d traps
Source Code
 
a : add eax, ebx
b : bz  L1           
c : mov ebx,[eax + 4]
d : mul ebx, 10
Target Code
 
a: add  r1, r1, r2
c: lwz  r5, 4(r1)
b: beq  CR0, L1
d: muli r2, r5, 10
Checkpoint
 
@
a
a
c
Register Map
   eax  ebx  ecx  edx
a: r1   r2   r3   r4
c: r1   r5   r3   r4
b: r1   r2   r3   r4
d: r1   r2   r3   r4

In this example, the RegMAP at checkpoint c is used for recovery.
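Here is a minimal C sketch (my own illustration; the data structures are hypothetical) of steps 2 to 4 for this case: given the trapping instruction, the runtime looks up its checkpoint, uses the checkpoint's RegMAP row to copy values out of the target register file into the guest context, and reports the source instruction at which interpretation should resume.

#include <stdint.h>
#include <stdio.h>

enum { EAX, EBX, ECX, EDX, NSRC };

/* One recovery record per translated instruction (hypothetical layout):
 * its checkpoint, the source instruction to resume interpretation at, and
 * the RegMAP at the checkpoint (index n means target register rn). */
typedef struct {
    char label;                 /* translated instruction that may trap    */
    char checkpoint;            /* '@' means the initial mapping           */
    char resume;                /* source instruction after the checkpoint */
    int  map[NSRC];
} Recovery;

static const Recovery side[] = {
    /* trap at d: checkpoint c, RegMAP at c = eax:r1 ebx:r5 ecx:r3 edx:r4 */
    { 'd', 'c', 'd', { 1, 5, 3, 4 } },
};

typedef struct { uint32_t reg[NSRC]; } GuestContext;

static const Recovery *lookup(char trapping)
{
    for (unsigned i = 0; i < sizeof side / sizeof side[0]; i++)
        if (side[i].label == trapping)
            return &side[i];
    return NULL;
}

int main(void)
{
    /* tgt[n] models the value of target register rn at the moment of the trap. */
    uint32_t tgt[8] = { 0, 11, 22, 33, 44, 55, 0, 0 };
    GuestContext ctx;

    const Recovery *r = lookup('d');
    for (int s = 0; s < NSRC; s++)
        ctx.reg[s] = tgt[r->map[s]];        /* rebuild the guest registers */

    printf("recovered state at checkpoint %c; resume interpreting at %c\n",
           r->checkpoint, r->resume);
    printf("eax=%u ebx=%u ecx=%u edx=%u\n",
           ctx.reg[EAX], ctx.reg[EBX], ctx.reg[ECX], ctx.reg[EDX]);
    return 0;
}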

Instruction c traps when b is taken.
Source Code
 
a : add eax, ebx
b : bz  L1           
c : mov ebx,[eax + 4]
Target Code
 
a: add  r1, r1, r2
c: lwz  r5, 4(r1)
b: beq  CR0, L1
Checkpoint
 
@
a
a
Register Map
   eax  ebx  ecx  edx
a: r1   r2   r3   r4
c: r1   r5   r3   r4
b: r1   r2   r3   r4
In the target code, c (which was hoisted above b) traps, and the register state at checkpoint a is reconstructed.
When the source code is then interpreted from a, the branch b is taken, so c is never executed and no trap is visible to the guest.

There are other methods for precise state recovery:

  • Use binary translation only.
    • The runtime keeps both the translated code before optimization and the optimized code.
    • If a trap occurs, it reconstructs the RegMAP and runs the unoptimized translated code.
  • Insert repair code.

    If a trap occurs, the repair code is executed.

    Source
    1: r1 = r2 + r3
    2: r9 = r1 + r5   // reg
    3: r6 = r1 * r7   // trap occurs
    Target
    1: R1 = R2 + R3
    3: R6 = R1 * R7   // trap occurs
    2: R9 = R1 + R5   // reg
    Repair code
    // if R6 = R1 * R7 traps,
    2': R9 = R1 + R5

Condition Code Handling

For handling condition codes, an efficient method is lazy evaluation (a sketch follows the list below).

We must also take care of condition codes when reordering.
For precise state recovery, we must be able to recover the condition codes.
Condition code recovery needs the following setup:

  • Determine condition code checkpoints.
    • Only condition-code-setting instructions matter here.
  • Calculate register live ranges
  • Assign target registers
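A minimal C sketch (my own illustration) of lazy condition-code evaluation: instead of computing the guest's flags after every instruction, the emulator records the operation and operand values of the last flag-setting instruction, and the flags are materialized only when something actually needs them (a conditional branch, or state recovery at a trap).

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Last flag-setting operation, recorded instead of the flags themselves. */
typedef enum { OP_ADD, OP_SUB } Op;
typedef struct {
    Op       op;
    uint32_t a, b;      /* operand values at the time the instruction ran */
} LazyFlags;

static LazyFlags last_cc;

/* Emulated "add that sets flags": do the arithmetic eagerly, but only
 * remember what is needed to recompute the flags later. */
static uint32_t add_set_cc(uint32_t a, uint32_t b)
{
    last_cc = (LazyFlags){ OP_ADD, a, b };
    return a + b;
}

/* Materialize the zero flag only when it is actually needed
 * (conditional branch, or precise state recovery at a trap). */
static bool zero_flag(void)
{
    uint32_t r = (last_cc.op == OP_ADD) ? last_cc.a + last_cc.b
                                        : last_cc.a - last_cc.b;
    return r == 0;
}

int main(void)
{
    uint32_t eax = 5, ebx = (uint32_t)-5;

    eax = add_set_cc(eax, ebx);        /* a: add eax, ebx  (result 0)   */
    /* ... other instructions that do not read the flags ...            */
    if (zero_flag())                   /* b: bz L1 -> flags recomputed  */
        printf("branch taken (ZF set)\n");
    return 0;
}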

Step 5a: Assign Register with Condition Codes

To recover CR0 for instruction b, we may have to re-execute "r1 + r2, set CR0" (instruction a).
To make that possible, the live ranges of r1 and r2 must be extended, and the condition code checkpoint is just before a (denoted @).
(y marks a live range extended for condition code recovery.)

Register Live Ranges
r1 r2 r3 r4 t5 t6
|  |  |  |  |
y  x  |  |  |  |
y  x  |  |  |  |
      |  |  |  |
Scheduled Code
 
a : t5 = r1 + r2, set CR0
c : t6 = mem[t5 + 4]
b : bz CR0, L1
d : t7 = t6 * 10
Condition Codes
Checkpoint
@
@
@
@

If d traps:

  • Reconstruct the RegMAP at checkpoint c.
  • Reconstruct the condition codes at @ by re-executing r1 + r2, set CR0.
  • Interpret the source code following checkpoint c (that is, from d).

4.5.3 Superblocks Versus Traces

Most dynamic translation systems use superblocks rather than traces.
In this subsection, we discuss why.

                      Superblock          Trace
Entry                 Single              Multiple
Exit                  Multiple            Multiple
Join point location   At the top          In the middle
Code duplication      Tail duplication    None

(Figures omitted: an example control flow graph with blocks A through G, formed into superblocks and into traces.)

Instruction Caching

Superblocks have disadvantages here:

  • Superblocks contain replicated code because of tail duplication.
  • This increases the memory footprint
  • and decreases instruction cache efficiency.

To mitigate this, only heavily used regions of code are converted to superblocks.

Branch Predictors

Superblocks have both advantages and disadvantages:

  • Cons: more branch predictor entries are needed.
    • Ex) The example superblock contains three branches.
  • Pros: path information is embedded in the superblock.
    • Ex) In the example, the path A-D-E-G goes back to the loop head, while F-G exits the loop.

Optimizations

Superblocks have an advantage for optimization:

  • Join points in the middle of traces impose restrictions on code motion.
    • Ex) We cannot move reg or mem below a join.

Superblocks/Traces Formation Process

Join points in traces also complicate the formation process:

  • During trace formation, we must check for and keep track of join points.
    • Ex) We must check the branch target of block C when it is discovered.
    • Ex) We must remember that G is a join point.
  • During superblock formation, we do not have to track join points.
    • Ex) We do not have to check the branch target of C.

Appendix

Static Single Assignment (SSA) Form

SSA is a property of an intermediate representation (IR).

SSA form has the following properties:

  • Each variable is assigned exactly once.
  • Each variable is defined before it is used.

Benefits

Use-def chains are easily identified.

A simple code sequence in C
x = 100;    // def x (dead store)
x = 200;    // def x
y = x;      // use x
The same sequence in SSA form
x1 = 100;   // def x1
x2 = 200;   // def x2
y1 = x2;    // use x2

Static vs. Dynamic

Static single assignment form
One variable per assignment statement (statically, in the program text)
Dynamic single assignment form
One variable per executed instance of an assignment (dynamically)
1 : for (i = 0; i < 100; i++) {
2 :   x = i;
3 :   y += x;
4 : }
Static assignment
1 : for (i = 0; i < 100; i++) {
2 :   x1 = i;
3 :   y1 += x1;
4 : }
Dynamic assignment
1 : for (i = 0; i < 100; i++) {
2 :   x_i = i;
3 :   y_i += x_i;
4 : }

cf. Dynamic Single Assignment (see Wikipedia)