Virtual Machines Ch.2
Emulation: Interpretation and Binary Translation
Overview + Jargon
Goal of this chapter: Understand mechanism of instructions emulation, the process on which virtual machines are based.
-
Emulation: the process of implementing interfaces and functionality for one system with different interfaces and functionality.
- Terminal emulator provide interfaces and functionality for terminal.
- Game Boy emulator provide interfaces and functionality for Game Boy on different hadware(e.g., PC, Smartphone).
- Virtual Machine provide interfaces and functionality for guest system.
- Instruction set emulator provide interfaces and functionality for guest’s ISA.
-
ISA (Instruction Set Architecture): interface between software and hardware
- define
- instructions
- register and memory architecture
- trap and interrupt architecture
- define
-
Instruction set emulation
- emulate guest’s (source) ISA using host’s (target) ISA.
- ex) Rosetta2 emulate x86_64 using arm64.
- consist of
- user-level instructions emulation (Chapter 2)
- memory architecture, traps and interrupts emulation (Chapter 3, 4)
- Methods of instruction set emulation
- Interpretation
- repeat the following:
- fetch a source instruction
- analyze it
- perform the required operation
- Pros: smaller initial cost
- Cons: bigger execution cost
- repeat the following:
- Binary translation
- translate a source instructions block into a target instructions block
- save it for repeated use
- Pros: smaller execution cost
- Cons: bigger initial cost
- Interpretation
Goal of my turn: Understand interpretation, methods of instruction set emulation. (Binary translation is mainly explaned by Seo-san)
Table of Contents
2.1 Basic Interpretation
2.2 Threaded Interpretation
2.3 Predecoding and Direct Threaded Interpretation
2.4 Interpreting a Complex Instruction Set
2.5 Binary Translation
Fundamental Knowledge: ISA
ISA(Instruction Set Architecture) is interface between software and hardware.
Classification of ISAs
- RISC(Reduced Instruction Set Computer)
- CISC(Complex Instruction Set Computer)
- MISC(Minimal Instruction Set Computer)
- VLIW(Very Long Instruction Word)
- EPIC(Explicity Parallel Instruction Computing)
…
In this chapter, RISC and CISC are described.
In reality, RISC and CISC are most common ISA.
RISC (Reduced Instruction Set Computer)
- Instruction set philosopy
- Single instruction execute fewer operation
- simple instructions
- reduce the numbers of cpu cycles per one instruction
- one cycle per one instruction
- ※“reduced” doesn’t mean a smaller set of instructions
- Single instruction execute fewer operation
- Instruction format
- Fixed-length instructions
- Fetch, decode is easy
- Fixed-length instructions
- RISC family
- ARM architecture
- MIPS architecture
- SPARC
- PowerPC (used to describe)
ex) Instruction format of PowerPC (D-form)
RT: target register
RA: source register
CISC (Complex Instruction Set Computer)
- Instruction set philosopy
- Single instructions can execute multiple operations (such as a load from memory, an arithmetic operation, and a memory store)
- complex instructions
- may take some cpu cycles per one instruction
- Single instructions can execute multiple operations (such as a load from memory, an arithmetic operation, and a memory store)
- Instruction format
- Variable-length instructions
- Fetch, decode is hard
- Variable-length instructions
- CISC family
- x86 (including IA-32 (used to describe))
- z/Architecture
ex) Instruction format of IA-32
Q. Check ISA of your PC. Answer the ISA is RISC or CISC.
Hint:
Linux or Mac) $ uname -m
Windows) $ systeminfo
2.1 Basic Interpretation
Decode-and-dispatch interpretation is simple and basic interpretation.
Interpreter Overview
Interpreter code must perform the following:
- emulate the complete architected state of a guest machine
- registers and memory architecture
- trap and interrupt architecture
- interpret a source code and modify the source state
Decode-and-Dispatch Interpreter: Simple interpreter
Decode-and-dispatch interpreter interprets one instruction at a time, and repeats.
Method
Decode-and-dispatch interpreter repeats the following:
- fetch a source instruction
- decode it (to identify what the instruction is)
- dispatch it to an interpretation routine (to modify the source state)
ex) Decode-and-dispatch interpreter code for PowerPC ISA.
// main interpreter loop
while (!halt && !interrupt) {
inst = code[PC]; // fetch one instruction
opcode = extract(inst, 31, 6); // decode it
switch(opcode) {
// dispatch it to interpretation routine
case LoadWordAndZero: LoadWordAndZero(inst); break;
case ALU: ALU(inst); break;
case Branch: Branch(inst); break;
...
}
}
// interpretation routine list
// Load a 32bit word into a 64bit register and zero the upper 32bit
LoadWordAndZero(inst) {
RT = extract(inst, 25, 5);
RA = extract(inst, 20, 5);
displacement = extract(inst, 15, 16);
if (RA == 0)
source = 0;
else
source = regs[RA];
address = source + displacement;
regs[RT] = (data[address] << 32) >> 32;
PC = PC + 4;
}
// Arithmetic-logic unit
ALU(inst) {
RT = extract(inst, 25, 5);
RA = extract(inst, 20, 5);
RB = extract(inst, 15, 5);
source1 = regs[RA];
source2 = regs[RB];
extended_opcode = extract(inst, 10, 10);
// ALU instruction have dispatch loop
switch(extended_opcode) {
case Add: Add(inst);
case AddCarrying: AddCarrying(inst);
case AddExtended: AddExtended(inst);
...
}
PC = PC + 4;
}
...
* Example code is written in high-level language, but it can be written in assembly language for higher performance easily.
Pros and Cons
Pros: simple, easy to understand
Cons: very high performance cost
Reasons for high performance cost:
- Single source instruction → Tens of instruction in the target ISA
- Many branch instructions
- If they are hard to branch prediction, branch instructions reduce performance.
Q. Answer the number of branch instructions in decode-and-dispatch when processing LoadWordAndZero
instruction.
* Do not consider extract()
as function.
Answer
5 times (switch, call, if, return, jump-to-head) + break?2.2 Threaded Interpretation
Threaded interpretation reduce branch instructions from decode-and-dispatch interpretation.
Threaded Interpretation Overview
Decode-and-dispatch interpreter have many branch instructions:
- a branch for switch statement
- a branch to the interpretation routines
- a branch to return from routines
- a branch to top of loop
Threaded interpreter delete some branch instructions:
- a branch for switch statement
- a branch to return from routines
- a branch to top of loop
Threaded interpreter only have “branches to the interpretation routines”.
- routine0 -> routine1 -> routine2 -> …
Replace switch
with Table Access
switch
in decode-and-dispatch maps an opcode to the interpreter routine.
switch(opcode) {
case LoadWordAndZero: LoadWordAndZero(inst);
case ALU: ALU(inst);
case Branch: Branch(inst);
...
}
We can remove the switch
by introducing dispatch table.
FuncPtrArray dispatch_table[][] = {
{null}, // opcode = 00
...
{Branch, ...}, // opcode = 18
...
{ADD, ADDCarrying, AddExtended, ...}, // opcode = 31
{LoadWordAndZero, ...}, // opcode = 32
...
};
dispatch_table[opcode][extended_opcode](inst); // call interpretation routine
Deleted results of dispatch loop
// main interpreter loop
while (!halt && !interrupt) {
inst = code[PC]; // fetch one instruction
opcode = extract(inst, 31, 6);
extended_opcode = extract(inst, 10, 10);
dispatch_table[opcode][extended_opcode](inst);
}
Copy Dispatch-Loop to Every Routines
We can delete “dispatch loop” by copying the code into the end of interpreter routines.
And use goto
instead of function call to delete “branches to return from routines”.
ex) Threaded interpreter for PowerPC ISA
LoadWordAndZero:
RT = extract(inst, 25, 5);
RA = extract(inst, 20, 5);
displacement = extract(inst, 15, 16);
if (RA == 0)
source = 0;
else
source = regs[RA];
address = source + displacement;
regs[RT] = (data(address) << 32) >> 32;
PC = PC + 4;
// decode-and-dispatch
if (halt || interrupt)
goto exit;
inst = code[PC];
opcode = extract(inst, 10, 10);
extended_opcode = extract(inst, 10, 10);
routine = dispatch[opcode, extended_opcode];
goto *routine;
Add:
RT = extract(inst, 25, 5);
RA = extract(inst, 20, 5);
RB = extract(inst, 15, 5);
source1 = regs[RA];
source2 = regs[RB];
sum = source1 + source2;
regs[RT] = sum;
PC = PC + 4;
// decode-and-dispatch
if (halt || interrupt)
goto exit;
inst = code[PC];
opcode = extract(inst, 31, 6);
extended_opcode = extract(inst, 10, 10);
routine = dispatch[opcode, extended_opcode];
goto *routine;
Summary of Threaded Interpretation
Threaded interpreter delete branch instructions by
- replacing
switch
with dispatch table access - copying “dispatch loop” into the end of interpreter routines
- replacing function call with
goto
This method is called indirect threaded interpretation because the jump through the dispatch table is indirect.
Improvements and Room for Improvements
Improvement: Lower runtime overhead than decode-and-dispatch
Room for Improvements
- Memory overhead by dispatch table
- Runtime overhead by
- dispatch table access
- register indirect branch (difficult to predict target address)
2.3 Predecoding and Direct Threaded Interpretation
2.3.1 Basic Predecoding
Predecoding is to decode instructions before we start interpretation.
So far, decoding is performed just before dispatch.
ex) Decoding in threaded interpreter
if (halt || interrupt)
goto exit;
inst = code[PC];
// decode
opcode = extract(inst, 10, 10);
extended_opcode = extract(inst, 10, 10);
// dispatch
routine = dispatch[opcode, extended_opcode];
goto *routine;
Predecoding decode instructions and save the informations as intermediate form in advance.
ex) Predecoding for PowerPC ISA
lwz r1, 8(r2) ; load word and zero
add r3, r3, r1 ; r3 = r3 + r1
stw r3, 0(r4) ; store word
// intermediate form
struct instruction {
unsigned short op; // opcode
unsigned long extended_op; // extended_opcode
unsigned char dest; // destination register
unsigned char src1; // source register 1
unsigned int src2; // source register 2
} code [CODE_SIZE];
// predecoding in advance
code[0].op = 32; // opcode
code[0].dest = 1; // r1
code[0].src1 = 2; // r2
code[0].src2 = 8; // 8
code[1].op = 31; // opcode
code[1].extended_op = 266; // extended_op
code[1].dest = 3; // r3
code[1].src1 = 1; // r1
code[1].src2 = 3; // r3
// interpretation routines after predecoding
LoadWordAndZero:
RT = code[TPC].dest;
RA = code[TPC].src1;
displacement = code[TPC].src2;
if (RA == 0)
source = 0;
else
source = regs[RA];
address = source + displacement;
regs[RT] = (data[address] << 32) >> 32;
SPC = SPC + 4; // source program counter
TPC = TPC + 4; // intermediate form counter
// dispatch without decode
if (halt || interrupt)
goto exit;
opcode = code[TPC].op;
extended_opcode = code[TPC].extended_op;
routine = dispatch_table[opcode][extended_opcode]; // access to dispatch table
goto *routine;
2.3.2 Direct Threaded Interpretation
Direct threaded interpretation is goto
the next routines without accessing dispatch table.
Predecoding for Direct Threaded Interpretation
For direct threaded interpretation, we lookup dispatch table and save it in intermediate form when predecoding.
ex) Predecoding for direct threaded interpretation
lwz r1, 8(r2) ; load word and zero
add r3, r3, r1 ; r3 = r3 + r1
stw r3, 0(r4) ; store word
// intermediate form
struct instruction {
unsigned FuncPtr op; // function pointer to interpreter routine
unsigned char dest;
unsigned char src1;
unsigned int src2;
} code [CODE_SIZE];
code[0].op = dispatch_table[32][0]; // save LoadWordAndZero()
code[0].dest = 1; // r1
code[0].src1 = 2; // r2
code[0].src2 = 8; // 8
code[1].op = dispatch_table[31][266]; // save Add()
code[1].dest = 3; // r3
code[1].src1 = 1; // r1
code[1].src2 = 3; // r3
Direct Threaded Interpretation
We can goto
the next routine using code[i].op
without accessing dispatch table.
ex) Direct threaded interpretation for PowerPC ISA
LoadWordAndZero:
RT = code[TPC].dest;
RA = code[TPC].src1;
displacement = code[TPC].src2;
if (RA == 0)
source = 0;
else
source = regs[RA];
address = source + displacement;
regs[RT] = (data[address] << 32) >> 32;
SPC = SPC + 4;
TPC = TPC + 1;
if (halt || interrupt)
goto exit;
routine = code[TPC].op; // access the next routine without access dispatch_table
goto *routine;
Q. Answer the number of dispatch table access.
Pseudo assembly code
mov sum, 0 ; sum = 0
mov i, 0 ; i = 0
loop:
add sum, i ; sum += i
add i, 1 ; i += 1
cmp i, 3 ; if i < 3
jl loop ; jump to loop
Indirect Threaded Interpreter
14 times = 2 + 4 * 3Predecoding + Direct Threaded Interpreter
6 times = 2 + 42.4 Interpreting a Complex Instruction Set
Interpreting of CISC is harder than interpreting of RISC.
Review: CISC
CISC have the following characteristics:
- One instruction may perform many operations
- Variable length of instructions
- Fetch and decode is hard.
ex) Instruction format of IA-32
2.4.1 Interpretation of the IA-32 ISA
Method-1) Decode-and-Dispatch Interpretation: Basic Interpretation
Decode-and-dispatch interpretation for CISC ISA is almost the same as it for RISC ISA.
Dispatch Loop
void cpu_loop() {
while (!halt) {
// decode
instr = IA_32_FetchDecode(PC);
// dispatch
if (!IA_32_string_instruction) {
instr.execute();
} else {
while(need_to_repeat(instr.prefixmask)) {
instr.execute();
handle_async_event(); // i.e. an interrupt
}
}
PC = PC + instr.ilen;
handle_async_event();
}
}
Decode
Decoding for CISC ISA is hard because of complex format of CISC ISA.
// General instruction template (intermediate form)
struct IA_32instr {
unsigned short opcode;
unsigned short prefixmask;
char ilen; // instruction length
InterpreterFunctionPointer execute; // semantic routine for this instr
struct {
// general address computation: [Rbase + (Rindex << shmt) + displacement]
char mode;
char Rbase;
char Rindex;
char shmt;
long displacement;
} operandRM;
struct {
char mode; // either register or immediate
char regname; // register number
long immediate; // immediate value
} operandRI;
} instr;
// Big fetch_decode table indexed by the opcode
// - DecodeAction: format style(ModR/M, SIB, displacement, immediate, etc)
// - InterpreterFunctionPointer: pointer for the interpreter routine
IA_32OpcodeInfo_t IA_32_fetch_decode_table[] = {
{ DecodeAction, InterpreterFunctionPointer},
{ DecodeAction, InterpreterFunctionPointer},
...
};
// Decode instructions for CISC ISA
// and fill in struct instr.
IA_32instr
IA_32_FetchDecode(PC) {
fetch_ptr = PC;
// 1. parse prefixes
byte = code[fetch_ptr++];
while (is_IA_32_prefix(byte)) {
add_prefix_attribute(byte, instr);
byte = code[fetch_ptr++];
}
// 2. parse opcode
instr.opcode = byte;
if (instr.opcode == 0x0f) {
instr.opcode = 0x100 | code[fetch_ptr++]; // 2 Byte opcode
}
// 3. Table look up based on opcode to find action and function pointer
decode_action = IA_32_fetch_decode_table[instr.opcode].DecodeAction;
instr.execute = IA_32_fetch_decode_table[instr.opcode].InterpreterFunctionPointer;
// 4. Operand Resolution -- setup the operandRI and operandRM fields above
if (need_Mod_RM(decode_action)) {
parse_Mod_RM_byte(instr);
if (need_SIB_byte(instr->operandRM.mode))
fetch_SIB_byte(instr);
if (need_displacement(instr->operandRM.mode))
fetch_displacement(instr);
}
if (need_displacement(decode_action))
fetch_immediate(instr);
// 5. bookkeeping and return
instr.ilen = bytes_fetched_for_this_instr;
return instr;
}
Interpreter Routines
// ADD: register + Reg/Mem --> register
// - IA_32_GPR: General purpose register
// - virtual_mem: Virtual memory address
// - IA_32_CC_FLAGS: Status register
void
ADD_RX_32b(IA_32instr instr) {
unsigned op1_32, sum_32;
op1_32 = IA_32_GPR[instr.operandRI.regname];
if (mem_operand(instr.operandRM.mode)) {
unsigned mem_addr = resolve_mem_address(instr);
op2_32 = virtual_mem[mem_addr];
} else {
op2_32 = IA_32_GPR[instr.operandRM.Rbase];
}
sum_32 = op1_32 + op2_32;
IA_32_GPR[instr.operandRI.regname] = sum_32;
SET_IA_32_CC_FLAGS(op1_32, op2_32, sum_32, IA_32_INSTR_ADD32);
}
void ADD_XR_32b(IA_32instr instr) { ... }
void ADD_RI_32b(IA_32instr instr) { ... }
Pros and Cons
Pros: easy to understand
Cons: quite slow
One reason of its slowness is its generality.
General decoding is not needed for some simple instructions.
ex) Simple instruction nop = 0x90
Method-2) Optimization of Common Cases: “make the common case fast”
“Common case” can be made fast by this optimization.
For IA-32 ISA, the common case is:
- noprefix bytes
- a single-byte opcode
- simple operand specifiers
This interpreter do the followings:
- Fetch one byte and dispatch
- Interpret depending on the case:
- Simple(common) case: Simple routines interpret the instruction
- Complex case: Complex routines parse options and shared routine interprets the instructions.
- Prefix case: Set prefix flags
- Go to next instruction
2.4.2 Threaded Interpretation
Hybrid method of the decode-and-dispatch and threaded methods is suitable for interpretation of CISC ISA.
Review: Threaded Interpretation
Threaded interpreter delete branch instructions by
- replacing
switch
with table access - copying “dispatch loop” into the end of interpreter routines
- replacing function call with
goto
We cannot apply threaded interpretation directly to CISC ISA because of:
- memory overhead
- General decoding code for CISC is large.
- Copying it to every interpretation routine have large memory overhead.
- low performance improvement
- General decoding code for CISC is slow.
- Executing it every time make small improvement.
Method-3) Hybrid of Decode-and-Dispatch and Threaded Interpretation
We use them according to their roles:
- Treaded interpretation is used for “the common case”
- Decode-and-dispatch is used for complex instructions
To do so, we will copy “simple decode-and-dispatch code” to every interpretation routine because of:
- small memory overhead
- Simple decode-and-dispatch code is small
- high performance improvement
- Simple decode-and-dispatch code is fast
Method-4) Predecoding and Direct Threaded Interpretation
Predecoding of CISC ISA is very difficult.
There are two significant problems:
- Predecoded instruction format become very large.
struct IA_32instr
consumes 24 bytes
- Code discovery for CISC ISA is very hard.
- It is not practical to identify all instructions boundaries or to separate data from instructions
2.4.3 A High Performance IA-32 Interpreter
Software pipelining reduce execution time for interpretation.
Parallerization of Decode-and-Dispatch
Normal decode-and-dispatch do the following per one loop:
- decode the currect instruction
- dispatch the current instruction
“Decode” and “dispatch” cannot be processed in parallel because “dispatch” must use the result of “decode”.
Software Pipelining
Interpreter using software pipelining do the following per one loop:
- decode the next instruction
- dispatch the current instruction
“Decode” and “dispatch” can be processed in parallel because “current instruction dispatch” do not use the result of “next instruction decode”.
Method-5) Software Pipelining Interpreter
“Next instruction decode” and “current instruction dispatch” are processed in same loop.
ex) IA-32 interpreter in PowerPC assembly code
loop:
cmpwi cr0,r4,48 ; compare length with 48(bits)
bgt cr0, long_instr ; branch to long_instr if length > 48
sld r5,r1,r4 ; shift instruction I+1 into r5
extrdi r4,r5,16,0 ; extract upper 2 bytes of I+1 from "buffer"
sldi r5,r4,3 ; multiply by 8: convert to double word offset
lbzx r4,r4,r9 ; look up instruction length for I+1
ldx r7,r5,r8 ; look up interpreter routine for I+1
ld r11,0(r10) ; prefetch next 8 bytes
mtctr r16 ; move I's interpreter routine pointer into ctr
bctrl ; dispatch I; branch to ctr and link
mr r6,r7 ; move register; to maintain software pipeline
b loop ; continue loop
In pseudo C code
char instr_length_table[] = {
instr_length,
instr_length,
...
};
FuncPtr dispatch_table[] = {
FunctionPointer,
FunctionPointer,
...
};
void cpu_loop() {
while(1) {
if (instr.length > 48) { // Complex instruction
long_instr_routine(instr);
} else {
// next instruction decode
next_instr = code[++I];
upper_two_bytes = get_upper_two_bytes(next_instr);
next_instr.length = instr_length_table[upper_two_bytes];
next_instr.execute = dispatch_table[upper_two_bytes];
// currect instruction dispatch
instr.execute();
instr = next_instr;
}
}
}
2.5 Binary Translation
Binary Translation map from source binary to target binary.
Predecoding vs. Binary Translation
Predecoding need interpreter routines.
Binary translation can be processed directly.
We need state mapping for binary translation.
State Mapping
Interpreter code perform state mapping in interpretation.
We must define state mapping in binary translation because there is no code to perform state mapping.
ex) State Mapping from Target ISA to Source ISA.
Binary Translation
Binary Translation is similar to interpreter routines, but it use registers defined by state mapping.
ex) Binary Translation from IA-32 to PowerPC.
addl %edx, 4(%eax) ; %edx = %edx + *(%eax + 4)
movl 4(%eax), %edx ; *(%eax + 4) = %edx
add %eax, 4 ; %eax = %eax + 4
r1 points to IA-32 register context block
r2 points to IA-32 memory image
r3 contains IA-32 ISA PC value
lwz r4,0(r1) ; load %eax from register block
addi r5,r4,4 ; add 4 to %eax
lwzx r5,r2,r5 ; load operand from memory
lwz r4,12(r1) ; load %edx from register block
add r5,r4,r5 ; perform add
stw r5,r12(r1) ; put result into %edx
addi r3,r3,3 ; update PC (3 bytes)
lwz r4,0(r1) ; load %eax from register block
addi r5,r4,4 ; add 4 to %eax
lwz r4,12(r1) ; load %edx from register block
stwx r4,r2,r5 ; store %edx value into memory
addi r3,r3,3 ; update PC (3 bytes)
lwz r4,0(r1) ; load %eax from register block
addi r4,r4,4 ; add immediate
stw r4,0(r1) ; place result back into %eax
addi r3,r3,3 ; update PC (3 bytes)
Translate multiple instructions together
We can optimize binary translation by
- translating multiple instructions together
- changing state mapping to use registers directly
ex) Optimized Binary Translation
r1 points to IA-32 register context block
r2 points to IA-32 memory image
r3 contains IA-32 ISA PC value
r4 holds IA-32 register %eax ; use r4 directly
r7 holds IA-32 register %edx ; use r7 directly
addi r16,r4,4 ; add 4 to %eax
lwzx r17,r2,r16 ; load operand from memory
add r7,r17,r7 ; perform add of %edx
addi r16,r4,4 ; add 4 to %eax
stwx r7,r2,r16 ; store %edx value into memory
addi r4,r4,4 ; increment %eax
addi r3,r3,9 ; update PC (9 bytes) = sum of three instructions