JOP - Java Optimized Processor
First Approach: A general purpose accu/register machine
This design is the result of earlier research on simple processor structures. It has no JVM-specific parts.
The processor is an accu/register machine: every result of an operation is stored in one special register, the accu, and the operands are a register and the accu. Loads and stores have two addressing modes: register and indirect with displacement. All constants are realized as special registers. There are up to 1024 registers. All instructions except branches and jumps take one cycle (3-stage pipeline). The accu and the registers are 32 bits wide. Instructions are fetched from on-chip memory.
The instruction set follows the RISC idea: it is very simple, and every instruction fits in 16 bits.
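The accu/register model described here can be sketched in a few lines of Python. This is only an illustrative toy, not the actual hardware design: the class name `AccuMachine` and the register contents are assumptions, and only the mnemonics `ld`, `st`, `add` and `sub` are taken from the listings on this page.

```python
class AccuMachine:
    """Toy model of an accu/register machine (illustrative, not JOP itself)."""

    MASK = 0xFFFFFFFF  # accu and registers are 32 bits wide

    def __init__(self, num_regs=1024):
        self.accu = 0
        self.regs = [0] * num_regs  # up to 1024 registers

    def ld(self, r):
        """accu <- reg[r]"""
        self.accu = self.regs[r]

    def st(self, r):
        """reg[r] <- accu"""
        self.regs[r] = self.accu

    def add(self, r):
        """accu <- accu + reg[r]; the result always lands in the accu."""
        self.accu = (self.accu + self.regs[r]) & self.MASK

    def sub(self, r):
        """accu <- accu - reg[r], wrapping at 32 bits."""
        self.accu = (self.accu - self.regs[r]) & self.MASK


m = AccuMachine()
m.regs[0], m.regs[1] = 40, 2  # hypothetical register contents
m.ld(0)   # accu = reg[0]
m.add(1)  # accu = accu + reg[1]
m.st(2)   # reg[2] = accu
```

Every arithmetic operation needs an explicit `ld` before and an `st` after it, which is why the JVM routines further down spend most of their instructions moving values through the accu.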
An example program:
Serial.asm reads one line (with echo) from the serial port and prints the reversed string.
The current JVM (jvm.asm) is implemented in JOP assembler. Every JVM instruction is fetched, decoded and executed in software. The JVM PC and SP are ordinary JOP registers.
Fetch and decode:
    fetch                   // label
            ld (pc+1)       // pc points to old byte code
            add jmp_tbl
            st tmp
            ld pc           // increment pc
            add 1
            st pc
            jp (tmp)        // jump indirect in jump table
Part of jump table:
    jmp_tbl
            ...             // all 256 possible values are listed
            bnz iadd        // 0x60
            bnz notimp
            bnz notimp      // means not yet implemented
            bnz notimp
            bnz isub
            bnz notimp
            bnz notimp
            bnz notimp
            ...
Fetch and decode of a JVM instruction thus takes 6 + 2*3 + n cycles: six single-cycle instructions, two jumps (the jp and the bnz in the jump table) at 3 cycles each, and n cycles for the memory access.
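The fetch/decode loop and its 256-entry jump table can be mirrored in Python as a sketch. The dictionary layout of `vm`, the `trace` list and the stub handlers are assumptions made for illustration; the opcode values 0x60 (iadd) and 0x64 (isub) follow from the table above.

```python
def notimp(vm):
    # Mirrors the "notimp" entries in the jump table.
    raise NotImplementedError("bytecode not yet implemented")


def iadd(vm):
    vm["trace"].append("iadd")  # stub: a real handler would work on the stack


def isub(vm):
    vm["trace"].append("isub")  # stub


# All 256 possible bytecode values are listed; most are not implemented.
JMP_TBL = [notimp] * 256
JMP_TBL[0x60] = iadd
JMP_TBL[0x64] = isub


def step(vm):
    """Fetch and decode one bytecode, then jump to its handler."""
    opcode = vm["code"][vm["pc"] + 1]  # ld (pc+1): pc points to the old bytecode
    handler = JMP_TBL[opcode]          # add jmp_tbl / st tmp
    vm["pc"] += 1                      # ld pc / add 1 / st pc
    handler(vm)                        # jp (tmp)


# pc starts one before the first bytecode because it points to the "old" one.
vm = {"code": bytes([0x60, 0x64]), "pc": -1, "trace": []}
step(vm)
step(vm)
```

After the two steps, `vm["trace"]` holds the handlers in execution order and `vm["pc"]` points to the last executed bytecode.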
A sample JVM stack instruction:
    iadd
            ld (sp)         // read first argument
            st tmp
            ld (sp-1)       // read second argument
            add tmp         // execute
            st (sp-1)       // store back
            ld sp           // decrement stack pointer
            sub 1
            st sp
            jp fetch        // jmp to next fetch
The execution of this instruction takes 8 + 3 + 3*n cycles.
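The stack handling of iadd can be followed step by step in a small Python sketch. The list `mem` standing in for stack memory and the sample values are assumptions; the comments map each statement back to the assembler lines above.

```python
def iadd(mem, sp):
    """Add the two top-of-stack words and return the new stack pointer."""
    tmp = mem[sp]                                   # ld (sp) / st tmp
    mem[sp - 1] = (mem[sp - 1] + tmp) & 0xFFFFFFFF  # ld (sp-1) / add tmp / st (sp-1)
    return sp - 1                                   # ld sp / sub 1 / st sp


mem = [0, 7, 35]   # hypothetical stack memory; sp points at the top element
sp = iadd(mem, 2)  # the sum replaces the second argument, sp drops by one
```

The three accesses to `mem` are the three memory operations that contribute the 3*n term to the cycle count.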
In the best case (assuming a local, single cycle memory) a simple instruction takes:
With the jump table in local memory (zero wait states) at address 0, fetch/decode can be reduced to 8 + 3 + n cycles:
    fetch
            ld (pc+1)       // pc points to old byte code
            st tmp
            st tmp          // read after write
            ld (tmp)
            st tmp
            ld pc           // increment pc
            add 1
            st pc
            jp (tmp)        // jump indirect to instruction
A change to the instruction set that adds accu-indirect load and jump would reduce fetch/decode to 5 + 3 + n cycles:
    fetch
            ld pc           // increment pc
            add 1
            st pc
            ld (a)          // load instruction accu indirect
            ld (a)          // load address from jmp table accu indirect
            jp (a)          // jump accu indirect to instruction
The last possible enhancement with this architecture would be to copy the fetch code to the end of every instruction. This saves one jump but adds extra code.
    iadd
            ld (sp)         // read first argument
            st tmp
            ld (sp-1)       // read second argument
            add tmp         // execute
            st (sp-1)       // store back
            ld sp           // decrement stack pointer
            sub 1
            st sp
            ld pc           // increment pc
            add 1
            st pc
            ld (a)          // load instruction accu indirect
            ld (a)          // load address from jump table accu indirect
            jp (a)          // jump accu indirect to instruction
Now one instruction takes 13 + 3 + 4*n cycles. The main disadvantage is that we lose the central execution point of the fetch loop, which would be the ideal place to switch threads and poll for I/O requests.
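To compare the variants, the cycle formulas from this page can be put side by side in a short Python sketch. The function names are hypothetical, and adding the fetch/decode and execute figures into one total per iadd is an assumption that follows the way the last formula combines them; n is the memory access time in cycles.

```python
IADD_EXECUTE = lambda n: 8 + 3 + 3 * n  # iadd body ending in "jp fetch"


def baseline(n):
    """Fetch/decode through the bnz jump table, then iadd."""
    return (6 + 2 * 3 + n) + IADD_EXECUTE(n)


def local_jump_table(n):
    """Jump table in local memory at address 0."""
    return (8 + 3 + n) + IADD_EXECUTE(n)


def accu_indirect(n):
    """Instruction set extended with accu-indirect load and jump."""
    return (5 + 3 + n) + IADD_EXECUTE(n)


def inlined_fetch(n):
    """Fetch code copied to the end of iadd; one jump is saved."""
    return 13 + 3 + 4 * n


for name, f in [("baseline", baseline),
                ("local jump table", local_jump_table),
                ("accu indirect", accu_indirect),
                ("inlined fetch", inlined_fetch)]:
    print(f"{name:18s} n=1: {f(1):2d} cycles, n=2: {f(2):2d} cycles")
```

Under these assumptions the inlined variant is exactly 3 cycles cheaper than the accu-indirect variant for any n, which matches the one saved jump.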
You can download the complete design.
Further enhancements need a redesign: Second Approach: More specific for the JVM
Copyright © 2000-2007, Martin Schoeberl