Edit: Lutz Euler points out that NEXT Sequence (used) encodes an effective address with an index register but no base. The mistake does not affect the meaning of the instruction, but forces useless encoding. The differences in machine code are as follows.
I have defined the NEXTBut not the disassembly snippet below; They still show the old machine code.
Earlier this week, I took another look at the F18. As always with Chuck Moore’s work, it’s hard to tell the difference between madness and mere genius 😉 One thing that surprised me is how small the stack is: 10 slots, without any fancy overflow/underflow traps. The logic is that, if you need more slots, you’re doing it wrong, and that silent overflow is useful when you know what you’re doing. This certainly matches my experience on the HP-41c and x87. It also reminds me of a post by DJB condemning our abuse of x87’s rotating stack: his thesis was that, with careful scheduling, a “free” FXCH Makes the stack equivalent – if not better – than registers. The post ends with a (non-pipelined) loop that wastes no cycles shuffling data due to x87’s built-in stack rotation.
This makes me wonder what implementation techniques become available for stack-based VMs that limit their stack to 8 slots. Obviously, it would be ideal to keep everything in a register. However, if we do this naively, the push and pop become much more complex; There’s a reason why Forth engines typically only cache the top 1-2 elements of the stack.
I decided to mimic x87 and F18 (edit: modulo the latter two TOS cache registers): pushing/popping causes no data movement. Instead, as shown in the figure below, they decrement/increment a modular counter that points to the top of the stack (TOS). In software it will still be slow (most ISAs can’t index registers). The main thing is that the counter cannot take too many values: only 8 values ​​if there are 8 slots in the stack. Stack VMs already mimic primops for performance reasons (e.g., to help BTB by spreading the execution of the same primitive among multiple addresses), so it seems reasonable to specialize the primop for all 8 values ​​the stack counter can take.
In a regular direct threaded VM, most primitives will end with a code sequence that leads to the next one (NEXT), something like add RSI, 8 ; Increase virtual IP before JMP jump [rsi-8] ; Go to the address where RSI previously reported rsi Virtual instructions are pointers, and VM instructions are simply pointers in machine code to the corresponding primitive.
I will make two changes in this order. I don’t like hardcoding addresses in bytecode, and 64 bits per virtual instruction is highly wasteful. Instead, I would encode the offset from the premap code block: mov eax, [rsi]
add rsi, add 4 racks, where rdi jmp racks rdi is the base address for preamps.
I also need to dispatch based on the new value of the underlying stack counter. I decided to make dispatching as easy as possible by storing variants of each premap at regular intervals (e.g. a page). I rounded it off 64 * 67 = 4288 Bytes to reduce aliasing accidents. NEXT mov eax becomes something like, [rsi]
add rsi, 4 li rax, [rax + rdi + variant_offset]
JMP Racks
the trick is variant_offset = 4288 * stack_counterAnd the stack counter is (usually) known when the primitive is compiled. If the stack is left as is, the counter is also left as is; Pressing a value decreases the counter and pressing one increases the counter.
This seems fair enough. Let’s see if we can make it work.
I want to figure out a problem for which I will emit a lot of repetitive machine code. SLIME’s REPL and SBCL’s assembler are perfect for this task! (I hope it’s clear that I’m using unsupported internals; if it breaks, you keep the pieces.)
The basic design of a VM is:
r8–r15:Stack Slot (32bit);
rsi: base address for machine code primitives;
rdi: virtual instruction pointer (points to next instruction);
(import '(sb-assem:inst sb-vm::make-ea)) ; we'll use these two a lot
;; The backing store for our stack
(defvar *stack* (make-array 8 :initial-contents (list sb-vm::r8d-tn
sb-vm::r9d-tn
sb-vm::r10d-tn
sb-vm::r11d-tn
sb-vm::r12d-tn
sb-vm::r13d-tn
sb-vm::r14d-tn
sb-vm::r15d-tn)))
;; The _primop-generation-time_ stack pointer
(defvar *stack-pointer*)
;; (@ 0) returns the (current) register for TOS, (@ 1) returns
;; the one just below, etc.
(defun @ (i)
(aref *stack* (mod (+ i *stack-pointer*) (length *stack*))))
(defvar *code-base* sb-vm::rsi-tn)
(defvar *virtual-ip* sb-vm::rdi-tn)
(defvar *rax* sb-vm::rax-tn)
(defvar *rbx* sb-vm::rax-tn)
(defvar *rcx* sb-vm::rax-tn)
(defvar *rdx* sb-vm::rax-tn)
;; Variants are *primitive-code-offset* bytes apart
(defvar *primitive-code-offset* (* 64 67))
;; Each *stack-pointer* value gets its own code page
(defstruct code-page
(alloc 0) ; index of the next free byte.
(code (make-array *primitive-code-offset* :element-type '(unsigned-byte 8))))
The idea is that we will define functions to emit assembly code for each primitive; These functions will be implicitly parameterized *stack-pointer* thanks for doing @. We can then call them as many times as needed to cover all values *stack-pointer*. The only complication is that the code sequences will vary in length, so we need to insert padding to keep everything in sync. this is it emit-code
Does:
(defun emit-code (pages emitter)
;; there must be as many code pages as there are stack slots
(assert (= (length *stack*) (length pages)))
;; find the rightmost starting point, and align to 16 bytes
(let* ((alloc (logandc2 (+ 15 (reduce #'max pages :key #'code-page-alloc))
15))
(bytes (loop for i below (length pages)
for page = (elt pages i)
collect (let ((segment (sb-assem:make-segment))
(*stack-pointer* i))
;; assemble the variant for this value
;; of *stack-pointer* in a fresh code
;; segment
(sb-assem:assemble (segment)
;; but first, insert padding
(sb-vm::emit-long-nop segment (- alloc (code-page-alloc page)))
(funcall emitter))
;; tidy up any backreference
(sb-assem:finalize-segment segment)
;; then get the (position-independent) machine
;; code as a vector of bytes
(sb-assem:segment-contents-as-vector segment)))))
;; finally, copy each machine code sequence to the right code page
(map nil (lambda (page bytes)
(let ((alloc (code-page-alloc page)))
(replace (code-page-code page) bytes :start1 alloc)
(assert (<= (+ alloc (length bytes)) (length (code-page-code page))))
(setf (code-page-alloc page) (+ alloc (length bytes)))))
pages bytes)
;; and return the offset for that code sequence
alloc))
This function is used emit-all-code Emitting machine code for a set of primitives, tracking the start offset for each primitive.
These are relatively tight. I definitely like how little data manipulation occurs; NEXT The sequence is a bit hairy, but the indirect branch is probably its weakest (and least avoidable) point.
A VM without control flow isn’t even a toy. First, the unconditional relative jump. These can be encoded as [jmp] [offset]with a 32 bit offset relative to the end of offset. we just overwrite *virtual-ip* With new address.
We have almost enough material for an interesting demo after all. The only important thing that is still missing is the call from CL to VM. I will assume that the caller takes care of saving any important registers, and preempts (rsi) and virtual ip (rdi) The registers are setup correctly. The stack will be filled on entry, by copying values ​​from the buffer. rax Points to, and is written back on exit.
12345678910111213141516171819
(defun enter ()
(inst push sb-vm::rbp-tn)
(inst mov sb-vm::rbp-tn sb-vm::rsp-tn) ; setup a normal frame
(inst push *rax*) ; stash rax
(dotimes (i 8) ; copy the stack in
(inst mov (@ i) (make-ea :dword :base *rax* :disp (* 4 i))))
(next))
(defun leave ()
;; restore RAX
(inst mov *rax* (make-ea :qword :base sb-vm::rbp-tn :disp -8))
;; overwrite the output stack
(dotimes (i 8)
(inst mov (make-ea :dword :base *rax* :disp (* 4 i))
(@ i)))
;; and unwind the frame
(inst mov sb-vm::rsp-tn sb-vm::rbp-tn)
(inst pop sb-vm::rbp-tn)
(inst ret))
Now we can write either (essentially) straightline code or infinite loops. We need the conditional. Their implementation is quite similar jmpWith a little twist. Let’s start with jump if (top of the stack) is non-zero and jump if (top of the stack) is non-zero.
1234567891011121314151617
(defun jcc (cc)
;; next word is the offset if the condition is met, otherwise,
;; fall through.
(inst movsx *rax* (make-ea :dword :base *virtual-ip*))
(inst lea *rax* (make-ea :dword :base *virtual-ip* :index *rax*
:disp 4))
(inst add *virtual-ip* 4)
(inst test (@ 0) (@ 0))
;; update *virtual-ip* only if zero/non-zero
(inst cmov cc *virtual-ip* *rax*)
(next))
(defun jnz ()
(jcc :nz))
(defun jz ()
(jcc :z))
It is difficult to write programs without immediate values. Earlier control flow primitives already encode the immediate data in the virtual instruction stream. we will do the same lit, incAnd dec: :
123456789101112
(defun lit ()
(decf *stack-pointer*) ; grow the stack
(inst mov (@ 0) (make-ea :dword :base *virtual-ip*)) ; load the next word
(next 4)) ; and skip to the next instruction
(defun inc ()
(inst add (@ 0) (make-ea :dword :base *virtual-ip*))
(next 4))
(defun dec ()
(inst sub (@ 0) (make-ea :dword :base *virtual-ip*))
(next 4))
Finally, we have enough for a nice looking (if useless) loop. First, update the premap code page:
12345678
;; C-M-x to force re-evaluation, or defparameter
(defvar *primops* '(enter leave lit
swap dup drop
add sub inc dec
jmp jnz jz
call ret))
(defvar *primops-offsets* (assemble-primops))
12345678910
CL-USER> (vm '(1000000) (assemble '(lit 1 sub jnz -20 leave))
*code-page*)
Evaluation took:
0.009 seconds of real time
0.009464 seconds of total run time (0.009384 user, 0.000080 system)
100.00% CPU
14,944,792 processor cycles
0 bytes consed
#(0 0 0 0 0 0 0 1)
One million iterations of this stupid loop, which only decrements a counter, took 15M cycles. 15 cycles/iterations is actually not that bad… especially considering that it does an indirect jump after loading the 1, subtracting, and comparing with the 0.
We can do better by fusing. lit sub In dec: :
12345678910
CL-USER> (vm '(1000000) (assemble '(dec 1 jnz -16 leave))
*code-page*)
Evaluation took:
0.007 seconds of real time
0.006905 seconds of total run time (0.006848 user, 0.000057 system)
100.00% CPU
11,111,128 processor cycles
0 bytes consed
#(0 0 0 0 0 0 0 0)
Decrementing a counter and jumping if non-zero is a common operation (older x86 also implemented this in hardware loop). Let’s add increments and jump if not zero (djn) to the VM:
CL-USER> (vm '(1000000) (assemble '(djn -8 leave))
*code-page*)
Evaluation took:
0.005 seconds of real time
0.005575 seconds of total run time (0.005542 user, 0.000033 system)
120.00% CPU
8,823,896 processor cycles
0 bytes consed
#(0 0 0 0 0 0 0 0)
This is better… but I’m not really convinced by the conditional move. The branch will usually be predictable, so it makes sense to expose and duplicate it in hardware NEXT Sequence.
12345678910
(defun djn2 ()
(sb-assem:assemble ()
(inst sub (@ 0) 1)
(inst jmp :z fallthrough)
(inst movsx *rax* (make-ea :dword :base *virtual-ip*))
(inst lea *virtual-ip* (make-ea :dword :base *virtual-ip* :index *rax*
:disp 8))
(next -4) ; might as well pre-increment *virtual-ip*
fallthrough ; ASSEMBLE parses for labels, like TAGBODY
(next 4)))
The resulting code is not very large, and the two indirect jumps are 16 bytes apart.
This alternative implementation works better on our naive loop.
12345678910
CL-USER> (vm '(1000000) (assemble '(djn2 -8 leave))
*code-page*)
Evaluation took:
0.004 seconds of real time
0.004034 seconds of total run time (0.003913 user, 0.000121 system)
100.00% CPU
6,183,488 processor cycles
0 bytes consed
#(0 0 0 0 0 0 0 0)
Let’s see how this compares to straight assembly code.
123456
(defun ubench ()
(sb-assem:assemble ()
head
(inst sub (@ 0) 1)
(inst jmp :nz head)
(next)))
12345678910
CL-USER> (vm '(1000000) (assemble '(ubench leave))
*code-page*)
Evaluation took:
0.000 seconds of real time
0.000629 seconds of total run time (0.000628 user, 0.000001 system)
100.00% CPU
1,001,904 processor cycles
0 bytes consed
#(0 0 0 0 0 0 0 0)
My slow MacBook Air gets 1 iteration/cycle on a loop which is 100% overhead control. with djn2A good implementation of a reasonable distinct operator, loop is about 6 times slower than the original code. a worse implementation of djn This is still only 8x slower than pure native code, and extremely specialized bytecode is 11-15x slower than native code.
Specializing primeops into a virtual stack pointer is possible in practice when the stack is limited to small sizes. It also seems to have a reasonable runtime overhead for threaded interpreters. I’m not really interested in straight stack languages; However, I believe that a fixed stack VM makes a good runtime IR when combined with a mechanism for local variables. We’ll see if I find time to translate the high level language into a superoperator for such VMs. Importance of fused operators will diminish NEXT; In short, simpler function calls (because there’s less shuffling to get objects in the right position) will remain just as useful.
SBCL has certainly proven itself to be a good companion for exploring the generation of domain-specific machine code. I don’t know of any other language implementation with such support for interactive programming and machine code generation (and inspection). FWIW, I believe Luajit + Dynasm will be comparable soon.
Steel Bank Common Lisp: Because sometimes C abstracts away too much 😉
4 thoughts on “SBCL: the ultimate assembly code breadboard”
Picked up two new ideas that I expect will come up in conversations this week, and a look at orbitway added another, content that arms me with talking points rather than just filling time is the kind that provides ongoing value beyond the moment of reading and this site is generating that kind of ongoing value.
Picked up two new ideas that I expect will come up in conversations this week, and a look at orbitway added another, content that arms me with talking points rather than just filling time is the kind that provides ongoing value beyond the moment of reading and this site is generating that kind of ongoing value.
melbet iphone app melbet iphone app
100cuci site 100cuci site
вызов врача нарколога на дом https://narkolog-na-dom-voronezh-12.ru