Sat, 24 Nov 2007

# Simple token-threaded "Forth" interpreter. -*- asm -*-
# by Kragen Javier Sitaker; dedicated to the public domain,
# i.e. I relinquish whatever exclusive rights copyright law
# gives me with regard to this work.
# Major parts taken from Richard W.M. Jones's public-domain
# JONESFORTH 42 by Richard W.M. Jones <rich at annexia.org>
# http://annexia.org/forth

# This program just outputs "hello, world, hello" under Linux.

# to compile:   
# gcc -m32 -nostdlib -static -o tokthr2 tokthr2.S


### Why Small Things are Interesting

# There are still a lot of computers out there that have tens of
# kilobytes of memory or less, and they cost a lot less than,
# say, a cellphone.  Cellphones are apparently still too
# expensive for half the world's population.  I want to see how
# close I can get to having a comfortable programming
# environment in a smaller device.

# Some smallish microcontroller chips from five different
# manufacturers, with current Digi-Key prices:
# Name              bytes RAM  bytes ROM  MHz  price    
# ATtiny2313        128        2048       20   US$1.38  
# ATMega48-20AU     512        4096       20   US$1.62  
# MSP430F1111AIPW   128        2264       8    US$2.43  
# LPC2101           2048       8192       70   US$2.52  
# H8/300H Tiny      1536       8192       12   US$3.58  
# M16C/R8C/Tiny/1B  1024       16384      12   US$3.54  
# SX28AC/SS         136        3072       50   US$2.79  

# More practically and short-termly, small projects can take
# less time to finish, and I feel like I need to learn about
# different approaches to implementing programming languages.

### Why This is Small

# The normal Forth representation of a function is as an array
# of pointers to the other functions it calls, in sequence; a
# few of those other functions may move the interpreter pointer
# around in that array, or snarf up a constant that's stored in
# the array, or stuff like that, but for the most part, the
# functions just get called in sequence.  This is called
# "threaded code" and it's fairly compact, especially on 16-bit
# systems where the pointers are only two bytes.

# A traditional approach taken by Forth implementations to
# reduce code size even further is called "token threading".
# Rather than making arrays of 16-bit or 32-bit pointers, they
# make lists of 8-bit indices into an array of pointers.  This
# has two advantages:

# 1. the indices are one fourth the size of a list of 32-bit
#    pointers;
# 2. it is possible to save these lists of indices somewhere
#    outside of memory and continue to use them even after
#    making some changes to the code, as long as the same
#    indices in the table have the same meanings.  So, for
#    example, you could write some boot firmware in this
#    "bytecode".

# It also has some disadvantages:
# 1. You run out of space in the table.  Even a fairly minimal
#    full Forth system contains close to 256 subroutines.  You
#    can mitigate this by packing, say, two 12-bit pointers
#    every three bytes, or maybe by having a special bytecode
#    that looks up the next byte in an extended table.
# 2. It's slower and makes the machine-code part of the program
#    take more space.  The traditional LODSW; JMP AX version of
#    $NEXT from the eForth Model, which fetches and executes the
#    next execution token in the threaded list, is three bytes
#    and two instructions; my 'next' here is 41 bytes and 14
#    instructions, which is big enough that I jump to it (2
#    bytes) rather than making an assembler macro.  Which blows
#    your branch target buffers to pieces.  Oh well.  The
#    performance penalty is probably two orders of magnitude
#    over native code, but I haven't measured it yet.  I
#    measured an earlier version on my 700MHz PIII laptop on an
#    empty loop at about a factor of 3.5 over simple
#    direct-threading, which in turn is on the order of 10 times
#    slower than machine code.

# Anyway, so this is an example program built using this
# technique.  It implements two Forthlike stacks and interpreted
# subroutines, but not yet the ability to define new subroutines
# at run-time.

### What's Here

# I've implemented all of the primitives from C. H. Ting and
# Bill Muench's public-domain (?) eForth Model 1.0, except for
# the following:
# - I haven't implemented their lowercase "next" (as in FOR
#   NEXT) because I think it's a bad idea, it's complex, and it
#   can be implemented at a higher level if you really need it;
# - I didn't implement !IO because it's a no-op in this context;
# - I haven't yet implemented ?RX, although I think it's
#   possible to implement it on top of syscall5, using select();

# However, most of it is untested and therefore probably broken.
# Procedure call and return and the system calls do work.

# Currently registers are used as follows:
# %esi --- interpreter pointer; points to next byte to execute
# %ebp --- return stack pointer; points to last thing pushed.  This stack, 
#          like the other one, grows downwards.
# %esp --- data stack pointer; points to last thing pushed.  This is
#  	   the processor's standard stack pointer; "push" and "pop"
#  	   instructions use it, which makes assembly code to use it
#  	   quite concise.  The Intel "call" and "ret" instructions
#  	   would also use this stack, but they aren't used in this
#  	   program.
# flags --- the "down" direction flag must be cleared.
#           Fortunately this seems to be the case by default.

# It's probably missing a couple of primitives needed because of
# the token-threading implementation strategy; the address of
# the token table probably needs to be knowable, at least.

# Direct and indirect threading, the normal Forth approaches to
# allowing unrestricted coexistence of words written in assembly
# language and interpreted Forth, both had heavy space costs
# here --- close to 100% for the bytecode currently in the
# system.  So the inner interpreter checks, for each bytecode,
# whether it is in the range of bytecodes whose interpretations
# are in native code, and picks the relevant code path.  This
# avoids consuming any space per-word for this distinction, but with
# what I assume is a heavy performance cost.

### How Small This Is

# eForth 1.0's machine-code part seems to be 171 instructions
# and 399 bytes, including some data that's mixed in there with
# it.

# Last I checked, this program uses only 19 different
# instructions: jmp, jz; push, pop, lodsb, lodsl, xchg, mov;
# movsbl, inc, and, xor, or, lea, cdq, rcl, add, idiv; and int.

# At the moment, this program is 229 bytes in 125 machine-code
# instructions, plus 13 bytes of read-only data, 104 bytes of
# token table for the 52 currently-defined words, and another
# 143 bytes of bytecode in those words, and then another 22
# bytes in the main program, for a total of 511 bytes.

# One important thing that's missing here is the dictionary
# structure, which will minimally use up another hundred bytes
# or so.

### Other Things I Tried

# I tried switching to caching the top of the data stack in a
# register, on the theory that it would shorten things like
# 'and'.  Currently 'and' is pop %eax; pop %ebx; and %ebx,
# %eax; push %eax; jmp next.  If top of stack is cached in %eax
# instead of being stored in memory, this becomes pop %ebx; and
# %ebx, %eax; jmp next, which is considerably shorter.
# However, most things don't change, and other things become
# longer due to the extra work to save top-of-stack.  I tried
# using both %ebx and %eax as the top-of-stack cache.

# In the version using %ebx as top-of-stack, the total size of the
# machine-code part was 216 bytes, 115 instructions, compared to 197
# bytes, 112 instructions for the version using the current strategy.
# In the version using %eax as top-of-stack, it was only 215 bytes,
# but that's still worse than 197.

# In previous versions, all routines were machine-code routines
# that you could just jmp to.  High-level bytecode words began
# with "call dolist", which took the saved %eip off the stack
# and stuck it in %esi.  Unfortunately, that added 5 bytes to
# each bytecode word; as I write this, the bytecode region is
# 143 bytes and contains 24 word definitions --- so 5 bytes each
# would have been 120 bytes of overhead, or 84%!  It also would
# have required a region to be both executable and writable to
# support run-time routine definition, which is kind of a pain
# thse days.

# In previous versions, the token-table entries were 32 bits
# each (instead of 16 bits as they are now), which added another
# 2 bytes of overhead per word.  There are currently 52 words,
# so that's another 104 bytes of shaved overhead.

### What's Wrong With This Program

# - It's a long way from doing anything useful.
# - Only these words are tested: hello, sub1, type, comma,
#   world, newline, dolit_s8, dot, bye, exit, branch_on_0,
#   c_bang, drop, dup, swap, negative, umplus, divmod,
#   syscall5, syscall3, zero, syscall1, rpop, rpush, one,
#   dolit_32, neg1, add, emit, tuck, udot, udot_nospc,
#   udot_nonzero, branch, invert, add1, negate, xor.  These are
#   not tested, and therefore may be broken: execute, bang, at,
#   c_at, rp_at, rp_bang, sp_at, sp_bang, over, and, or, r_at.
# - There's no dictionary structure yet.
# - It probably needs another couple of primitives.
# - There's no checking for stack overflow or underflow, but
#   they will break things.

### The Beginning of the Program

# .. include system library header so we know __NR_exit = 1 and
# __NR_write = 4
#include <asm/unistd.h>

### The token table

        ## I was frustrated with the unreadability of my
        ## bytecode lists; I was counting token table entries
        ## by hand and writing bytecodes numerically.  So I
        ## wrote a macro to help.

        ## Note that we are using a separate .subsection
        ## directive because gas 2.17 doesn't support putting
        ## that in the .pushsection line, even though it is
        ## documented to do so; see message from Maciej
        ## W. Rozycki on 2007-10-11, subject "Re: How to use
        ## .pushsection?",
        ## http://sourceware.org/ml/binutils/2007-10/msg00176.html
        ## for more details)

        ## The first few entries in the table of bytecodes are
        ## all defined in machine code; the rest are all
        ## defined in bytecode.  The inner interpreter examines
        ## each bytecode to determine which category it falls
        ## in in order to figure out how to execute it,
        ## including what base address to add its offset to.
        ## This sucks for extensibility but rocks for
        ## compactness.

        .macro define_bytecode name, origin
        .pushsection .data      # save current position, go to data section
        .subsection 1           # and subsection 1, where we put the addrs
        b_\name = (. - token_table) / 2 # define b_foo as the index of this ptr
        .ifeq b_\name - 256
        .error "\name got bytecode 256"
        .endif
        #.int \name              # insert pointer which will be resolved next:
        .short \name - \origin # insert offset which will be resolved next
        .popsection             # return to where we were, and
\name:                          # define the name
        .endm
        .macro defasm name
        define_bytecode \name, machine_code_primitives
        .endm
        .macro defbytes name
        define_bytecode \name, bytecode_start
        .endm

        .data 1                 # Start putting stuff in data subsection 1
        .align 4
        ## table to define the "bytecode" instructions
token_table:

        .data 3
instructions:
	# And here is the actual "program" in that bytecode.
        .byte b_hello           # string "hello" and count
        .byte b_sub1            # subtract 1 from count: "hell"
        .byte b_type            # spit it out
        .byte b_comma, b_type, b_world, b_type # ", world"
        .byte b_comma, b_type, b_hello, b_type, b_newline, b_type
        # test the "dot" command to print out numbers
	.byte b_dolit_s8, -120, b_dot
        .byte b_dolit_s8, 104, b_dot, b_newline, b_type
        .byte b_bye

### The Return Stack
# We put Forth return addresses here, but programs can also use
# it for other purposes.

        .bss
        .space 4096
initial_return_stack_pointer:   

### Initialization
        	
        .text                   # the following stuff goes in the text segment
        .global _start          # declare _start as a global symbol 
                                # (otherwise ld won't be able to find it)
_start:                         # this is the entry point for ELF I guess
	mov $initial_return_stack_pointer, %ebp
        mov $instructions, %esi # %esi is the interpreter pointer register
        jmp next                # and now we start the interpreter.
                                # (somewhat silly since we could just
                                # fall through..)

### The Machine-Code Primitives
# Also next (aka the address interpreter or inner interpreter)
# is in this section.

machine_code_primitives:
                
# dolit_s8 takes a signed 8-bit literal from the instruction
# stream and pushes it onto the stack.

        defasm dolit_s8
        lodsb
        movsbl %al, %eax
        push %eax
        jmp next

        defasm dolit_32         # more general dolit
        lodsl
        push %eax
        jmp next
        
        defasm exit             # Return from a colon defn.
        xchg %ebp, %esp
        pop %esi
        xchg %ebp, %esp
        jmp next

        defasm execute          # Run xt on data stack.
        pop %eax                # Here 'xt' is the one-byte token.
        jmp execute_eax

# Branch if top of stack is 0 (implementing IF).
# Both branch instructions take a signed byte offset from the bytecode
# stream.
        defasm branch_on_0
        pop %eax
        and %eax, %eax
        jz branch
        inc %esi                # skip 1-byte jump offset
        jmp next

        defasm branch
        lodsb
        movsbl %al, %eax        # same insn size as cbtw; cwde
        add %eax, %esi
        jmp next	

# Store a cell.
        defasm bang
        pop %ebx
        pop (%ebx)              # I'm amazed this is legal
        jmp next
# Fetch a cell.
        defasm at
        pop %ebx
        push (%ebx)             # I'm amazed this is legal too
        jmp next

# Store a byte.
        defasm c_bang
        pop %ebx
        pop %eax
        mov %al, (%ebx)         # push and pop don't do bytes
        jmp next

# Fetch a byte.
        defasm c_at
        pop %ebx
        xor %eax, %eax
	mov (%ebx), %al
        push %eax
        jmp next
        
# Get the return stack pointer.
        defasm rp_at
        push %ebp
        jmp next
        
# Set the return stack pointer.
        defasm rp_bang
        pop %ebp
        jmp next
        
# Pop the return stack to the data stack ( R> )
        defasm rpop
        push (%ebp)
        lea 4(%ebp), %ebp       # add or xchg/pop: same size
        jmp next
        
# Push the return stack from the data stack ( >R )
        defasm rpush
        lea -4(%ebp), %ebp
        pop (%ebp)
        jmp next

# Get the data stack pointer (before it gets pushed).
        defasm sp_at
        push %esp               # safe on 286 and later
        jmp next

# Set the data stack pointer.
        defasm sp_bang
        pop %esp
        jmp next
        
# Pop the stack.
        defasm drop
        pop %eax
        jmp next
        
# Push a copy of TOS.
# eForth 1.0 used BX to index the stack here, for a couple of
# reasons: on the 8086, SP got decremented prior to the fetch,
# and also wasn't valid as a base or index register.
        defasm dup
        push (%esp)
        jmp next
        
# Swap top two stack items ("exch" in PostScript)
        defasm swap
        pop %eax
        pop %ebx
        push %eax
        push %ebx
        jmp next
# Stack manipulation ( w1 w2 -- w1 w2 w1 )
# technically not necessary, but it's so easy and tiny
        defasm over
        push 4(%esp)           
#        jmp next               fall through because "next" is next
	
# "next" fetches the next bytecode and runs it.  It's placed
# here in the middle of the bytecode definitions so that more
# of them can use the short two-byte jump form to get to it.

next:
        xor %eax, %eax          # set %eax to 0
        xor %ebx, %ebx          # clear high half of %ebx
        lodsb                   # load %al from where %esi points
                                # (%esi is the interpreter pointer)
execute_eax:
        ## load offset of new word into %ebx
        mov token_table(,%eax,2), %bx  # bx := token_table[eax * 2bytes]
        cmp $last_asm_bytecode, %eax
        jbe next_primitive      # if primitive, handle primitive word
        ## otherwise, handle a bytecode definition or "colon list"
        # save old %esi on return stack
        xchg %ebp, %esp
        push %esi
        xchg %ebp, %esp
        lea bytecode_start(%ebx), %esi
        jmp next

next_primitive:
        lea machine_code_primitives(%ebx), %ebx
        jmp *%ebx
        

# Push true if n negative. ( n -- f )
        defasm negative
        pop %eax                
        cdq
        push %edx
        jmp next

# Bitwise operators:
        defasm and
        pop %eax
        pop %ebx
        and %ebx, %eax
        push %eax
        jmp next

        defasm or
        pop %eax
        pop %ebx
        or %ebx, %eax
        push %eax
        jmp next

        defasm xor
        pop %eax
        pop %ebx
        xor %ebx, %eax
        push %eax
	jmp next
        
# add two unsigned numbers, returning sum and carry.
# ( u1 u2 -- u3 cy )
        defasm umplus
        xor %ecx, %ecx
        pop %eax
        pop %ebx
        add %ebx, %eax
        rcl $1, %ecx
        push %eax
        push %ecx
        jmp next

# Divide double-precision by single-precision, unsigned.
# UM/MOD from eForth.  ( udl udh un -- ur uq )
        defasm divmod
        pop %ebx
        pop %edx
        pop %eax
        idiv %ebx
        push %edx
        push %eax
        jmp next

# Copy the top of the return stack onto the data stack.
# May be the traditional Forth word "I".
        defasm r_at
        push (%ebp)
        jmp next

# syscall5:   
# Linux system call with up to 5 arguments
# This is no longer the fashionable way to make system calls
# in Linux.  Now you're supposed to use SYSENTER on newer
# CPUs, and rather than have you figure out which one to use,
# the kernel mmaps a chunk of code called a VDSO into your
# memory space at a random address and tells you where to
# find it using the ELF auxiliary vector.  Then you're
# supposed to invoke the dynamic linker or something to parse
# the ELF executable mysteriously manifested in this way by
# the kernel, and then resolve an undefined symbol in libc
# into calls to it.  See "What is linux-gate.so.1?"
# http://www.trilithium.com/johan/2005/08/linux-gate/
# "The Linux kernel: System Calls" by Andries Brouwer, 2003-02-01
# http://www.win.tue.nl/%7Eaeb/linux/lk/lk-4.html
# "About ELF Auxiliary Vectors" by Manu Garg
# http://manugarg.googlepages.com/aboutelfauxiliaryvectors

# But the old int $0x80 approach still works, thank goodness,
# because all of that is *way* more than these ten
# instructions.
        defasm syscall5
        pop %edi
        ## we have to save %esi for the interpreter
        mov %esi, -4(%ebp)
        pop %esi
        pop %edx
        pop %ecx
        pop %ebx
        pop %eax
        int $0x80
        push %eax
        mov -4(%ebp), %esi
        jmp next

        last_asm_bytecode = b_syscall5

### Basic Interpreted Words
        ## a macro for defining interpreted words
        ## Because after I left off b_exit once, I wasted a long
        ## time trying to figure out what was wrong
        .macro def name, bytes:vararg
	defbytes \name
        .byte \bytes
        .byte b_exit
        .endm

        .data 2                 # separate subsection from token table
bytecode_start:
# System call with three arguments.
        def syscall3, b_zero, b_zero, b_syscall5
# System call with one argument.
        def syscall1, b_zero, b_zero, b_syscall3
        def bye, b_dolit_s8,__NR_exit, b_zero, b_syscall1 # exit program
        def zero, b_dolit_s8,0            # push 0
        def one, b_dolit_s8,1
	
# This word outputs a string whose address and count are on 
# the stack.  ( b u -- )

        defbytes type
        .byte b_rpush, b_rpush  # move two args onto rstack
                                # system call is __NR_write:    
        .byte b_dolit_s8,__NR_write
        .byte b_one             # push constant 1: stdout
        .byte b_rpop, b_rpop    # move two args back from rstack
        .byte b_syscall3        # call syscall with 3 args
        .byte b_drop            # discard return value
        .byte b_exit            # return

# The next few words exist just to poke string addresses
# and lengths onto the stack so "type" can print them.
        .macro make_counted_string name, contents
        defbytes \name
        .byte b_dolit_32        # dolit_32 pushes a 32-bit
        .int string_\name       # literal --- an addr, here
                                # now push literal length and return
        .byte b_dolit_s8,string_length_\name, b_exit
        .pushsection .rodata    # define the actual string:
string_\name:
        .ascii "\contents"
        ## Here we count the length of the string --- computers
        ## are for counting bytes so people don't have to!
        string_length_\name = . - string_\name
        .popsection
        .endm

        make_counted_string hello, "hello"
        make_counted_string world, "world"
        make_counted_string comma, ", "
        make_counted_string newline, "\n"

### Some More Basic Words

        def neg1, b_dolit_s8, -1   # ( -- -1 )
        def add, b_umplus, b_drop  # ( a b -- a+b ) drop the carry
        def sub1, b_neg1, b_add    # ( n -- n-1 )
        def rot, b_rpush, b_swap, b_rpop, b_swap # ( a b c -- b c a )
        def unrot, b_rot, b_rot    # ( a b c -- c a b )
        def tuck b_dup, b_unrot    # ( a b -- b a b )

# emit: output a single byte.  eForth calls this "TX!".

# This version is 11 bytes, including the buffer byte, plus the 2-byte
# token table pointer. a machine-code version I wrote the other day
# was 28 bytes.  However, I also added rot, unrot, and tuck to support
# this function, and they total 11 bytes, plus 6 bytes of overhead.
# For a total of 11+2+11+6 = 30 bytes.  Not winning yet on size over
# x86 asm!  But we're getting close.

emit_buffer:
        .byte 0
        defbytes emit
        .byte b_dolit_32
        .int emit_buffer
        .byte b_tuck            # save a copy of address for b_type
        .byte b_c_bang          # store into emit buffer
        .byte b_one, b_type, b_exit # output one-byte buffer

### "u." prints out an unsigned number.
# I had a version of this in x86 machine code in 52 bytes (23
# instructions), essentially exactly the same code as here.
# This is 31 bytes, plus 6 bytes of overhead, plus I had 
# to define b_divmod (9 bytes plus 2 bytes overhead).  Now we are
# starting to win!

        defbytes udot            # print space after number
        .byte b_udot_nospc, b_dolit_s8, 0x20, b_emit, b_exit 
        defbytes udot_nospc      # print number without space
        .byte b_dup, b_branch_on_0,2, b_udot_nonzero, b_exit
        .byte b_drop, b_dolit_s8, '0, b_emit, b_exit
	defbytes udot_nonzero
	.byte b_zero, b_dolit_s8,10, b_divmod # divide by 10
        .byte b_dup, b_branch_on_0,3, b_udot_nonzero # recurse if nonzero
	.byte b_branch,1        # else
        .byte b_drop            # drop zero quotient
        .byte b_dolit_s8, '0, b_add, b_emit # print digit
	.byte b_exit
	
### Add signed numeric output, ".".  This cost 20 bytes plus 8 bytes
# of overhead, but added some fundamental numeric operations; only 12
# of those 28 bytes are specific to "."

        # logical not: return true for 0, false (0) otherwise
	#def zeq, b_branch_on_0,2, b_zero, b_exit, b_neg1
        # logical bitwise not
        def invert, b_dolit_s8, -1, b_xor
        def add1, b_one, b_add
        # arithmetic negation
        def negate, b_invert, b_add1
        # print signed number
        defbytes dot
        .byte b_dup, b_negative, b_branch_on_0,4
        .byte b_dolit_s8, '-, b_emit, b_negate # in the negative case
        .byte b_udot, b_exit

### Obviously the next thing to do is to add ".S", print the
# stack, so that I can stop having to investigate problems by
# using gdb.

Coase argues that firms grow as long as the costs of organizing
transactions outside the firm, using the market pricing mechanism, are
smaller than the costs of organizing those same transactions inside the
firm, using management.

Today we are seeing a trend toward running more and more of our
networked applications inside of Google's data center.  (Or Amazon's, or
whatever.)  Perhaps this happens because the costs of organizing
exchanges of information between administrative domains on the internet
--- not necessarily using a market pricing mechanism, but not involving
complete mutual trust --- are nonzero, and the costs of organizing those
same exchanges inside of a single administrative domain are sometimes
smaller.

Suppose this is the case.  What can we predict?  What kinds of
technological or social developments might increase the costs (or
decrease the efficacy) of exchanges of information between
administrative domains?  Spam is an obvious example; so is software
insecurity in general.  Proprietary systems software, or systems
software that is not designed for internet scale, could also have such
effects.

What kinds of technological or social developments might decrease the
costs or increase the efficacy of these exchanges of information?
Better systems security, better decentralized source control systems (so
that software improvements can be shared more readily between
administrative domains, standardized protocols covering a wider range of
the needed functionality.

But the explanation for these trends might not be that inter-domain
costs are going up; it might also be that intra-domain costs are going
down.  Things like the Google Filesystem, Mapreduce, rsync, backuppc,
Xen, VMWare, and the like, may allow larger systems to be run by fewer
sysadmins.

(Riffs on a discussion tonight with Rohit Khare, who's visiting us here
in Argentina.)