PB compiler tutorial: 2009

03 October 2009

14. Symbol Table

Edit: This turned out to need some adjustments. Get the correct version here: http://pbtut.blogspot.com/2010/01/listing-symboltablepb.html

The compiler needs to have a list of all the symbols used in the program. This includes global variables, local variables, functions, and similar.

A variable inside a function may have the same name as a global variable. A part of the program where variables must be unique is called a scope. In most programming languages nesting of scopes is allowed. In PureBasic, one level of nesting is allowed: You can have local scopes (in procedures) __inside__ the global scope (outside procedures). In Pascal scopes can be arbitarily nested. First, each file ("unit") has its own scope, and this scope may contain procedures, and procedures may actually contain other, local, procedures.

To allow for this nesting, symbol tables are usually best implemented with a tree structure. For simplicity, we won't do that. We will simply shove all symbols into a hash map. To distinguish between scopes, we will put the name of the containing scope in front of the symbol name, separated by a ":". Here's an example:

Global MyGlobalVariable, RedVariable
Procedure NewScopeInHere()
  RedVariable = 1 ; (global)
  Protected MyLocalVariable, RedVariable
  RedVariable = 2 ; (local)
EndProcedure

The symbol table will have these entries:
MyGlobalVariable
RedVariable
NewScopeInHere
NewScopeInHere:MyLocalVariable
NewScopeInHere:RedVariable

You can see that even though we have two variables named RedVariable, there is no problem, because they are in different scopes.

Each entry in our hash table will be a pointer to a structure of either the type SVariable or the type SFunction. However, to know which type we're dealing with, we need a common header, which tells us the type. The common header is the type SSymbol, which we can see here:

Structure SSymbol
  Name.s
  Kind.i
EndStructure

Name.s is the name without the scope (like "RedVariable" for both those variables). Kind tells us what kind of symbol this (procedure or variable). The possible values are specified in an enumeration:

Enumeration ; Symbol kind
  #SKind_Variable
  #SKind_Function
EndEnumeration

The following two structures extend SSymbol, and are the ones we actually allocate and use:

Structure SVariable Extends SSymbol
  Type.i
  RefKind.i ; Global, local, parameter
  RefOffset.i
EndStructure

Structure SFunction Extends SSymbol
  Type.i    ; Return type
  Ref.s     ; Entry point
  ParamCount.i
  ParamTypes.i[10] ; We allow at most 10 parameters
EndStructure

Important: I have defined the word "type" to mean datatype. Everywhere where it would be natural to use the word "type" I use the word kind instead.

I won't bother explaining the symbol table too much, it's just ordinary programming. You should read the code thoroughly until you understand it. Here is symboltable.pb: http://pastebin.ca/1590601
Edit: Don't this one, go to the top of this post to find a new version!

To actually use the symboltable we need an example program. I made a new file "symsimple.pb", and here it is: http://pastebin.ca/1590600

It accepts two kind of statements: assignments and variable declarations.
assignment ::= <name> = <expression>
vardecl ::= var <name> as <typename>

The only allowed typename for now is "long". An expression consists of a single value, which can be an integer or a variable.

Changes:
- The new VarRef() does the heavy work of what VarValue() did before, however, it takes a pointer to SVariable. From the rest of the code, simply call VarValue() as before, and it will call VarRef() for you.

- VarValue() has been changed to see if the variable is declared (if it can be looked up from the symbol table, it is declared). If it's declared, it calls VarRef() to do the heavy lifting. Else it shows an error message.

- Statement() distinguishes between two statements, in case of a variable declaration it calls the procedure VariableDeclaration().

- VariableDeclaration() reads the name of the variable, followed by the keyword "as" followed by the type name. It then calls a function in the symboltable.pb to add the variable there. AddVariable() will give an error if this variable was already declared.

- GlobalAsmDefinition() takes two parameters, of which only one is useful to us. It is called once for each entry in the symbol table. If the entry represents a global variable, it emits the asm code to allocate space for this variable.

- Footer() is called after the whole program has been compiled. It calls EnumerateSymbols() with the parameter GlobalAsmDifinition() to enable allocation of space for global variables.
Before that it calls _ExitProcess, else we would start executing the data in the global variables when reaching the end of the program.

A test run of the symsimple.pb could now look like this:

var a as long
var b as long
a = 5
b = a
b = 8

--------------------------
        mov  eax, 5
        mov  [v_a], eax
        mov  eax, [v_a]
        mov  [v_b], eax
        mov  eax, 8
        mov  [v_b], eax
        push 0
        call _ExitProcess
        ; Global variables:
        v_a dd 0
        v_b dd 0

Notice how you get an error message when using undeclared variables, and also when declaring variables twice.

11 September 2009

13. Loops

The tutorial on if's was quite long, so i'll try to make this a short one. What we'll do is to make some loops. First the while loop.

The relevant parts of the grammar is as following:

...
while     ::= while <expression> <block> wend
statement ::= ( <if> | <while> | <assignment> ) <newline>
...

Make a copy of the file with the if statements and call it while.pb.

The first thing to do, is to add the follow token "wend" to the list recognized in Block():


      ...
      Case "endif", "else", "wend"
      ...

Then modify Statement() to recognize a while statement:


Procedure Statement(Name.s)
  ; statement ::= ( <if> | <while> | <assignment> ) <newline>
  Select Name
    Case "if":    DoIf()
    Case "while": DoWhile()
    Default:      Assignment(Name)
  EndSelect
  Match(#LF)
EndProcedure

A while loop is just like an if statement, with a goto start at the end:

while expression
  ...
wend

equals

ifstart:
if expression
  ...
  jmp ifstart
endif

Thus, DoWhile() will be pretty similar to the version of DoIf() that didn't support else.

Procedure DoWhile()
  ; while     ::= while <expression> <block> wend
  Protected L1.s = NewLabel()
  Protected L2.s = NewLabel()
  Protected Follow.s
  
  Emit(L1 + ":")
  Expression()
  Emit("cmp  eax, 0")
  Emit("jz   " + L2)
  Follow = Block()
  Emit("jmp  " + L1)
  Emit(L2 + ":")
  
  If Follow <> "wend"
    Expected("wend, got " + Follow)
  EndIf
EndProcedure

While we're at it, let's implement a for loop. Our for loop will have the following syntax:

for ::= for <variable> = <expression> to <expression> <block> next

A for loop works like this:


First the result of expression 1 is assigned to the variable.
Then we do this:
LblStart:
if expression 2 >= variable
  ... code within for loop goes here
  variable = variable + 1
  jmp LblStart
endif
LblStop:

The loop part (after the initial assignment) can also be written as a while loop:

while expression 2 >= variable
  variable = variable + 1
wend

You need to add "next" to the list of follow tokens in block, and add "for" to statement() to call DoFor(). Then write DoFor():

Procedure DoFor()
  ; for ::= for <variable> = <expression>(1) to <expression>(2) <block> next
  Protected Variable.s
  Protected Temp.s
  Protected Follow.s
  Protected LblStart.s = NewLabel()
  Protected LblStop.s  = NewLabel()
  
  Variable = GetName()
  MatchWhite('=')
  Expression() ; (1)
  Emit2("mov  ", VarValue(Variable), "eax")
  
  Emit(LblStart + ":")
  
  Temp = GetName()
  If Temp <> "to"
    Expected("To, got " + Temp)
  EndIf
  
  Expression() ; (2)
  Emit2("cmp  ", "eax", VarValue(Variable))
  Emit( "jl   " + LblStop)
  
  Follow = Block()
  If Follow <> "next"
    Expected("Next, got " + Follow)
  EndIf
  
  Emit("inc  " + VarValue(Variable))
  Emit("jmp  " + LblStart)
  
  Emit(LblStop + ":")
EndProcedure

Read the code thorougly until you understand it. Step through it with the debugger if necessary.

As you can see, we must go through Block() (and Statement()) to open a for loop. Since the for loop calls Block() for its body, and it uses no global or static variables, we already have support for nested for loops, and if's within for's and for's within if's.

Exercise:
1. Write the grammar and PureBasic code to compile a loop like PureBasic's Repeat ... Until .
2. Write the grammar and PureBasic code to compile a loop that looks like a while loop, but the expression should only be evaluated once, and it should be the number of times that the loop loops.
Example of such a loop:

I = 10 (assignment statement is not part of the loop)
Times I
  Print "Hello " + Str(I) (this is some dummy code for the Block())
Loop

The loop would run 10 times, each time printing "Hello 10". (I should not be changed during the loop.) Of course, you only implement the loop, not the print function.
Remember to allow nesting of loops. Hint: The loop counter may be stored on the stack.

Here is the listing (I added variables and parentheses as described in another chapter):


; for.pb
IncludeFile "..\base2.pb"
IncludeFile "..\labels.pb"

Procedure ToBool(reg.s)
  Emit("cmp  " + reg + ", 0")
  Emit("mov  ecx, 0")
  Emit("setnz cl")
  Emit("mov  " + reg + ", ecx")
EndProcedure

Procedure RelOp(Op.s)
  Emit("cmp  ebp, eax")
  Emit("mov  eax, 0")
  Select Op
    Case "=":   Emit("sete al")   ; set if equal
    Case "<":   Emit("setl al")   ; set if less 
    Case "<=":  Emit("setle al")  ; less or equal
    Case ">":   Emit("setg al")   ; set if greater
    Case ">=":  Emit("setge al")  
    Case "<>":  Emit("setne al")  ; set if not equal
    Default
      Expected("Relational operator, got " + Op)
  EndSelect
EndProcedure

Procedure.s VarValue(Var.s)
  ProcedureReturn "[v_" + Var + "]"
EndProcedure

Declare Expression()
Procedure Value()
  ; value  ::= <integer> | <variable> | '(' <expression> ')'
  If IsDigit(Look)
    Emit("mov  eax, " + GetInteger())
  ElseIf IsAlpha(Look)
    Emit("mov  eax, " + VarValue(GetName()))
  ElseIf Look = '('
    MatchWhite('(')
    Expression()
    MatchWhite(')')
  Else
    Expected("value")
  EndIf
EndProcedure

Procedure MulExp()
  ; mulexp     ::= <value> {<mulop> <value>}
  Protected Operation.c
  Value()
  While Look = '*' Or Look = '/' Or Look = '%'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    Value()
    Emit("pop  ebp")
    Select Operation
      Case '*': Emit("imul eax, ebp")
      Case '/', '%'
                Emit("xchg eax, ebp")
                Emit("cdq") ; expands eax to edx:eax
                Emit("idiv ebp")
    EndSelect
    If Operation = '%'
      Emit("mov  eax, edx")
    EndIf
  Wend
EndProcedure

Procedure AddExp()
  ; addexp     ::= <mulexp> {<addop> <mulexp> }
  Protected Operation.c
  MulExp()
  While Look = '+' Or Look = '-'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    MulExp()
    Emit("pop  ebp")
    Select Operation
      Case '+': Emit("add  eax, ebp")
      Case '-': Emit("sub  eax, ebp")
                Emit("neg  eax")
    EndSelect
  Wend
EndProcedure

Procedure RelExp()
  ; relexp     ::= <addexp> { <boolop> <addexp> }
  Protected Operation.s
  AddExp()
  While Look = '<' Or Look = '>' Or Look = '='
    Operation = Chr(Look)
    GetLook()
    Select Look
      Case '<', '>', '='
        Operation + Chr(Look)
        GetLook()
    EndSelect
    EatWhite()
    Emit("push eax")
    AddExp()
    Emit("pop  ebp")
    RelOp(Operation) ; Generate code
  Wend
EndProcedure

Procedure BoolExp()
  ; boolexp    ::= <relexp> { <boolop> <relexp> }
  Protected Operation.s
  RelExp()
  While Look = 'a' Or Look = 'o' Or Look = 'x'
    Operation = GetName()
    Emit("push eax")
    RelExp()
    Emit("pop  ebp")
      ToBool("eax")
      ToBool("ebp")
      Select Operation
        Case "and": Emit("test eax, ebp")
        Case "or":  Emit("or   eax, ebp")
        Case "xor": Emit("xor  eax, ebp")
        Default
          Expected("Operator, got " + Operation)
      EndSelect
      Emit("mov  eax, 0")
      Emit("setnz al")
  Wend
EndProcedure

Procedure Expression()
  ; expression ::= <BoolExp>
  BoolExp()
EndProcedure

Procedure Assignment(Name.s)
  ; assignment ::= <name> = <expression>
  MatchWhite('=')
  Expression()
  Emit2("mov  ", VarValue(Name), "eax")
EndProcedure

Declare.s Block()

Procedure.s DoElse(Label.s)
  Protected Label2.s = NewLabel()
  Protected Follow.s
  
  Emit("jmp  " + Label2)
  Emit(Label + ":")
  Follow = Block()
  Emit(Label2 + ":")
  
  ProcedureReturn Follow
EndProcedure

Procedure DoIf()
  Protected Label.s
  Protected FollowToken.s
  Label = NewLabel()  ; Used when jumping around the block
  
  Expression()        ; The conditional expression (do we jump or not?)
  Emit("cmp  eax, 0") ; Check if the expression is true or false
  Emit("jz   " + Label) ; if it's false (zero) jump to label
  FollowToken = Block() ; else fall through to this block (which is always compiled)
  
  If FollowToken = "else"
    FollowToken = DoElse(Label)
  Else
    Emit(Label + ":")
  EndIf
  
  If FollowToken <> "endif"
    Expected("endif, got " + FollowToken)
  EndIf
EndProcedure

Procedure DoWhile()
  ; while     ::= while <expression> <block> wend
  Protected L1.s = NewLabel()
  Protected L2.s = NewLabel()
  Protected Follow.s
  
  Emit(L1 + ":")
  Expression()
  Emit("cmp  eax, 0")
  Emit("jz   " + L2)
  Follow = Block()
  Emit("jmp  " + L1)
  Emit(L2 + ":")
  
  If Follow <> "wend"
    Expected("wend, got " + Follow)
  EndIf
EndProcedure

Procedure DoFor()
  ; for ::= for <variable> = <expression>(1) to <expression>(2) <block> next
  Protected Variable.s
  Protected Temp.s
  Protected Follow.s
  Protected LblStart.s = NewLabel()
  Protected LblStop.s  = NewLabel()
  
  Variable = GetName()
  MatchWhite('=')
  Expression() ; (1)
  Emit2("mov  ", VarValue(Variable), "eax")
  
  Emit(LblStart + ":")
  
  Temp = GetName()
  If Temp <> "to"
    Expected("To, got " + Temp)
  EndIf
  
  Expression() ; (2)
  Emit2("cmp  ", "eax", VarValue(Variable))
  Emit( "jl   " + LblStop)
  
  Follow = Block()
  If Follow <> "next"
    Expected("Next, got " + Follow)
  EndIf
  
  Emit("inc  " + VarValue(Variable))
  Emit("jmp  " + LblStart)
  
  Emit(LblStop + ":")
EndProcedure

Procedure Statement(Name.s)
  ; statement ::= ( <if> | <while> | <assignment> ) <newline>
  Select Name
    Case "if":    DoIf()
    Case "while": DoWhile()
    Case "for":   DoFor()
    Default:      Assignment(Name)
  EndSelect
  Match(#LF)
EndProcedure

Procedure.s Block()
  ; block      ::= { <statement> }
  Protected Name.s
  EatWhiteNewlines()
  While Look <> 0
    Name = GetName()
    Select Name
      Case "endif", "else", "wend", "next"
        ProcedureReturn Name
      Default
        Statement(Name)
    EndSelect
    EatWhiteNewlines()
  Wend
EndProcedure

Procedure Program()
  ; program    ::= <block>
  EatWhiteNewlines() ; Allow for whitespace before program start
  Protected Follow.s = Block()
  If Follow <> ""
    Expected("Dangling " + Follow)
  EndIf
EndProcedure

InitMulti()
Program()
Input()

12. Multiline and If statements

I decided to implement if statements onto boolrel.pb. This way we have boolean operators available. However, in this file we did not yet write support for variables or parentheses. It will be your job to fit all the parts together. Even so, I recommend that you do this exactly as I do to learn it first.

If statements only make sense when there is two different code paths to choose from. This means we must implement support for compiling a whole program consisting of multiple lines. It could be represented by this BNF grammar:

assignment ::= <name> = <expression>
statement  ::= <assignment> <newline>
block      ::= { <statement> }
program    ::= <block>

You can see that we have implemented support for multiple statements (each on a single line) in the "rule" (procedure) block. A statement can be an assignment statement, an if statement, a while statement, a variable declaration, and so on. In the grammar above, the only allowed kind of statement is an assignment.

Let's add another kind of statement, the if statement:

assignment ::= <name> = <expression>
if         ::= if <expression> <block> endif
statement  ::= ( <ifst> | <assignment> ) <newline>
block      ::= { <statement> }
program    ::= <block>

An if statement starts with the literal keyword if (this is not the same as the rule <if>). Then comes an expression, which is evaluated to tell us whether to jump around the if or execute the block of code inside it. This is followed by a block (which is simply a list of statements) and ended with the keyword endif. For now, we don't have an else clause.

I want you to pay close attention to the recursive nature of the definition. A block contains a list of statements, some of can be if statements. Each if statement contains a block, which can also consists of if statements. *This means we already have support for nesting ifs!*

Of course, provided we write the code.

If you payed enough attention to the grammar, you would have noticed that it is a bit reduntant. A program is simply defined as a single block. In theory, we could simply delete the rule program and rename the rule block into program.
In practice, this is not so. Because we must handle the so-called "follow tokens" (endif, wend, while, endprocedure) somewhere. I thought block was a good place for this. But at the end of a program there is no follow token, so while block must return on a follow token, program must give an error message ("endif without if"). Thus it is more elegant to use different procedures.
Also, it is just nonsense to say that an if statement consists of the keyword if, followed by a _program_, and then endif. It is more correct to say the if keyword is followed by a block.

First of all we must remove the obnoxious check for end of input from Expression(). This is what it should look like:

Procedure Expression()
  ; expression ::= <BoolExp>
  BoolExp()
EndProcedure

In the assignment statement we fake support for variables, in a real compiler use a procedure similar to Assignment() from the previous chapter. A difference is that we get the name as a parameter. This is because Block() and Statement() need to examine the name first, to see if they need to handle it in a different way. For instance, if the name was "if" or "endif", then this would not be an assignment statement, and Assignment() would never be called.

Procedure Assignment(Name.s)
  ; assignment ::= <name> = <expression>
  MatchWhite('=')
  Expression()
  ; I simply implemented support for variables here
  ; Normally, you would call VarValue() as shown in the
  ; previous tutorials, but we don't have VarValue() in
  ; this program (to not clutter it too much)
  Emit("mov  [v_" + Name + "], eax")
EndProcedure

For now, we'll just a stub for the DoIf() procedure:

Procedure DoIf()
  Emit("; do if")
EndProcedure

Don't be tempted to name this procedure If(). Bonus points if you guess why.

Then the procedure Statement().

Procedure Statement(Name.s)
  ; statement  ::= ( <ifst> | <assignment> ) <newline>
  ; alternatively,
  ; statement  ::= <ifst> <newline> | <assignment> <newline>
  Select Name
    Case "if": DoIf()
    Default:   Assignment(Name)
  EndSelect
  Match(#LF)
EndProcedure

Statement is simple: Either it's an if statement, or it's an assignment statement. It then calls the relevant procedure to take care of the details.
As you can see, we require a linefeed at the end of each statement. This is a common rule in BASIC-like languages. In C-like languages, a semicolon is used for this purpose. I purposely don't match #CR, because it's too much trouble. Instead we'll make sure that #CR is simply dropped from the input, leaving us with only #LF.

Procedure.s Block()
  ; block      ::= { <statement> }
  Protected Name.s
  EatWhiteNewlines()
  While Look <> 0
    Name = GetName()
    Select Name
      Case "endif"
        ProcedureReturn Name
      Default
        Statement(Name)
    EndSelect
    EatWhiteNewlines()
  Wend
EndProcedure

Block() gets a name and checks if it is a follow token. If it is, then Block() returns the follow token. In this way, a follow token marks the end of a block, which is correct.
If the name is not recognized, Block() passes the name on to Statement(). Block() keeps doing this until the end of the input stream is reached (but remember, it may return before, if it encounters an "endif").
EatWhiteNewlines() is a new procedure that eats all whitespace and newlines. This is to allow for space between the statements in our program.

The rule program should be easy to implement as a procedure:

Procedure Program()
  ; program    ::= <block>
  Protected Follow.s = Block()
  If Follow <> ""
    Expected("Dangling " + Follow)
  EndIf
EndProcedure

It is simply a block. Do you remember that Block() was supposed to return a follow token? Program() checks for this, and if Block() returns "endif", "endprocedure" or anything like that, we give an error message.
Since Block() simply returns an empty string if it reaches the end of the file, Program() shows no error in this case. Because after all, the end of the program is when we reach the end of the file, and we shouldn't give an error for that, since it's correct.

There is a huge problem with our program as a whole. Which is why we are still using a dummy for DoIf(). What's the problem? The compiler expects an input of several lines. We can only input one line!

Normally, one would read from a file. You can do that if you want to, just change Init() and GetLook(), and be sure to skip #CR characters. I chose to add support for entering several lines in the console instead, for easier experimentation. Simply add these procedures to base2.pb, and call InitMulti() instead of Init().

Procedure IsNewline(C.c)
  If Look = #LF
    ProcedureReturn 1
  EndIf
EndProcedure

Procedure EatWhiteNewlines()
  While IsWhite(Look) Or IsNewline(Look)
    GetLook()
  Wend
EndProcedure

Procedure InitMulti()
  OpenConsole()
  Protected TempIn.s
  Repeat
    TempIn = Input()
    Instream + TempIn + #LF$
  Until TempIn = ""
  PrintN("--------------------------")
  GetLook()
EndProcedure

Now you should test that multi-line input works. Call InitMulti() and Program() instead of Init() and Expression():

InitMulti()
Program()
Input()

Your console window should look like this:
a = 5
g = 7
f = 5

--------------------------
mov eax, 5
mov [v_a], eax
mov eax, 7
mov [v_g], eax
mov eax, 5
mov [v_f], eax

As you can see, we terminate the input with an empty line.

Now let's get to the if statement.

if         ::= if <expression> <block> endif

What we need to do is to jump around the block if the expression is false. To jump in assembly, we need to have a label to jump to. Which means we need a way to generate unique labels. I made a separate file for this, labels.pb, which I included at the top of the program.

; This file generates unique labels for use in the program

Procedure.s NewLabel()
  Static LabelCounter
  LabelCounter + 1
  ; The cl_ prefix stands for compiler label
  ; to avoid a name clash with labels used in the language
  ; (which we can give the prefix l_)
  ProcedureReturn "cl_" + Str(LabelCounter)
EndProcedure

Very simple, it just increases a counter and generates labels like cl_1, cl_2, cl_3, cl_10593 and so on.

Here's DoIf(), finally:

Declare.s Block()
Procedure DoIf()
  Protected Label.s
  Protected FollowToken.s
  Label = NewLabel()  ; Used when jumping around the block
  
  Expression()        ; The conditional expression (do we jump or not?)
  Emit("cmp  eax, 0") ; Check if the expression is true or false
  Emit("jz   " + Label) ; if it's false (zero) jump to label
  FollowToken = Block() ; else fall through to this block (which is always compiled)
  
  Emit(Label + ":")
  
  If FollowToken <> "endif"
    Expected("endif, got " + FollowToken)
  EndIf
EndProcedure

We create a new label with our NewLabel() procedure. Then we get an ordinary expression, and compares it to 0 to check if it's true or false. If it's zero (false), we jump to the label.
In either case, we compile a block. The block returns the name that caused it to return, or nothing if we are at the end of the file. If this is an empty string or anything else than endif, the program has an error and compilation must be stopped.
At last, we emit the label, our jump target.

Notice again that DoIf() calls Block(), which may call DoIf() again if required to have nested ifs without any extra work on our part. This is the result of a well thought-out grammar and following the grammar exactly when implementing.

The compiler should now accept inputs like this:

if 6 and 2 < 8
  a = 1
  if 1
    b = 2
  endif
endif

Be sure to test if the generated code gives the expected result. When using labels from asm in PureBasic you need to add a ! before the label definition, or add l_ before all references to the label (because PB puts l_ in front of all BASIC labels). Note that you can change the numbers in the generated code to see if works for all inputs, rather than re-entering the whole thing with different numbers.

Now we'll implement else. First we must set Block() to recognize this as a follow token. Simply add ", "else"" (without the quotes) to the case line like this:

    Case "endif", "else"

Then we must, in DoIf(), check if the follow token returned from Block() is "else". If it is, we call a new procedure, DoElse(), and also don't output the label.

Procedure DoIf()
  Protected Label.s
  Protected FollowToken.s
  Label = NewLabel()  ; Used when jumping around the block
  
  Expression()        ; The conditional expression (do we jump or not?)
  Emit("cmp  eax, 0") ; Check if the expression is true or false
  Emit("jz   " + Label) ; if it's false (zero) jump to label
  FollowToken = Block() ; else fall through to this block (which is always compiled)
  
  If FollowToken = "else" ; New code
    FollowToken = DoElse(Label)
  Else
    Emit(Label + ":")
  EndIf
  
  If FollowToken <> "endif"
    Expected("endif, got " + FollowToken)
  EndIf
EndProcedure

DoElse() is going to call Block() also, and will return the follow token, which should still be endif.

Procedure DoElse(Label.s)
  Protected Label2.s = NewLabel()
  Protected Follow.s
  
  Emit("jmp  " + Label2)
  Emit(Label + ":")
  Follow = Block()
  Emit(Label2 + ":")
  
  ProcedureReturn Follow
EndProcedure

We need a new label, so that we can jump around the else branch (from the if branch). The first thing we do is to emit this a jump to this label. This jump is actually not in the else branch, but at the end of the if branch. Then we emit the label that is jumped to when the if expression is false (else branch).
After this label, we place the code from a Block(). Finally we emit label2, so that the code from the if branch can jump around the else branch.
And then we return Follow so that it can be checked in DoIf(), because having the check both here and in DoIf() would be unelegant.

Done. With a lot support code and a tiny bit of actual if code, we now have if .. endif and if .. else .. endif. If you merge this with support for variables and parentheses you're on your a way to a complete programming language!


; if3.pb
IncludeFile "..\base2.pb"
IncludeFile "..\labels.pb"

Procedure ToBool(reg.s)
  Emit("cmp  " + reg + ", 0")
  Emit("mov  ecx, 0")
  Emit("setnz cl")
  Emit("mov  " + reg + ", ecx")
EndProcedure

Procedure RelOp(Op.s)
  Emit("cmp  ebp, eax")
  Emit("mov  eax, 0")
  Select Op
    Case "=":   Emit("sete al")   ; set if equal
    Case "<":   Emit("setl al")   ; set if less 
    Case "<=":  Emit("setle al")  ; less or equal
    Case ">":   Emit("setg al")   ; set if greater
    Case ">=":  Emit("setge al")  
    Case "<>":  Emit("setne al")  ; set if not equal
    Default
      Expected("Relational operator, got " + Op)
  EndSelect
EndProcedure

Procedure Value()
  ; value  ::= <integer>
  Emit("mov  eax, " + GetInteger())
EndProcedure

Procedure MulExp()
  ; mulexp     ::= <value> {<mulop> <value>}
  Protected Operation.c
  Value()
  While Look = '*' Or Look = '/' Or Look = '%'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    Value()
    Emit("pop  ebp")
    Select Operation
      Case '*': Emit("imul eax, ebp")
      Case '/', '%'
                Emit("xchg eax, ebp")
                Emit("cdq") ; expands eax to edx:eax
                Emit("idiv ebp")
    EndSelect
    If Operation = '%'
      Emit("mov  eax, edx")
    EndIf
  Wend
EndProcedure

Procedure AddExp()
  ; addexp     ::= <mulexp> {<addop> <mulexp> }
  Protected Operation.c
  MulExp()
  While Look = '+' Or Look = '-'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    MulExp()
    Emit("pop  ebp")
    Select Operation
      Case '+': Emit("add  eax, ebp")
      Case '-': Emit("sub  eax, ebp")
                Emit("neg  eax")
    EndSelect
  Wend
EndProcedure

Procedure RelExp()
  ; relexp     ::= <addexp> { <boolop> <addexp> }
  Protected Operation.s
  AddExp()
  While Look = '<' Or Look = '>' Or Look = '='
    Operation = Chr(Look)
    GetLook()
    Select Look
      Case '<', '>', '='
        Operation + Chr(Look)
        GetLook()
    EndSelect
    EatWhite()
    Emit("push eax")
    AddExp()
    Emit("pop  ebp")
    RelOp(Operation) ; Generate code
  Wend
EndProcedure

Procedure BoolExp()
  ; boolexp    ::= <relexp> { <boolop> <relexp> }
  Protected Operation.s
  RelExp()
  While Look = 'a' Or Look = 'o' Or Look = 'x'
    Operation = GetName()
    Emit("push eax")
    RelExp()
    Emit("pop  ebp")
      ToBool("eax")
      ToBool("ebp")
      Select Operation
        Case "and": Emit("test eax, ebp")
        Case "or":  Emit("or   eax, ebp")
        Case "xor": Emit("xor  eax, ebp")
        Default
          Expected("Operator, got " + Operation)
      EndSelect
      Emit("mov  eax, 0")
      Emit("setnz al")
  Wend
EndProcedure

Procedure Expression()
  ; expression ::= <BoolExp>
  BoolExp()
EndProcedure

Procedure Assignment(Name.s)
  ; assignment ::= <name> = <expression>
  MatchWhite('=')
  Expression()
  ; I simply implemented support for variables here
  ; Normally, you would call VarValue() as shown in the
  ; previous tutorials, but we don't have VarValue() in
  ; this program (to not clutter it too much)
  Emit("mov  [v_" + Name + "], eax")
EndProcedure

Declare.s Block()

Procedure.s DoElse(Label.s)
  Protected Label2.s = NewLabel()
  Protected Follow.s
  
  Emit("jmp  " + Label2)
  Emit(Label + ":")
  Follow = Block()
  Emit(Label2 + ":")
  
  ProcedureReturn Follow
EndProcedure

Procedure DoIf()
  Protected Label.s
  Protected FollowToken.s
  Label = NewLabel()  ; Used when jumping around the block
  
  Expression()        ; The conditional expression (do we jump or not?)
  Emit("cmp  eax, 0") ; Check if the expression is true or false
  Emit("jz   " + Label) ; if it's false (zero) jump to label
  FollowToken = Block() ; else fall through to this block (which is always compiled)
  
  If FollowToken = "else"
    FollowToken = DoElse(Label)
  Else
    Emit(Label + ":")
  EndIf
  
  If FollowToken <> "endif"
    Expected("endif, got " + FollowToken)
  EndIf
EndProcedure

Procedure Statement(Name.s)
  ; statement  ::= ( <ifst> | <assignment> ) <newline>
  ; alternatively,
  ; statement  ::= <ifst> <newline> | <assignment> <newline>
  Select Name
    Case "if": DoIf()
    Default:   Assignment(Name)
  EndSelect
  Match(#LF)
EndProcedure

Procedure.s Block()
  ; block      ::= { <statement> }
  Protected Name.s
  EatWhiteNewlines()
  While Look <> 0
    Name = GetName()
    Select Name
      Case "endif", "else"
        ProcedureReturn Name
      Default
        Statement(Name)
    EndSelect
    EatWhiteNewlines()
  Wend
EndProcedure

Procedure Program()
  ; program    ::= <block>
  EatWhiteNewlines() ; Allow for whitespace before program start
  Protected Follow.s = Block()
  If Follow <> ""
    Expected("Dangling " + Follow)
  EndIf
EndProcedure

InitMulti()
Program()
Input()

06 September 2009

11. Optimization: Register stack (and assignments)

Today we will do the most fun part of all programming: premature optimization. Remember that the current code generation is pretty lousy and riddled with tons of pushes and pops? We'll get rid of them now.

Instead of using the stack we will use more registers. When we would normally push a value onto the stack to get space for a new value, we will now simply take a new register for the new value.

I made a new register manager (registers.pb) that allows us to easily get unused registers and dispose them when done with them. It's really simple.

#RegMax = 7 ; Highest valid register ID

Structure SRegister
  Name.s
  Used.i
EndStructure

Global Dim Registers.SRegister(#RegMax)
Registers(0)\Name = "Error: Null register"
Registers(1)\Name = "eax"
Registers(2)\Name = "edx"
Registers(3)\Name = "ecx"
Registers(4)\Name = "ebx"
Registers(5)\Name = "edi"
Registers(6)\Name = "ebp"
Registers(7)\Name = "esi"

Procedure RegGet()
  Protected I
  For I = 1 To #RegMax
    If Registers(I)\Used = 0
      Registers(I)\Used = 1
      ProcedureReturn @Registers(I)
    EndIf
  Next
  Error("Too complex expression (out of cpu registers)")
EndProcedure

Procedure.s RegStr(*Reg.SRegister)
  ProcedureReturn *Reg\Name
EndProcedure

Procedure RegDispose(*Reg.SRegister)
  *Reg\Used = 0
EndProcedure

Procedure.s RegUse(*Reg.SRegister)
  RegDispose(*Reg)
  ProcedureReturn RegStr(*Reg)
EndProcedure

This code is rather simple. RegGet() finds an unused register and returns a pointer to the corresponding item in the array. RegStr() converts the pointer into a string. RegDispose() sets the state of the register to unused again. RegUse() returns the corresponding string and sets the register as unused.

We will now implement this on var.pb. Copy var.pb into regstack.pb. But before we go any further, we need a macro to simplify code generation a bit. Put this into base2.pb:

Macro Emit2(Instruction, Op1, Op2)
  Emit(Instruction + Op1 +", " + Op2)
EndMacro

It works just like Emit(), except it takes separate parameters.

Now on to the real deal. AddExp() should now be able to use two arbitrary registers instead of eax and ebp. But how can it know _which_ registers? A good idea here is to let each procedure return which registers it left its result in. So if the result of MulExp() is in eax, it should return a pointer that, when passed to RegStr(), gives the string eax.

Here is the new AddExp():

Procedure AddExp()
  ; addexp     ::=  {+  }
  Protected *R1, *R2
  *R1 = MulExp()
  While Look = '+'
    GetLookWhite()
    *R2 = MulExp()
    Emit2("add  ", RegStr(*R1), RegUse(*R2))
  Wend
  ProcedureReturn *R1
EndProcedure

As you can see, the first register is the result of the first call to MulExp(). If Look is not +, then that register also contains the result of AddExp(). If Look IS +, then we take the other operand from MulExp(), perform an add operation (where we discard the register *R2, since we have the result in *R1), and the result is still in *R1. So we return *R1 at the end of the procedure.

AddExp() never allocates any registers, because it gets them from MulExp(). Which is nice and easy on AddExp(), but where does MulExp() get the registers from? Look here:

Procedure MulExp()
  ; mulexp     ::=  {* }
  Protected *R1, *R2
  *R1 = Value()
  While Look = '*'
    GetLookWhite()
    *R2 = Value()
    Emit2("imul ", RegStr(*R1), RegUse(*R2))
  Wend
  ProcedureReturn *R1
EndProcedure

MulExp() gets its registers from Value(). Which is the procedure that needs the biggest changes:

Procedure Value()
  ; value  ::=  |  | '('  ')'
  Protected *R1
  If IsDigit(Look)
    *R1 = RegGet()
    Emit2("mov  ", RegStr(*R1), GetInteger())
  ElseIf IsAlpha(Look)
    *R1 = RegGet()
    Emit2("mov  ", RegStr(*R1), VarValue(GetName()))
  ElseIf Look = '('
    MatchWhite('(')
    *R1 = Expression()
    MatchWhite(')')
  EndIf
  ProcedureReturn *R1
EndProcedure

As you can see, Value() allocates a new register to put its value into, except when the value is the result of an expression, which must of course already be in an allocated register. Speaking of which, Expression() must be changed to return which register the result is in:

Procedure Expression()
  ; expression ::= <AddExp>
  ProcedureReturn AddExp()
EndProcedure

This may be a bit confusing. Expression() gets its register from AddExp(), which gets its registers from MulExp(), which gets its register from Value(), which, in turn, gets its register from Expression()! This seems like it may run for ever. But Value() will only get its result from Expression() when it sees an opening parenthesis, so unless the source contains an unlimited amount of opening parenthesis (or we forget to throw them away with MatchWhite()) then our program will not run for ever.

Since we are on a roll now, why not implement an assignment statement to round it off nicely? Here is the syntax:

; assignment  ::= <name> = <expression>

That should be simple:

Procedure Assignment()
  ; assignment  ::= <name> = <expression>
  Protected *R
  Protected Name.s
  Name = GetName()
  MatchWhite('=')
  *R = Expression()
  Emit2("mov  ", VarValue(Name), RegUse(*R))
EndProcedure

I want to point out that for an expression like (543+54)*(28+48), we have reduced the number of instructions from 13 to 7. Additionally, using only registers are faster than using the stack, because the stack means memory accesses (slow).

Here is the listing (note that it accepts assignment statements only):

; regstack.pb
IncludeFile "..\base2.pb"
IncludeFile "..\registers.pb"

Procedure.s VarValue(Var.s)
  ProcedureReturn "[v_" + Var + "]"
EndProcedure

Declare Expression()
Procedure Value()
  ; value  ::=  |  | '('  ')'
  Protected *R1
  If IsDigit(Look)
    *R1 = RegGet()
    Emit2("mov  ", RegStr(*R1), GetInteger())
  ElseIf IsAlpha(Look)
    *R1 = RegGet()
    Emit2("mov  ", RegStr(*R1), VarValue(GetName()))
  ElseIf Look = '('
    MatchWhite('(')
    *R1 = Expression()
    MatchWhite(')')
  EndIf
  ProcedureReturn *R1
EndProcedure

Procedure MulExp()
  ; mulexp     ::=  {* }
  Protected *R1, *R2
  *R1 = Value()
  While Look = '*'
    GetLookWhite()
    *R2 = Value()
    Emit2("imul ", RegStr(*R1), RegUse(*R2))
  Wend
  ProcedureReturn *R1
EndProcedure

Procedure AddExp()
  ; addexp     ::=  {+  }
  Protected *R1, *R2
  *R1 = MulExp()
  While Look = '+'
    GetLookWhite()
    *R2 = MulExp()
    Emit2("add  ", RegStr(*R1), RegUse(*R2))
  Wend
  ProcedureReturn *R1
EndProcedure

Procedure Expression()
  ; expression ::= 
  ProcedureReturn AddExp()
EndProcedure

Procedure Assignment()
  ; assignment  ::=  = 
  Protected *R
  Protected Name.s
  Name = GetName()
  MatchWhite('=')
  *R = Expression()
  Emit2("mov  ", VarValue(Name), RegUse(*R))
EndProcedure

Init()
Assignment()
If Look <> 0
  Expected("End of line")
EndIf
Input()

04 September 2009

10. Variables

Copy paren.pb to var.pb. We will now implement support for variables in the expression parser. Here is the old grammar:

digit      ::= 0..9
integer    ::= <digit> {<digit>}
value      ::= <integer> | '(' <expression> ')'
mulop      ::= * | /
addop      ::= + | -
mulexp     ::= <value>  { <mulop> <value> }
addexp     ::= <mulexp> { <addop> <mulexp> }
expression ::= <addexp>

A variable can represent a value, right? In fact, we don't actually want the variable itself, we want the value stored in that variable.

So we can add a bit to our grammar:

name       ::= <alpha> { <alpha> | <digit> }
variable   ::= <name>
value      ::= <integer> | <variable> | '(' <expression> ')'

Adding support for this in Value() should be easy:

Procedure Value()
  ; value  ::= <integer> | <variable> | '(' <expression> ')'
  If IsDigit(Look)
    Emit("mov  eax, " + GetInteger())
  ElseIf IsAlpha(Look)
    Emit("mov  eax, " + VarValue(GetName()))
  ElseIf Look = '('
    MatchWhite('(')
    Expression()
    MatchWhite(')')
  EndIf
EndProcedure

We can see an extra procedure here: VarValue(). What this does is to generate asm code to get the value of a variable by using a memory reference (the square brackets around the name). I have chosen to prefix all global variable names with v_ to avoid name collisions with internal labels (which we need when generating if's and so on).

Procedure.s VarValue(Var.s)
  ProcedureReturn "[v_" + Var + "]"
EndProcedure

Done. This isn't getting any harder, is it? Of course we need to register all variables in a symbol table and actually allocate space for them, but that's easy anyways. Maybe I'll show it later.

02 September 2009

9. Parentheses

From now on I will not build on the same source file every time. Because it becomes so long and tedious to manage. Instead I will show you the concepts on a small scale, and you can then implement it yourself in a bigger project if you want to.

In this chapter, we will build on mul.pb, so make a copy of it (name it "parens.pb").
To have the same source as I do, search and replace ecx into ebp.

Here is the old grammar:

digit      ::= 0..9
integer    ::= <digit> {<digit>}
value      ::= <integer>
mulop      ::= * | /
addop      ::= + | -
mulexp     ::= <value>  { <mulop> <value> }
addexp     ::= <mulexp> { <addop> <mulexp> }
expression ::= <addexp>

What does value() do? It puts a value in eax.
What does expression() do? It puts a value in eax.
What is inside a parenthesis? Is it not just another ordinary expression? Yes.

So, why not do it like this?

value      ::= <integer> | '(' <expression> ')'

That's what we are going to do. Change Value() into this:

Declare Expression()
Procedure Value()
  ; value  ::= <integer> | '(' <expression> ')'
  If IsDigit(Look)
    Emit("mov  eax, " + GetInteger())
  ElseIf Look = '('
    MatchWhite('(')
    Expression()
    MatchWhite(')')
  EndIf
EndProcedure

We had to forward declare Expression() to avoid an error message (since it's defined later in the source code). Also you can see two calls to a new function here: MatchWhite(). What it does is to verify that Look is a certain character, discard Look, and then eat any following whitespace. Here's the implementation, put it in base2.pb:

Procedure Match(C.c)
  If Look <> C
    Expected(Chr(C))
  EndIf
  GetLook()
EndProcedure

Procedure MatchWhite(C.c)
  Match(C)
  EatWhite()
EndProcedure

Now that was simple, right? Unfortunately it's not quite right just yet. Currently Expression() gives an error message if it does not reach the end of the line. We must change that. Here is the entire code, with the end of line check moved out of Expression():


; paren.pb
IncludeFile "..\base2.pb"

; Procedure.s VarRef(Var.s)
;   ProcedureReturn "[v_" + Var + "]"
; EndProcedure

Declare Expression()
Procedure Value()
  ; value  ::= <integer> | '(' <expression> ')'
  If IsDigit(Look)
    Emit("mov  eax, " + GetInteger())
  ElseIf Look = '('
    MatchWhite('(')
    Expression()
    MatchWhite(')')
  EndIf
EndProcedure

Procedure MulExp()
  ; mulexp     ::= <value> {* <value>}
  Value()
  While Look = '*'
    GetLookWhite()
    Emit("push eax")
    Value()
    Emit("pop  ebp")
    Emit("imul eax, ebp")
  Wend
EndProcedure

Procedure AddExp()
  ; addexp     ::= <mulexp> {+ <mulexp> }
  MulExp()
  While Look = '+'
    GetLookWhite()
    Emit("push eax")
    MulExp()
    Emit("pop  ebp")
    Emit("add  eax, ebp")
  Wend
EndProcedure

Procedure Expression()
  ; expression ::= <AddExp>
  AddExp()
EndProcedure

Init()
Expression()
If Look <> 0
  Expected("End of line")
EndIf
Input()

01 September 2009

8. Boolean and Relational operators

Boolean and relational operators can be treated in a variety of ways. We will treat them in a way that is simple, easy and correct, but inefficient. Each operator will return the integer 1 for true and 0 for false. The language will consider 0 to be false and any other value (not just 1) to be true.

This is the absolutely easiest way of doing it, and also allows us to use these operators in freely in any expression, even for assignments: bigger = a > 0.

The first thing we need to do when adding new operators is to decide their precedence. I have decided to give boolean operators the lowest precedence (evaluated last), followed by relational operators, followed by mulops, and then addops. Because then we don't have to use parentheses when writing an expression like this:

If   a+1  >  6*b   And  5-c
If ((a+1) > (6*b)) And (5-c)

As you can see, the operators with the lowest precedence are evaluated last, because they are in the outermost implicit parentheses.
Speaking of which, we have forgotten to implement parentheses! I'll save that for the next installment.

Which operators do we need? We need And, Or, Xor, <, >, <=, >= and <>. Also, you may be thinking of Not. Not is a unary operator (takes only one argument), so it must be treated separately.

We will start with And. What it needs to do, is to look at both operands, and if both are true (different from 0), it should place 1 in eax. Else it should place 0 in eax.

It turns out ebp wasn't well suited, because we don't have separate access to the low 8-bit part of the register. I'll continue with ebp, but the code to convert a value to to a boolean can be made slightly faster if you use ecx instead. Since we do this for learning, we should not care about efficiency.

Here is the new grammar:

digit      ::= 0..9
integer    ::= <digit> { <digit> }
value      ::= <integer>
mulop      ::= * | /
addop      ::= + | -
relop      ::= < | > | <= | >= | <>
boolop     ::= And | Or | Xor
mulexp     ::= <value> {<mulop> <value>}
addexp     ::= <mulexp> {<addop> <mulexp> }
relexp     ::= <addexp> { <relop> <addexp> }
boolexp    ::= <relexp> {<boolop> <relexp> }
expression ::= <BoolExp>

We will use the following code in several places, so I placed it in a separate procedure:

Procedure ToBool(reg.s)
  Emit("cmp  " + reg + ", 0")
  Emit("mov  ecx, 0")
  Emit("setnz cl")
  Emit("mov  " + reg + ", ecx")
EndProcedure

It converts an integer into another integer, which may only be 0 and 1. Setnz sets cl (the lower part of ecx) to 1 if reg-0 was NOT zero. The result is that ecx becomes 0 when reg was 0, else ecx becomes 1.

BoolExp() looks just like the other procedures:

Procedure BoolExp()
  ; boolexp     ::= <relexp> {<boolop> <relexp> }
  Protected Operation.s
  RelExp()
  While Look = 'a' ; start of and
    Operation = GetName()
    Emit("push eax")
    RelExp()
    Emit("pop  ebp")
    Select Operation
      Case "and": ToBool("eax") ; Convert to 0 or 1 depending on
                  ToBool("ebp") ; truth value of input
                  Emit("and eax, ebp")
                  Emit("mov  eax, 0")
                  Emit("setnz al")
      Default
        Expected("Operator, got " + Operation)
    EndSelect
  Wend
EndProcedure

First, both operands are converted from their initial value into 0 and 1 (depending on their truth value). Then we perform a bitwise and operation, which works as a boolean and because of the limited input range (values are either 0 or 1).

You should now rewrite Expression() to call BoolExp() instead of AddExp(). Then write a dummy procedure, RelExp() that only calls AddExp(). You compiler should now have support for the And operator! Try it!

Note, that the compiler is currently case sensitive. Don't try to remedy this in BoolExp() - it should be done globally in the base file. If you want to fix it now, simply put an LCase() in GetLook(), and be sure to only expect lowercase letters in Look.

When you implement Or and Xor, you should use the or and xor cpu instructions instead of the and instruction. Since all these operators share a lot of code, we can make the code generation shared, too:

Procedure BoolExp()
  ; boolexp    ::= <relexp> { <boolop> <relexp> }
  Protected Operation.s
  RelExp()
  While Look = 'a' Or Look = 'o' Or Look = 'x'
    Operation = GetName()
    Emit("push eax")
    RelExp()
    Emit("pop  ebp")
      ToBool("eax")
      ToBool("ebp")
      Select Operation
        Case "and": Emit("and  eax, ebp")
        Case "or":  Emit("or   eax, ebp")
        Case "xor": Emit("xor  eax, ebp")
        Default
          Expected("Operator, got " + Operation)
      EndSelect
      Emit("mov  eax, 0")
      Emit("setnz al")
  Wend
EndProcedure

Cool. We added two new operators simply by moving four lines and adding two lines. (Ok, so I cheated by modifying the line with While...)

To avoid confusion I will must precise that these operators associate to the left, that is:
1 and 2 or 3
is the same as
(1 and 2) or 3
or verbally, the output of (1 and 2) becomes the input of or 3. Remember that the output of (1 and 2) is a number that is either 0 (false) or 1 (true).

This is the same for all operators we have implemented, however, this order is overridden by the precedence rules. So:
1+2+3 equals
(1+2)+3, however
1+2*3 equals
1+(2*3)

All normal operators are left associative, except one: ^ (raise to the power of). The power operator is right associative. Often this is implemented a normal function instead of an operator, this is because a lot of code is required to compute the result.

We will use the same idea for creating the relational operators as we used for above: the comparison will return either 0 or 1, depending on the input. To keep the code clean, I want to put the code generation for these operators in a separate procedure, RelOp(). This is because the code for recognizing the multi-character relops can get a bit long-winded. Here is RelExp():

Procedure RelExp()
  ; relexp     ::= <addexp> { <boolop> <addexp> }
  Protected Operation.s
  AddExp()
  While Look = '<' Or Look = '>' Or Look = '='
    Operation = Chr(Look)
    GetLook()
    Select Look
      Case '<', '>', '='
        Operation + Chr(Look)
        GetLook()
    EndSelect
    EatWhite()
    Emit("push eax")
    AddExp()
    Emit("pop  ebp")
    RelOp(Operation) ; Generate code
  Wend
EndProcedure

Here is the code for RelOp(), it's a standard Select statement to generate the correct conditional:

Procedure RelOp(Op.s)
  Emit("cmp  ebp, eax")
  Emit("mov  eax, 0")
  Select Op
    Case "=":   Emit("sete al")   ; set if equal
    Case "<":   Emit("setl al")   ; set if less 
    Case "<=":  Emit("setle al")  ; less or equal
    Case ">":   Emit("setg al")   ; set if greater
    Case ">=":  Emit("setge al")  
    Case "<>":  Emit("setne al")  ; set if not equal
    Default
      Expected("Relational operator, got " + Op)
  EndSelect
EndProcedure

Notice that the algorithm is quite similar to ToBool(), except we now compare with another number instead of 0. Setne and setnz are aliases for the same instructions, by the way.

I now realize that there is a mistake in base2.pb. Simply replace the procedure Expected() with this one:

Procedure Expected(S.s)
  Error("Expected " + S)
EndProcedure

Else the error message will be nonsensical.

A minor problem with our implementation of the boolean operators is that they are so called "strict". Normally boolean operators are "short-circuit". You should look up this on wikipedia or somewhere else if you don't know what it means.

The last thing you have to do before you're finished with this part of the tutorial is to put your new code through some testing. Remember to also test invalid inputs. They should give a proper error message.

Here is the complete listing:


; boolrel.pb
IncludeFile "..\base2.pb"

Procedure ToBool(reg.s)
  Emit("cmp  " + reg + ", 0")
  Emit("mov  ecx, 0")
  Emit("setnz cl")
  Emit("mov  " + reg + ", ecx")
EndProcedure

Procedure RelOp(Op.s)
  Emit("cmp  ebp, eax")
  Emit("mov  eax, 0")
  Select Op
    Case "=":   Emit("sete al")   ; set if equal
    Case "<":   Emit("setl al")   ; set if less 
    Case "<=":  Emit("setle al")  ; less or equal
    Case ">":   Emit("setg al")   ; set if greater
    Case ">=":  Emit("setge al")  
    Case "<>":  Emit("setne al")  ; set if not equal
    Default
      Expected("Relational operator, got " + Op)
  EndSelect
EndProcedure

Procedure Value()
  ; value  ::= <integer>
  Emit("mov  eax, " + GetInteger())
EndProcedure

Procedure MulExp()
  ; mulexp     ::= <value> {<mulop> <value>}
  Protected Operation.c
  Value()
  While Look = '*' Or Look = '/' Or Look = '%'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    Value()
    Emit("pop  ebp")
    Select Operation
      Case '*': Emit("imul eax, ebp")
      Case '/', '%'
                Emit("xchg eax, ebp")
                Emit("cdq") ; expands eax to edx:eax
                Emit("idiv ebp")
    EndSelect
    If Operation = '%'
      Emit("mov  eax, edx")
    EndIf
  Wend
EndProcedure

Procedure AddExp()
  ; addexp     ::= <mulexp> {<addop> <mulexp> }
  Protected Operation.c
  MulExp()
  While Look = '+' Or Look = '-'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    MulExp()
    Emit("pop  ebp")
    Select Operation
      Case '+': Emit("add  eax, ebp")
      Case '-': Emit("sub  eax, ebp")
                Emit("neg  eax")
    EndSelect
  Wend
EndProcedure

Procedure RelExp()
  ; relexp     ::= <addexp> { <boolop> <addexp> }
  Protected Operation.s
  AddExp()
  While Look = '<' Or Look = '>' Or Look = '='
    Operation = Chr(Look)
    GetLook()
    Select Look
      Case '<', '>', '='
        Operation + Chr(Look)
        GetLook()
    EndSelect
    EatWhite()
    Emit("push eax")
    AddExp()
    Emit("pop  ebp")
    RelOp(Operation) ; Generate code
  Wend
EndProcedure

Procedure BoolExp()
  ; boolexp    ::= <relexp> { <boolop> <relexp> }
  Protected Operation.s
  RelExp()
  While Look = 'a' Or Look = 'o' Or Look = 'x'
    Operation = GetName()
    Emit("push eax")
    RelExp()
    Emit("pop  ebp")
      ToBool("eax")
      ToBool("ebp")
      Select Operation
        Case "and": Emit("and  eax, ebp")
        Case "or":  Emit("or   eax, ebp")
        Case "xor": Emit("xor  eax, ebp")
        Default
          Expected("Operator, got " + Operation)
      EndSelect
      Emit("mov  eax, 0")
      Emit("setnz al")
  Wend
EndProcedure

Procedure Expression()
  ; expression ::= <BoolExp>
  BoolExp()
  
  If Look <> 0
    Expected("End of line")
  EndIf
EndProcedure

Init()
Expression()
Input()

30 August 2009

7. Subtraction and Division

These are easy. The instruction to subtract on x86 is sub. It is used like the add instruction. The only difference is that with subtraction, the order of the operands matters.
10-3 <> 3-10

The x86 instruction idiv is rather clunky. The format of a division is dividend/divisor. idiv requires that the dividend is twice as large as the division. So if the divisor is a long, then the dividend must be a quad. And also, the dividend must reside in eax. Now eax doesn't actually hold a quad, so in this case, idiv requires that one part of the quad is in edx and the other part is in eax!
The result is always placed in eax. Additionally, the remainder is placed in edx.
Also we have to be sure to get the order of the operands correct here as well, since 10/2 is not the same as 2/10.

Let's make a new grammar with the new operators:

mulop ::= * | /
addop ::= + | -
mulexp ::= <value> {<mulop> <value>}
addexp ::= <mulexp> {<addop> <mulexp> }
expression ::= <addexp>

As you can see, + and - have the same precedence, so does * and /.

Make a copy of mul.pb (call it subdiv.pb). Adding subtraction to the code is done like this. I also replaced all occurences of ecx with ebp to make the generated assembly code easier to read. Ecx looks too much like eax. Contrary to what some people believe, ebp is not some kind of special register. It is a general-purpose register just like ecx.

Procedure AddExp()
  ; addexp     ::= <mulexp> {<addop> <mulexp> }
  Protected Operation.c
  MulExp()
  While Look = '+' Or Look = '-'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    MulExp()
    Emit("pop  ebp")
    Select Operation
      Case '+': Emit("add  eax, ebp")
      Case '-': Emit("sub  eax, ebp")
                Emit("neg  eax")
    EndSelect
  Wend
EndProcedure

We store the operator in a variable to preserve it across the calls to GetLookWhite() and Mulexp(). Then we negate the result after subtracting to get the correct sign. We have to do this because we are subtracting ebp from eax instead of the other way around (which is what we really want).
We could replace

sub  eax, ebp
neg  eax

with

sub  ebp, eax
mov  eax, ebp

but simply negating the result is faster.

Try the code now. Mix addition, subtraction and multiplication, and test the resulting code to see if it's correct.

Now for division. Here is the code:

Procedure MulExp()
  ; mulexp     ::= <value> {<mulop> <value>}
  Protected Operation.c
  Value()
  While Look = '*' Or Look = '/'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    Value()
    Emit("pop  ebp")
    Select Operation
      Case '*': Emit("imul eax, ebp")
      Case '/': Emit("xchg eax, ebp")
                Emit("cdq") ; expands eax to edx:eax
                Emit("idiv ebp")
    EndSelect
  Wend
EndProcedure

With division, we also get the operands in the reverse order. However, we can't just negate the result. While -(10-2) equals 2-10, -(10/2) does not equal 2/10. So we have to exchange the operands before doing the division. We do that with the xchg instruction.
I should mention here that you must never use the xchg instruction with a memory operand (unless you are coding a mutex). This is because xchg with memory "asserts a bus lock". This just means it's very slow. So multiple mov instructions are faster. But with registers only, it's fine.
The cdq (convert dword to quad) instruction makes a quad out of the value stored in eax. Or technically, it sign extends eax to edx:eax. Which becomes the input of the idiv instruction.
If you ever find yourself doing idiv edx, you're doing it wrong. Because edx:eax is already the input, so you'll get a nonsense result.
Finally, idiv ebp divides edx:eax by ebp and leaves the result in eax. Also, it leaves the remainder of the division in edx. (Example: the remainder of 3/2 is 1. And of 5/2, also 1.)

The operator for getting the remainder is called modulo, and is usually % or mod. PB uses %, so we'll do that, too.

Here is the full code, including addition, subtraction, multiplication, division and modulo:

; subdiv.pb
IncludeFile "..\base2.pb"

Procedure Value()
  ; value  ::= <integer>
  Emit("mov  eax, " + GetInteger())
EndProcedure

Procedure MulExp()
  ; mulexp     ::= <value> {<mulop> <value>}
  Protected Operation.c
  Value()
  While Look = '*' Or Look = '/' Or Look = '%'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    Value()
    Emit("pop  ebp")
    Select Operation
      Case '*': Emit("imul eax, ebp")
      Case '/', '%'
                Emit("xchg eax, ebp")
                Emit("cdq") ; expands eax to edx:eax
                Emit("idiv ebp")
    EndSelect
    If Operation = '%'
      Emit("mov  eax, edx")
    EndIf
  Wend
EndProcedure

Procedure AddExp()
  ; addexp     ::= <mulexp> {<addop> <mulexp> }
  Protected Operation.c
  MulExp()
  While Look = '+' Or Look = '-'
    Operation = Look
    GetLookWhite()
    Emit("push eax")
    MulExp()
    Emit("pop  ebp")
    Select Operation
      Case '+': Emit("add  eax, ebp")
      Case '-': Emit("sub  eax, ebp")
                Emit("neg  eax")
    EndSelect
  Wend
EndProcedure

Procedure Expression()
  ; expression ::= <AddExp>
  AddExp()
  
  If Look <> 0
    Expected("End of line")
  EndIf
EndProcedure

Init()
Expression()
Input()

6. Multiplication

With the ease with which we eased us through addition, one could be lead to believe that multiplication was equally easy. After all, multiplication is almost the same as addition, a number on each side of an operator (4*5) and a cpu instruction to calculate the result. But not so fast. The difficulty arises when multiplication and addition are combined.
Compute 5+5*3. No, it's not 30, it's 20. That's the difficult part: multiplication must be done before addition. A general term for this is operator precedence. Multiplication has a higher precedence than addition, so it must be computed first. Thus, 5+5*3 is equal to 5+(5*3) rather than (5+5)*3.

Let's try to write a BNF for this, so that we have some guidance when implementing it. Here is our old BNF (for addition only)

addexp     ::= <value> {+ <value>}
expression ::= <addexp>

Here it is with support for multiplication:

mulexp     ::= <value> {* <value>}
addexp     ::= <mulexp> {+ <mulexp> }
expression ::= <addexp>

This may be seem strange. Why does it work? Let's look at 1+5*3. First expression calls addexp. Then addexp calls mulexp. And mulexp calls value, which loads 1 into eax. Control is returned to mulexp, which sees that Look is not *. Control is returned to addexp, which sees the + sign and pushes eax.
Everything is working normally so far (except that we did an unnecessary roundtrip through mulexp). Now comes the clever part. We are still in addexp, and 1 is pushed on the stack. Instead of calling value directly to get the next number, addexp calls mulexp, which in turn calls value to get the 5 into eax.
Now mulexp sees a *, so it pushes the 5 onto the stack, calls value again to get 3 in eax, and multiplies eax with the 5. So we now have the result of 3*5 in eax and 1 on the stack. (mulexp popped the 5 when multiplying.)
Finally we return addexp again, which adds the 1 from the stack to eax (15), to get the final result, 16.

The bottom line is: we multiplied before adding. You should read the description above again while looking at the actual code. If you need to, use the debugger to single-step through the code with the description above as an assistance to see what's going on.

The actual code is straightforward. AddExp() is modified to call MulExp() instead of Value().
MulExp() is written like the old AddExp(), but checks for the multiplication operator (*) instead of the addition operator (+) and, of course, issues an imul cpu instruction instead of an add instruction.

I recommend that you look here: http://flatassembler.net/docs.php?article=manual to get a description of all x86 assembly instructions. You can also download the same file as a pdf file.

In case you were wondering why we are using imul instead of just mul, it's because imul is for signed multiplication while mul is for unsigned for multiplication.

; mul.pb
IncludeFile "..\base2.pb"

Procedure Value()
  ; value  ::= <integer>
  Emit("mov  eax, " + GetInteger())
EndProcedure

Procedure MulExp()
  ; mulexp     ::= <value> {* <value>}
  Value()
  While Look = '*'
    GetLookWhite()
    Emit("push eax")
    Value()
    Emit("pop  ecx")
    Emit("imul eax, ecx")
  Wend
EndProcedure

Procedure AddExp()
  ; addexp     ::= <mulexp> {+ <mulexp> }
  MulExp()
  While Look = '+'
    GetLookWhite()
    Emit("push eax")
    MulExp()
    Emit("pop  ecx")
    Emit("add  eax, ecx")
  Wend
EndProcedure

Procedure Expression()
  ; expression ::= <AddExp>
  AddExp()
  
  If Look <> 0
    Expected("End of line")
  EndIf
EndProcedure

Init()
Expression()
Input()

Try to run it. Test it with some expressions and see if the generated code is correct (see the end of the tutorial "addition" for a how-to). Run it with single-stepping (see the PB help file about the debugger) to understand what's really going on in there.

If the program above works, and you understand why/how it works, then you have just completed the hardest part of writing a compiler.;
Once you grasp BNF and parsing you can do almost anything you want to (except generate efficient code, which is a problem all by itself). Mind you, there's a lot of work left, but when you understand this program you have passed a hurdle that enables you to learn the more advanced stuff. It's like creating a new space in your mind. Sort of being born. It doesn't automatically make you a great programmer, but it's the biggest hurdle.

Oups. I likened mastering multiplication to being born. What an unsuitable pun...

I repeat: You have just completed the hardest part of writing a compiler.

5. Base 2

If you think I'm talking about baseball here, you're way off base. (That was a baseball expression. Not the kind of expressions we are working with.)
What we need is a new base.pb with more functionality. Make a copy of base.pb and name it base2.pb.

We want to add some new stuff. Here is a list:
- Move IsDigit() and GetDigit() into base2.pb
- Add support for integers consisting of multiple digits (like 128)
- Add support for names (IsAlpha(), IsAlphaNumeric() and GetName())
- Add support for whitespace

You should try to write these procedures yourself. But not without some guidance. Here's what you do:

- Copy IsDigit() from add2.pb into base2.pb

- Write a new procedure IsWhite(). BNF for whitespace:

white ::= #SPACE | #TAB

Return 1 if yes, 0 else.

- Write a procedure EatWhite() which calls GetLook() while Look is whitespace.

- Write GetLookWhite() which calls GetLook() and then EatWhite().

- Write a new procedure: GetInteger() to replace GetDigit(). Here is the BNF grammar for what input it should accept:

integer ::= <digit> {<digit>}

The integer should be returned as a string. If you're into error handling, give an error if the number is outside the range of a 32-bit signed integer. Be sure to give an error if Look is not a digit when the procedure is called (because this is REQUIRED, see the BNF).
Since we want support for whitespace after numbers (so you can type 1+ 2 instead of 1+2) call EatWhite() at the end.

- Write a new procedure IsAlpha(C.c) that returns 1 if the parameter is an alphabetic character. Here is the BNF:

alpha ::= A..Z | _ | $

As you can see, I included underline and $. You may or may not want to do that.

- Write a new procedure IsAlphaNumeric(). Use IsAlpha() and IsDigit() to prevent doing the same work twice.

- Write a new procedure GetName() that returns a name as a string. If you ever heard the word "identifier", this is the same, only a much longer word. Grammar:

name ::= <alpha> {<digit> | <alpha>}

Notice that a name must start with an alpha (defined above), but subsequent characters may be alphanumeric.
Call EatWhite() at the end to allow for program whitespace.

Test all your procedures to make sure they do what you expect.

Here's my new base2.pb. You should compare your procedures with mine to be sure they work the same:

Global Look.c
Global Instream.s

Procedure GetLook()
  Look = Asc(Instream)
  Instream = Mid(Instream, 2)
EndProcedure

Procedure Error(E.s)
  PrintN("Error: " + E)
  Input()
  End
EndProcedure

Procedure Expected(S.s)
  Error(S + " expected.")
EndProcedure

Procedure Emit(S.s)
  PrintN(#TAB$ + S)
EndProcedure

Procedure Init()
  OpenConsole()
  Instream = Input()
  GetLook()
EndProcedure

;- Additions:

Procedure IsDigit(C.c)
  If C >= '0' And C <= '9'
    ProcedureReturn 1
  EndIf
EndProcedure

Procedure IsAlpha(C.c)
  Select C
    Case 'a' To 'z', 'A' To 'Z', '_', '$'
      ProcedureReturn 1
  EndSelect
EndProcedure

Procedure IsAlphaNumeric(C.c)
  If IsAlpha(C) Or IsDigit(C)
    ProcedureReturn 1
  EndIf
EndProcedure

Procedure IsWhite(C.c)
  If C = ' ' Or C = #TAB
    ProcedureReturn 1
  EndIf
EndProcedure

Procedure EatWhite()
  While IsWhite(Look)
    GetLook()
  Wend
EndProcedure

Procedure.s GetInteger()
  Protected Name.s
  While IsDigit(Look)
    Name + Chr(Look)
    GetLook()
  Wend
  If Not Name
    Expected("Integer")
  EndIf
  EatWhite()
  ProcedureReturn Name
EndProcedure

Procedure.s GetName()
  Protected Name.s
  While IsAlphaNumeric(Look)
    Name + Chr(Look)
    GetLook()
  Wend
  If Not Name
    Expected("Name")
  EndIf
  EatWhite()
  ProcedureReturn Name
EndProcedure

Procedure GetLookWhite()
  GetLook()
  EatWhite()
EndProcedure

With so much new features it would be sad not to use them right away. So make a copy of add2.pb (call it add2_base2.pb).
You only need to make three changes to it. The first is to include base2.pb instead of base.pb. Then change GetDigit() into GetInteger() and GetLook() in AddExp() into GetLookWhite().
The result: we now handle any integers, including arbitary whitespace. Try, for example,

"128 +    56+8"

Here it is:

; add2_base2.pb
IncludeFile "..\base2.pb" ; Changed from base.pb
Init()

Procedure Value()
  ; value  ::= <integer>
  Emit("mov  eax, " + GetInteger()) ; Changed from GetDigit()
EndProcedure

Procedure AddExp()
  ; addexp  ::= <value> {+ <value>}
  Value()
  While Look = '+'
    GetLookWhite() ; Changed from GetLook()
    Emit("push eax")
    Value()
    Emit("pop  ecx")
    Emit("add  eax, ecx")
  Wend
EndProcedure

Procedure Expression()
  ; expression  ::= <addexp>
  AddExp()
  
  If Look <> 0
    Expected("End of line")
  EndIf
EndProcedure

; Parse an expression
Expression()
Input()

4. Addition

We can now have an expression that consists of a single number. If you don't remember, our definition of an expression was

expression ::= <digit>

The next step is to add a + operator. We may write a BNF like this:

digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
expression ::= <digit> + <digit>

Unfortunately, this means that an expression can't be just one digit any more. It must always be a digit plus another digit. So let's use this BNF instead, it allows a digit and optionally, a plus sign and another digit. The [] symbol means whatever is inside the brackets is optional.

digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
expression ::= <digit> [+ <digit>]

Let's make a copy of the file from the previous chapter (name it add.pb) and extend it to accept addition. To do that, all we need to change is the procedure Expression(), since we only changed that rule in our grammar.

Procedure Expression()
  ; Here's the BNF to look at while writing the procedure:
  ; expression ::= <digit> [+ <digit>]
  Emit("mov  eax, " + GetDigit())
  
  If Look = '+' ; remember, the + is optional
    GetLook() ; Throw away the + and get the next character
    Emit("add  eax, " + GetDigit())
  EndIf
  
  If Look <> 0 ; The grammar does not allow anything after the digit
    Expected("End of line") ; so if there is anything, put an error
  EndIf
EndProcedure

Try it. This program should accept inputs simiar to this:
1
7+3
And refuse all other inputs, like this:
7+3+5
Var+4

You probably don't want to be limited to expression with just two terms, so let's find a way to extend our parser to accept any number of additions in the same expression.

expression ::= <digit> {+ <digit>}

The { and } means whatever is inside is optional and could be repeated any number of times. Thus, legal expressions will now be:
1
1+2
1+2+3+4+5+6+7+8+1+2+6+7
This is exactly what we want.

Before we go overboard and implement this, I'd like to make a few changes to make the program more modular:

value  ::= <digit>
addexp  ::= <value> {+ <value>}
expression ::= <addexp>

Later on, when we need other values than digits, like variables, we don't need to change addexp or expression, only value.

Here is the program:

; add2.pb
IncludeFile "..\base.pb"
Init()

Procedure IsDigit(C.c)
  If C >= '0' And C <= '9'
    ProcedureReturn 1
  EndIf
EndProcedure

Procedure.s GetDigit()
  If IsDigit(Look)
    Temp.s = Chr(Look)
    GetLook()
    ProcedureReturn Temp
  Else
    Expected("Digit")
  EndIf
EndProcedure

Procedure Value()
  ; value  ::= <digit>
  Emit("mov  eax, " + GetDigit())
EndProcedure

Procedure AddExp()
  ; addexp  ::= <value> {+ <value>}
  Value()
  While Look = '+'
    GetLook()
    Emit("push eax")
    Value()
    Emit("pop  ecx")
    Emit("add  eax, ecx")
  Wend
EndProcedure


Procedure Expression()
  ; expression  ::= <addexp>
  AddExp()
  
  If Look <> 0
    Expected("End of line")
  EndIf
EndProcedure

; Parse an expression
Expression()
Input()

As you can see, Value() always moves the value into eax. Then AddExpr() needs to save that value on the stack so that it doesn't get overwritten by the next call to Value(). So our generated code for the expression "1+2" just got worse by a huge amount. Previously just two instructions, now it's five!
You just have to get used to this. It is possible to generate better code, but it is harder. For beginners, this will do. Remember, it's probably way faster than a normal interpreter would be.
Try the program with a few expressions and see if it generates valid code. That is the most important of all. If the generated code doesn’t work correctly, it doesn't matter it it's really, really fast.

To check that a generated code produces the correct result, you can copy and paste it into purebasic, enable inline assembly, and run it with some code to display the result. Here is an example:

; Show the correct result
; (hopefully PB gets it right!)
Debug 1+2+3+4

; Generated code:
MOV  eax, 1
PUSH eax
MOV  eax, 2
POP  ecx
ADD  eax, ecx
PUSH eax
MOV  eax, 3
POP  ecx
ADD  eax, ecx
PUSH eax
MOV  eax, 4
POP  ecx
ADD  eax, ecx

; Get result
Show.l
MOV  [v_Show], eax
Debug Show

If you have in front of you a program that produces valid x86 code from the entered expressions: congratulations! If not, then be ashamed, you brickhead! (Or rather, I didn't explain everything properly. Please post a comment to get help.)

29 August 2009

3. Expressions: a digit

Most programming languages have a concept of expressions. For our purpose, an expression is something which can be assigned to a variable.
Some example expressions:


(R*Alpha)/255) + ((R2*(255-Alpha)) / 255
0
SingleVariable
5+IWantCake
17*85 - 5/2

To write a compiler for a procedural language you need to be able to compile an expression. Since they are so important, we will start by dealing with them.

The first thing we need to do is to extract "meaning" from the input characters. The input characters in basetest.pb was all treated in the same way (printed to output). When compiling an expression, this is not so. The different characters have different meaning. For example, + means to add something, while 1, 45 and 153.56 are numbers.

"Making sense of" or "extracting meaning out of" a long list of characters is called parsing. We want to parse an expression. So how do we do that? We start with the simplest case possible and build to it.

The simplest expression is a single digit. To recognize a single digit in the input stream of characters we should have a procedure to check if a character is a digit or if it is something else entirely. One could write it like this:

Procedure IsDigit(C.c)
  If C >= '0' And C <= '9'
    ProcedureReturn 1
  Else
    ProcedureReturn 0
  EndIf
EndProcedure

If C is a digit, it returns 1, else 0.

Or less verbosely, like this:

Procedure IsDigit(C.c)
  If C >= '0' And C <= '9'
    ProcedureReturn 1
  EndIf
EndProcedure

(PureBasic procedures returns 0 when they reach the end.)

Or even like this:

Procedure IsDigit(C.c)
  Select C
    Case '0' To '9'
      ProcedureReturn 1
  EndSelect
EndProcedure

The important part is: does it work? We must always check that!

Debug "Not digits:"
Debug IsDigit('a')
Debug IsDigit('b')
Debug IsDigit('C')
Debug IsDigit('?')
Debug IsDigit('!')
Debug "Digits:"
Debug IsDigit('0')
Debug IsDigit('4')
Debug IsDigit('9')

Recognizing things like numbers without making sense of them (we know it's a number, but we don't know if it's part of an assignment or part of an array declaration) is called lexing or scanning, depending on who you ask.

So, there was three different procedures for recognizing a digit, yet all of them did the same. With another programming language, like C, the procedures would look even more different. So is there a simple way of specifying what we want (a digit) in an implementation-independent way? Yes: BNF, or Backus-Naur Form.

Instead of defining what is a digit by writing a procedure to recognize it, we can use a statement in BNF to say what is a digit:

digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This reads "digit has the form 0 or 1 or 2…" and so on. As you can see, this is shorter and clearer than either of the procedures. That is why we use BNF: it allows us to precisely specify what we want in a concise and implementation-independent manner. No matter which of the above procedures you use, it will use the definition that a digit is a character between 1 and 9.

Now that we have a way to recognize digits, let's write a program that requires that the first character of the input is a digit.

; new folder\readnum1.pb
IncludeFile "..\base.pb"
Init()

Procedure IsDigit(C.c)
  If C >= '0' And C <= '9'
    ProcedureReturn 1
  EndIf
EndProcedure

Procedure Digit()
  If IsDigit(Look)
    Emit("found digit!")
  Else
    Expected("digit")
  EndIf
EndProcedure

Digit()
Input() ; Wait for enter so we can see the output

Try it with different input and see if it reacts correctly. If not, then go back and check what is wrong.

It's nice, but it only tells us we got a digit, not which one! Not very useful, actually. Let's write a procedure that actually gives us the digit.

Procedure.s GetDigit()
  If IsDigit(Look)
    Temp.s = Chr(Look)
    GetLook()
    ProcedureReturn Temp
  Else
    Expected("Digit")
  EndIf
EndProcedure

As you can see, we have error handling here. Catching errors early on will simplify everything.

Remember why we wanted to parse a digit? It was because we really wanted to parse an expression. Let's write the grammar (in BNF) for an expression which can only consist of one thing: a single digit.

digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
expression ::= <digit>

is inside clips to signify that we reference the rule digit (0 | 1 | 2 … 8 | 9) rather than wanting the actual characters "digit". An important side effect of this way of defining an expression is that it can never empty, nor can it be multiple digits. It can only be exactly one digit.

Now we are ready to write a program which implements the grammar from above:

IncludeFile "..\base.pb"
Init()

Procedure IsDigit(C.c)
  If C >= '0' And C <= '9'
    ProcedureReturn 1
  EndIf
EndProcedure

Procedure.s GetDigit()
  If IsDigit(Look)
    Temp.s = Chr(Look)
    GetLook()
    ProcedureReturn Temp
  Else
    Expected("Digit")
  EndIf
EndProcedure

Procedure Expression()
  Emit("mov  eax, " + GetDigit())
  If Look <> 0 ; The grammar does not allow anything after the digit
    Expected("End of line") ; so if there is anything, put an error
  EndIf
EndProcedure

; Parse an expression
Expression()
Input()

Try it with various inputs and see how it:
- Refuses the empty input
- Refuses any letter or special symbol as input
- Gives an error when the input contains more than one digit
- Accepts the input without error ONLY when there is EXACTLY one digit
This is exactly as specified in the BNF grammar. I'll repeat it here so you can see:

digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
expression ::= <digit>

Also, this programs generates actual, valid assembly code (for 32-bit x86). That is just so wow!

2. Testing base.pb

Let's see what base.pb can do for us. Create a new folder ("testbase") and make a new purebasic source file in it. Then write this into the .pb file:

IncludeFile "..\base.pb" ; Include the base file from the parent folder

Init() ; Init input by reading a line from the console
Global Buffer.s ; A temporary string used by this example

While Look ; Look is the current character we are working with
  ; Run this loop until Look becomes 0
  Buffer + Chr(Look) ; Put Look (converted to a string) at the end of our string
  GetLook() ; Discard the current value in Look and read in a new character
Wend

; Now our Buffer will be filled with all the characters read into Look

Emit(Buffer) ; Let's show it just to be sure

Input() ; Wait for enter so we can the output

When you run it, you should see a console window. When you type in some text and hit enter, what you should typed should come back at you, only indented. Read this program (and if necessary, base.pb) thoroughly to understand what's going on.

1. Base

Some people wanted to write a compiler in PureBasic, but didn't know how to. So hence I started this tutorial to help people get started. Once you grasp the basics you can expand on your own.

This tutorial is in many ways similar to "Let's write a compiler" by Jack Crenshaw, because that's the tutorial I read to learn this. Now come sue me.

A compiler reads input text (called the source language) and turns that into assembly language (called the target language). Since we need to experiment a lot to understand what's going it would be wise to have a set of common input and output routines. I already made some for you. Unlike a real compiler, which will read from a file and write to another file, these procedures read from the console and write to the console. This makes them suitable for experimenting. Later we will see that retargeting the procedures to read and write files will be simple.

Save this as base.pb. Then see if it compiles and runs (it should do nothing).

Global Look.c
Global Instream.s

Procedure GetLook()
  Look = Asc(Instream)
  Instream = Mid(Instream, 2)
EndProcedure

Procedure Error(E.s)
  PrintN("Error: " + E)
  Input()
  End
EndProcedure

Procedure Expected(S.s)
  Error(S + " expected.")
EndProcedure

Procedure Emit(S.s)
  PrintN(#TAB$ + S)
EndProcedure

Procedure Init()
  Instream = Input()
  GetLook()
EndProcedure