29 August 2009

3. Expressions: a digit

Most programming languages have a concept of expressions. For our purpose, an expression is something which can be assigned to a variable.
Some example expressions:

(R*Alpha)/255) + ((R2*(255-Alpha)) / 255
0
SingleVariable
5+IWantCake
17*85 - 5/2


To write a compiler for a procedural language you need to be able to compile an expression. Since they are so important, we will start by dealing with them.

The first thing we need to do is to extract "meaning" from the input characters. The input characters in basetest.pb was all treated in the same way (printed to output). When compiling an expression, this is not so. The different characters have different meaning. For example, + means to add something, while 1, 45 and 153.56 are numbers.

"Making sense of" or "extracting meaning out of" a long list of characters is called parsing. We want to parse an expression. So how do we do that? We start with the simplest case possible and build to it.

The simplest expression is a single digit. To recognize a single digit in the input stream of characters we should have a procedure to check if a character is a digit or if it is something else entirely. One could write it like this:
Procedure IsDigit(C.c)
If C >= '0' And C <= '9'
ProcedureReturn 1
Else
ProcedureReturn 0
EndIf
EndProcedure


If C is a digit, it returns 1, else 0.

Or less verbosely, like this:
Procedure IsDigit(C.c)
If C >= '0' And C <= '9'
ProcedureReturn 1
EndIf
EndProcedure

(PureBasic procedures returns 0 when they reach the end.)

Or even like this:
Procedure IsDigit(C.c)
Select C
Case '0' To '9'
ProcedureReturn 1
EndSelect
EndProcedure

The important part is: does it work? We must always check that!
Debug "Not digits:"
Debug IsDigit('a')
Debug IsDigit('b')
Debug IsDigit('C')
Debug IsDigit('?')
Debug IsDigit('!')
Debug "Digits:"
Debug IsDigit('0')
Debug IsDigit('4')
Debug IsDigit('9')


Recognizing things like numbers without making sense of them (we know it's a number, but we don't know if it's part of an assignment or part of an array declaration) is called lexing or scanning, depending on who you ask.

So, there was three different procedures for recognizing a digit, yet all of them did the same. With another programming language, like C, the procedures would look even more different. So is there a simple way of specifying what we want (a digit) in an implementation-independent way? Yes: BNF, or Backus-Naur Form.

Instead of defining what is a digit by writing a procedure to recognize it, we can use a statement in BNF to say what is a digit:

digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9


This reads "digit has the form 0 or 1 or 2…" and so on. As you can see, this is shorter and clearer than either of the procedures. That is why we use BNF: it allows us to precisely specify what we want in a concise and implementation-independent manner. No matter which of the above procedures you use, it will use the definition that a digit is a character between 1 and 9.

Now that we have a way to recognize digits, let's write a program that requires that the first character of the input is a digit.

; new folder\readnum1.pb
IncludeFile "..\base.pb"
Init()

Procedure IsDigit(C.c)
If C >= '0' And C <= '9'
ProcedureReturn 1
EndIf
EndProcedure

Procedure Digit()
If IsDigit(Look)
Emit("found digit!")
Else
Expected("digit")
EndIf
EndProcedure

Digit()
Input() ; Wait for enter so we can see the output


Try it with different input and see if it reacts correctly. If not, then go back and check what is wrong.

It's nice, but it only tells us we got a digit, not which one! Not very useful, actually. Let's write a procedure that actually gives us the digit.

Procedure.s GetDigit()
If IsDigit(Look)
Temp.s = Chr(Look)
GetLook()
ProcedureReturn Temp
Else
Expected("Digit")
EndIf
EndProcedure


As you can see, we have error handling here. Catching errors early on will simplify everything.

Remember why we wanted to parse a digit? It was because we really wanted to parse an expression. Let's write the grammar (in BNF) for an expression which can only consist of one thing: a single digit.

digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
expression ::= <digit>


is inside clips to signify that we reference the rule digit (0 | 1 | 2 … 8 | 9) rather than wanting the actual characters "digit". An important side effect of this way of defining an expression is that it can never empty, nor can it be multiple digits. It can only be exactly one digit.

Now we are ready to write a program which implements the grammar from above:

IncludeFile "..\base.pb"
Init()

Procedure IsDigit(C.c)
If C >= '0' And C <= '9'
ProcedureReturn 1
EndIf
EndProcedure

Procedure.s GetDigit()
If IsDigit(Look)
Temp.s = Chr(Look)
GetLook()
ProcedureReturn Temp
Else
Expected("Digit")
EndIf
EndProcedure

Procedure Expression()
Emit("mov eax, " + GetDigit())
If Look <> 0 ; The grammar does not allow anything after the digit
Expected("End of line") ; so if there is anything, put an error
EndIf
EndProcedure

; Parse an expression
Expression()
Input()


Try it with various inputs and see how it:
- Refuses the empty input
- Refuses any letter or special symbol as input
- Gives an error when the input contains more than one digit
- Accepts the input without error ONLY when there is EXACTLY one digit
This is exactly as specified in the BNF grammar. I'll repeat it here so you can see:

digit  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
expression ::= <digit>


Also, this programs generates actual, valid assembly code (for 32-bit x86). That is just so wow!

No comments: