03 October 2009

14. Symbol Table

Edit: This turned out to need some adjustments. Get the correct version here: http://pbtut.blogspot.com/2010/01/listing-symboltablepb.html

The compiler needs to have a list of all the symbols used in the program. This includes global variables, local variables, functions, and similar.

A variable inside a function may have the same name as a global variable. A part of the program where variables must be unique is called a scope. In most programming languages nesting of scopes is allowed. In PureBasic, one level of nesting is allowed: You can have local scopes (in procedures) __inside__ the global scope (outside procedures). In Pascal scopes can be arbitarily nested. First, each file ("unit") has its own scope, and this scope may contain procedures, and procedures may actually contain other, local, procedures.

To allow for this nesting, symbol tables are usually best implemented with a tree structure. For simplicity, we won't do that. We will simply shove all symbols into a hash map. To distinguish between scopes, we will put the name of the containing scope in front of the symbol name, separated by a ":". Here's an example:
Global MyGlobalVariable, RedVariable
Procedure NewScopeInHere()
RedVariable = 1 ; (global)
Protected MyLocalVariable, RedVariable
RedVariable = 2 ; (local)
EndProcedure
The symbol table will have these entries:
MyGlobalVariable
RedVariable
NewScopeInHere
NewScopeInHere:MyLocalVariable
NewScopeInHere:RedVariable

You can see that even though we have two variables named RedVariable, there is no problem, because they are in different scopes.

Each entry in our hash table will be a pointer to a structure of either the type SVariable or the type SFunction. However, to know which type we're dealing with, we need a common header, which tells us the type. The common header is the type SSymbol, which we can see here:
Structure SSymbol
Name.s
Kind.i
EndStructure

Name.s is the name without the scope (like "RedVariable" for both those variables). Kind tells us what kind of symbol this (procedure or variable). The possible values are specified in an enumeration:
Enumeration ; Symbol kind
#SKind_Variable
#SKind_Function
EndEnumeration


The following two structures extend SSymbol, and are the ones we actually allocate and use:
Structure SVariable Extends SSymbol
Type.i
RefKind.i ; Global, local, parameter
RefOffset.i
EndStructure

Structure SFunction Extends SSymbol
Type.i ; Return type
Ref.s ; Entry point
ParamCount.i
ParamTypes.i[10] ; We allow at most 10 parameters
EndStructure


Important: I have defined the word "type" to mean datatype. Everywhere where it would be natural to use the word "type" I use the word kind instead.

I won't bother explaining the symbol table too much, it's just ordinary programming. You should read the code thoroughly until you understand it. Here is symboltable.pb: http://pastebin.ca/1590601
Edit: Don't this one, go to the top of this post to find a new version!

To actually use the symboltable we need an example program. I made a new file "symsimple.pb", and here it is: http://pastebin.ca/1590600

It accepts two kind of statements: assignments and variable declarations.
assignment ::= <name> = <expression>
vardecl ::= var <name> as <typename>

The only allowed typename for now is "long". An expression consists of a single value, which can be an integer or a variable.

Changes:
- The new VarRef() does the heavy work of what VarValue() did before, however, it takes a pointer to SVariable. From the rest of the code, simply call VarValue() as before, and it will call VarRef() for you.

- VarValue() has been changed to see if the variable is declared (if it can be looked up from the symbol table, it is declared). If it's declared, it calls VarRef() to do the heavy lifting. Else it shows an error message.

- Statement() distinguishes between two statements, in case of a variable declaration it calls the procedure VariableDeclaration().

- VariableDeclaration() reads the name of the variable, followed by the keyword "as" followed by the type name. It then calls a function in the symboltable.pb to add the variable there. AddVariable() will give an error if this variable was already declared.

- GlobalAsmDefinition() takes two parameters, of which only one is useful to us. It is called once for each entry in the symbol table. If the entry represents a global variable, it emits the asm code to allocate space for this variable.


- Footer() is called after the whole program has been compiled. It calls EnumerateSymbols() with the parameter GlobalAsmDifinition() to enable allocation of space for global variables.
Before that it calls _ExitProcess, else we would start executing the data in the global variables when reaching the end of the program.

A test run of the symsimple.pb could now look like this:
var a as long
var b as long
a = 5
b = a
b = 8

--------------------------
mov eax, 5
mov [v_a], eax
mov eax, [v_a]
mov [v_b], eax
mov eax, 8
mov [v_b], eax
push 0
call _ExitProcess
; Global variables:
v_a dd 0
v_b dd 0


Notice how you get an error message when using undeclared variables, and also when declaring variables twice.