RTSync - the Compiler

The compiler works by constructing a large tree from the source code. The tree is created based on the grammar of the language, specified in the parser.yacc file and the scanner.lex file.

The scanner.lex file specifies which keywords belong to the language and what constitutes an integer, string, or float literal, or a valid variable name. The comment syntax for RTSync is also described in this file. The file is written in the input syntax of the LEX lexical analyzer generator (website). LEX takes the rules in scanner.lex (or whichever other file is specified) and generates the source code for a lexical analyzer, which is currently written to a file called lex.yy.c. The generated analyzer turns the RTSync source code into a series of tokens that are then read by the parser.
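To make the idea of "turning source code into tokens" concrete, here is a small hand-written sketch in C++. It is not the code that LEX generates and does not reflect the actual rules in scanner.lex; it only illustrates the kind of output a lexical analyzer produces. The token names ACTOR, ID, BRACE_OPEN and BRACE_CLOSE come from the grammar; INT_LITERAL, UNKNOWN, the token struct and the tokenize() function are assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Token kinds. ACTOR, ID, BRACE_OPEN and BRACE_CLOSE appear in parser.yacc;
// INT_LITERAL and UNKNOWN are hypothetical names used only in this sketch.
enum class token_kind { ACTOR, ID, INT_LITERAL, BRACE_OPEN, BRACE_CLOSE, UNKNOWN };

struct token {
    token_kind kind;
    std::string text;
};

// Splits the input into tokens, skipping whitespace between them.
std::vector<token> tokenize(const std::string& src) {
    std::vector<token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (c == '{') { out.push_back({token_kind::BRACE_OPEN, "{"}); ++i; continue; }
        if (c == '}') { out.push_back({token_kind::BRACE_CLOSE, "}"}); ++i; continue; }
        if (std::isalpha(c) || c == '_') {
            // A word is either the "actor" keyword or an identifier.
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            std::string word = src.substr(start, i - start);
            out.push_back({word == "actor" ? token_kind::ACTOR : token_kind::ID, word});
            continue;
        }
        if (std::isdigit(c)) {
            // A run of digits is an integer literal.
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({token_kind::INT_LITERAL, src.substr(start, i - start)});
            continue;
        }
        // Anything else is reported as a single unknown character.
        out.push_back({token_kind::UNKNOWN, std::string(1, src[i])});
        ++i;
    }
    return out;
}
```

Calling tokenize("actor Worker { }") on this sketch yields four tokens in order: ACTOR, ID (with text "Worker"), BRACE_OPEN and BRACE_CLOSE, which is the sequence the actor_block grammar rule expects around its declaration list.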

The parser is described in the parser.yacc file. This file is read by the YACC program (website), which generates a parser; the parser builds a tree from the tokens sent by the lexical analyzer. Each node in this tree represents a certain block of the source code, and the children of each node represent portions of that block. For example, an actor block contains all the code for an actor, and the children of the actor node are the various parts of the actor, such as the variable declarations, init block and method declarations. The syntax of the actor node is defined by the following rule:

actor_block	:  ACTOR ID BRACE_OPEN declaration_list BRACE_CLOSE
		   { $$ = new actor_node(lineno, $2, $4); }
		;

All types of blocks are declared in this way. Anything written in all capital letters is a token defined in the scanner.lex file. The actor_block rule requires the “actor” keyword first, followed by a valid identifier, then an opening curly brace ( { ). Next comes the declaration_list block, which is defined elsewhere in parser.yacc and includes the variable declarations, etc. Finally, the block is closed by a closing curly brace ( } ). In the C++ code generated by YACC, the actor_block is represented by an actor_node object. The constructor for this object takes three parameters: the line number where the actor is first declared, the identifier, and the object returned by the declaration_list block. The $$ is the return value of the actor_block. The $2 and $4 represent the second and fourth items of the syntax list, respectively (the second being the identifier and the fourth being the declaration_list). The semicolon terminates the actor_block definition. Note that in the YACC grammar, whitespace is mostly ignored (except, of course, the spaces that separate different words).
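The three-parameter constructor described above could look roughly like the following sketch. Only the name actor_node and the order of its arguments come from the grammar action; the node base class, the declaration_list_node type, the member names, and the use of std::string for the identifier are assumptions made for illustration and may differ from the real class definitions.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical base class for every tree node. Storing the line number
// here is an assumption; the text only says it is passed to the constructor.
struct node {
    explicit node(int line) : line(line) {}
    virtual ~node() = default;
    int line;
};

// Hypothetical stand-in for whatever the declaration_list rule returns.
struct declaration_list_node : node {
    using node::node;
};

// Mirrors the call `new actor_node(lineno, $2, $4)` in the grammar action:
// line number, identifier, and the declaration list.
struct actor_node : node {
    actor_node(int line, std::string id, declaration_list_node* decls)
        : node(line), id(std::move(id)), declarations(decls) {}

    std::string id;                                       // value of $2
    std::unique_ptr<declaration_list_node> declarations;  // value of $4, owned here
};
```

In this sketch the actor node takes ownership of its declaration list, so deleting the root of the tree frees every child. Whether the real nodes own their children this way is not stated in the text.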

There are several blocks that include an OR operator ( | ) in their definitions. An example of this is the module block. A module block is either an actor, a synchronizer, a global synchronizer or something else. The definition for this block looks like this:

module	:  actor_block
	   { $$ = $1; }
	|  sync_block
	   { $$ = $1; }
	|  global_sync
	   { $$ = $1; }
	|  PASS
	   { $$ = new pass_node(lineno, $1); }
	;

This rule first checks whether the code in the module block is an actor; if not, it checks whether it is a synchronizer, and then a global synchronizer. If it is none of these, the PASS token matches anything else that may appear here, and a pass_node object is created. Semantically, a pass_node contains anything that is not RTSync code, and its contents are treated as standard C++ code. When a rule has several alternatives separated by OR operators and the code in the block matches none of them, the compiler outputs a syntax error.

Once the lexical analyzer and parser have run, the code is structured as a tree. The tree is then passed to the visitors, which traverse it. Two visitors scan the entire tree: the code visitor and the scope visitor. The code visitor runs first; it traverses the tree and converts the RTSync code into C++ code, also emitting the code that handles the initialization of the postmaster, connecting to the distributed system, etc. The scope visitor runs second; it checks the code for scope violations and for type violations, such as assigning a value of an incompatible type.

The code visitor has a visit() method for each type of node generated by the parser. These methods handle the semantics of each node. Each node has an accept() method that handles its processing. Within the visit() method for a node, the child nodes can be processed by calling the accept() method of each child. The various list-type nodes (such as variable declaration lists, which hold an arbitrary number of nodes) are treated like a linked list, each with a head and a tail. The head is another list, and the tail is an element of the list. The code visitor handles the output of the C++ code, which is written to the file specified on the command line. If no output filename is specified, the default is “rt_output.cc”.
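The interplay between visit(), accept() and the head/tail list nodes can be sketched as follows. This is a minimal illustration of the visitor pattern under stated assumptions, not the compiler's actual classes: var_decl_node, var_decl_list_node, their members, and the emit() helper are hypothetical names, and the real code visitor writes to a file rather than to a string.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

struct var_decl_node;
struct var_decl_list_node;

// One visit() overload per node type, as described in the text.
struct visitor {
    virtual ~visitor() = default;
    virtual void visit(var_decl_node& n) = 0;
    virtual void visit(var_decl_list_node& n) = 0;
};

struct node {
    virtual ~node() = default;
    virtual void accept(visitor& v) = 0;
};

// Hypothetical element node: a single variable declaration.
struct var_decl_node : node {
    var_decl_node(std::string type, std::string name)
        : type(std::move(type)), name(std::move(name)) {}
    void accept(visitor& v) override { v.visit(*this); }
    std::string type, name;
};

// Hypothetical list node: the head is the preceding list (null for the
// first element) and the tail is the element added last.
struct var_decl_list_node : node {
    var_decl_list_node(var_decl_list_node* head, var_decl_node* tail)
        : head(head), tail(tail) {}
    void accept(visitor& v) override { v.visit(*this); }
    var_decl_list_node* head;  // non-owning in this sketch
    var_decl_node* tail;       // non-owning in this sketch
};

struct code_visitor : visitor {
    explicit code_visitor(std::ostream& out) : out(out) {}
    void visit(var_decl_node& n) override {
        out << n.type << ' ' << n.name << ";\n";
    }
    void visit(var_decl_list_node& n) override {
        // Process the earlier elements first, then the last one.
        if (n.head) n.head->accept(*this);
        n.tail->accept(*this);
    }
    std::ostream& out;
};

// Convenience helper for this sketch: runs the code visitor over a tree
// and returns the generated text.
std::string emit(node& root) {
    std::ostringstream out;
    code_visitor cv(out);
    root.accept(cv);
    return out.str();
}
```

Visiting the head before the tail keeps the output in source order, because a list built this way always holds its most recently added element in the tail.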

The scope visitor has the same structure as the code visitor, but checks for different things. The scope visitor has a list of symbol tables that keeps track of the variables declared in each scope. The list works as a stack, with each element being a symbol table that maps symbol names to a declarable object. When a new scope is entered, a new table is pushed onto the stack and when the scope is left, the top table is popped off the stack. If an identifier is referenced that is not declared in the RTSync scope, a warning message is displayed to the user. The compiler does not fail because it is possible that this identifier is declared outside of the RTSync scope and is a C++ global variable or function. It is not critical for the RTSync compiler to fail on this error, since the error will be caught by the C++ compiler if the identifier is not declared at all.
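The stack of symbol tables described above can be sketched like this. The class name scope_stack, its method names, and the declarable struct are assumptions made for illustration; the text only states that each table maps symbol names to a declarable object and that tables are pushed and popped as scopes are entered and left.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the "declarable object" stored per symbol.
struct declarable {
    std::string type;
};

class scope_stack {
public:
    // Entering a scope pushes a fresh, empty symbol table.
    void enter_scope() { tables_.emplace_back(); }

    // Leaving a scope pops the top table and forgets its symbols.
    void leave_scope() {
        assert(!tables_.empty() && "leave_scope without matching enter_scope");
        tables_.pop_back();
    }

    // Declares a name in the innermost scope. Returns false if the name
    // was already declared in that same scope.
    bool declare(const std::string& name, declarable d) {
        assert(!tables_.empty() && "declare outside of any scope");
        return tables_.back().emplace(name, std::move(d)).second;
    }

    // Searches from the innermost scope outwards. Returns nullptr when the
    // name is not declared in any RTSync scope; the caller would then
    // print a warning rather than fail.
    const declarable* lookup(const std::string& name) const {
        for (auto it = tables_.rbegin(); it != tables_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return &found->second;
        }
        return nullptr;
    }

private:
    std::vector<std::unordered_map<std::string, declarable>> tables_;
};
```

Searching from the top of the stack downwards means an inner declaration shadows an outer one with the same name, and the outer one becomes visible again once the inner scope is popped.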

The other responsibility of the scope visitor is to type-check expressions. Each expression has a type that is calculated depending on what is in the expression. If there is a mix of incompatible types then the compiler prints out an error and compilation fails. There is a function that compares two types