Parsing

Introduction

We'll look at each language element and put together some rules for it. In some cases, I may look at how the same construct would be implemented for other languages, showing some difficulties and how to overcome them. At least that's my plan. We'll see where I end up...

Lexical Elements of XL

Whitespace

XL declares that blank spaces, tabs, and end-of-line characters constitute whitespace. Oh, and don't forget the ever (un)popular carriage returns that DOS is so fond of. You should always handle these, especially if you will be running on UNIX and there is even a chance that someone will run a file that has been touched by DOS through your parser. Let's write regular expressions for the whitespace tokens:

#token BLANK   "\ " <<skip();>>

#token TAB     "\t" <<skip();>>

#token EVIL_CR "\r" <<skip();>>

#token NEWLINE "\n" <<skip();>>


The #token tag tells DLG that we are defining a new token. Following #token is an identifier that can be used in the ANTLR grammar to say "this type of token comes here." Then comes a regular expression for the token, in double quotes. (See the PCCTS reference for details on regular expression syntax.) Finally, you can add action code inside << and >>. This code will be executed as soon as DLG matches the regular expression.

The XL specification says "whitespace is ignored," so we put a call to skip() in the action code for these rules. Because we are ignoring whitespace, we really don't need to assign token identifiers -- the skip() statement tells DLG not to pass the token to the parser. So we can re-write these rules as

#token "\ " <<skip();>>

#token "\t" <<skip();>>

#token "\r" <<skip();>>

#token "\n" <<skip();>>


Those of you familiar with regular expressions are probably thinking, "Why does he have these as separate rules?" And you're right -- using regular expression character classes, we can re-write all of these as

#token "[\ \t\r\n]" <<skip();>>


but let's think a bit more on this -- we want to do something with the "newline" character, "\n". We want to tell DLG to increment our current line number, so error messages can be friendlier and tell the user on which line the error was found. So we add a call to newline() to the action code for "\n" and end up with

#token "[\ \t\r]" <<skip();>>

#token "\n" <<skip(); newline();>>


as our whitespace rules.


Comments

We got lucky in XL, but I'll bore you with C comments in a minute...

XL defines comments as "all text from // to the end of the current line." So, we get an expression like

#token "// ~[\n]* \n" <<skip(); newline();>>


A few things to notice about this rule: the "~[\n]*" part means "zero or more of any character except newline," which consumes the rest of the line; the expression also matches the terminating "\n" itself, so we call newline() in the action to keep the line count correct; and since the whole match is skipped, the parser never sees comments at all.

I threatened to do C-style comments, so I will.

Let's assume we are using a C compiler that does not keep track of nested comments -- "/*" starts a comment, and "*/" ends a comment. It doesn't matter if it supports the C++ style comments or not -- they'll get gobbled just as well as the rest of a comment.

We can approach this two ways: we can write one evil regular expression for the entire comment, or we can use #lexclass to define a separate state that the scanner will enter when it encounters a comment. In my humble opinion, the lexclass method is more desirable for the following reasons:

  1. It's more readable -- simpler regular expressions are always easier to figure out when the next person has to maintain your compiler or parser
  2. Comments can get big, and you can overflow your token text buffer very quickly. The lexclass method lets you watch out for this...

First, let's write a rule that finds the start of a C comment. (Note -- these rules will not go into our XL compiler, but I thought you might like to see them anyway...) This rule should appear in your main lexclass (by default, it's called START, which is what I'll use here.)

#lexclass START

#token "/ \*" <<skip(); mode(C_STYLE_COMMENT);>>


The order of the skip() and mode() calls in the action code doesn't matter. Under the covers, they just set some flags for DLG. (I recommend trying to be consistent on the order you use, just for clarity.) The regular expression for the comment starter needs to escape the "*" as it is a special character in DLG expressions...

The call to mode(C_STYLE_COMMENT) will cause DLG to look for its next match in the rules defined under #lexclass C_STYLE_COMMENT, so let's look at that definition:

#lexclass C_STYLE_COMMENT

#token "\* /" <<skip(); mode(START);>>

#token "~[]" <<skip();>>


These two rules say the following:

  1. If I see "*/", skip it (don't pass a token to the parser) and switch back to my START mode -- we are done with the comment.
  2. If I see any other character, skip it and stay in comment mode.

There is a big problem here: we ignored "\n". "But," someone may say, "we already defined a rule for it above." Sorry -- because we are in a different lexclass, DLG will not match the previous "\n" rule -- it will only match rules found in the current lexclass. So we need to modify lexclass C_STYLE_COMMENT to look like:

#lexclass C_STYLE_COMMENT

#token "\* /" <<skip(); mode(START);>>

#token "\n" <<skip(); newline();>>

#token "~[]" <<skip();>>


Let's try something else -- suppose you were writing a funky C compiler that allowed nested comments, where every "/*" must be matched by its own "*/". You need to keep track of the comment nesting level and only leave comment mode when it drops back to zero. You can do this as follows: first, define a comment_level integer variable in your parser class (we'll see how to do that later). Then, your lexical rules become:

#lexclass START

#token "/ \*" <<

                  skip(); 

                  mode(C_STYLE_COMMENT); 

                  comment_level=1;

              >>



#lexclass C_STYLE_COMMENT

#token "\* /" <<

                  skip(); 

                  comment_level--; 

                  if (comment_level == 0)

                      mode(START);

              >>

#token "/ \*" <<skip(); comment_level++;>>

#token "\n"   <<skip(); newline();>>

#token "~[]"  <<skip();>>


This can also be accomplished by using a mode stack (see the PCCTS Notes for New Users doc for information on using this); a rough sketch of the idea appears below.
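To make that concrete, here is a hedged sketch of a user-maintained mode stack. PCCTS does not provide push_mode or pop_mode -- those names are my own -- and the only DLG call used is the mode() routine we have already seen, so these helpers must live somewhere mode() is in scope (in C++ mode, you'd make them members of your lexer class):

/* A user-maintained mode stack (sketch only).  Lexclass names like
   START and C_STYLE_COMMENT are integer constants in the generated
   scanner, so a plain int array can hold them. */
#define MAX_MODE_DEPTH 32

static int mode_stack[MAX_MODE_DEPTH];
static int mode_sp = 0;

/* Enter mode `to`, remembering `from` so pop_mode() can restore it. */
void push_mode(int from, int to)
{
    if (mode_sp < MAX_MODE_DEPTH)
        mode_stack[mode_sp++] = from;
    mode(to);
}

/* Return to whatever mode we were in before the last push. */
void pop_mode(void)
{
    if (mode_sp > 0)
        mode(mode_stack[--mode_sp]);
}

With these, the "/ \*" rules would call push_mode(START, C_STYLE_COMMENT) and push_mode(C_STYLE_COMMENT, C_STYLE_COMMENT), and the "\* /" rule would just call pop_mode() -- the stack depth plays the role of comment_level.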

Simple enough, eh? The moral of this story is "use lexclasses wisely and they can help you immensely."

Note that in the above lexclasses, we should also check whether end-of-file shows up inside a comment -- an unterminated comment would otherwise silently swallow the rest of the input. A sketch of one way to catch it follows.
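Here is a hedged sketch, assuming DLG's "@" expression, which matches end-of-input; the actual error reporting is left as a comment:

#lexclass C_STYLE_COMMENT

#token Eof "@" <<
                   /* error message: end-of-file inside comment */
                   mode(START);
               >>

Switching back to START lets the normal end-of-file handling take over instead of leaving the scanner stuck in comment mode.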


Literals

In XL, there are three types of literals: Integer, Character and String.

Integer literals are simple -- they are just a series of digits:

#token INTLIT "[0-9]+"


Note that this token definition is different from our others. First, we added a token identifier, INTLIT. This will be used in our grammar to say "an integer literal comes here." Second, there is no action code -- we are not skipping it, counting a newline, or doing anything else with it. We just associate the sequence of digits with token id INTLIT. That's it. The "+" in the regular expression means "one or more -- at least one!"

Character literals are fairly simple as well:

#token CHARLIT "\' ~[] \'"


Basically, an apostrophe followed by any character, followed by another apostrophe. In the grammar, we will use CHARLIT to refer to a character literal.

String literals are a bit trickier, and are probably easiest to implement using a lexclass:

#lexclass START

#token           "\""    <<skip(); mode(STRING);>>



#lexclass STRING

#token           "\"\""  <<more(); replchar('\"');>>

#token BADSTRING "\n"    <<

                             replchar('\0'); 

                             newline(); 

                             mode(START);

                             /* error message */ 

                         >>

#token STRINGLIT "\""    <<

                             // truncate quote

                             replchar('\0');

                             mode(START);

                         >>

#token           "~[]" <<more();>>


Lots going on here, right? Not really... Let's break it down:

First, we enter a string when we see a double quote. We have a skip() in the action code so we can strip off the leading quote. The mode(STRING) switches us to STRING mode. So far, so good.

What can we see inside a string? Remember, from the spec, two double-quotes represent a literal double quote. So we match them, and replace them with (surprise) a single double-quote character. The call to more() says "DLG -- we have started building a token, but we're not done. Keep appending characters..." The call to replchar() replaces the matched text (the two double quotes) with a single double-quote.

But what if we see an end-of-line inside a string? This is illegal, and we'll assume it ends the string. So, we replace it with a null to end the string early, call newline() to count the line (remember, we need to do this no matter where we see "\n"!), switch back to START mode and write an error message. The token id BADSTRING will be passed to the parser as an incomplete string, so we can at least try to continue parsing... Whew!

What if we see a single double-quote in a string? It ends the string! So, we want to strip it, return the STRINGLIT token, and go back to our START state. We can't use skip() to drop the trailing quote, because that would stop us from returning the token to the parser. So, we replace the end quote with a null (\0) to end the string early. Then we switch back to the START state, returning a STRINGLIT token.

Finally, any other character is just appended to the STRINGLIT by issuing a more() call.

We'll address handling the BADSTRING token later...

Note that this assumes a fairly short string -- one that will fit in a token string buffer. A better approach would be to check the current length before appending characters to the string; a robust version is left as an exercise for the masochistic reader, but a sketch of the idea follows. (In general, I'm going to go light on error handling to help clarity, but I'll try to point out where it should be added in places like this.)
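As a hedged sketch, the catch-all rule in the STRING lexclass could grow a length check. Here, string_len is a counter we declare ourselves (it is not part of DLG) and reset to zero in the rule that enters STRING mode, and MAX_STRING_LEN is whatever limit keeps us safely inside the token buffer:

#lexclass STRING
#token "~[]" <<
                 if (++string_len > MAX_STRING_LEN) {
                     /* error message: string literal too long */
                 }
                 more();
             >>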


Keywords

XL reserves all keywords -- I may add later what to do with unreserved keywords (for evil languages like PL/1 -- hey! I like to program in PL/1, but I'd hate to write that compiler!)

There are several ways to implement keywords. The simplest is to just add them to the DLG specification. A more efficient method would be to use a "perfect hashing function" to differentiate between keywords and identifiers; see the PCCTS Notes for New Users for some information on how to do this. I'll stick with the simple method for now, but a sketch of the lookup idea appears below.
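For the curious, here is what keyword lookup might look like in plain C -- a sketch only, with binary search standing in for the perfect hash (a perfect hash would replace bsearch() with a single computed probe), and with made-up token numbers rather than ANTLR's real ones:

#include <stdlib.h>
#include <string.h>

/* A sorted keyword table.  keyword_lookup() and these token numbers
   are illustrative, not part of PCCTS. */
typedef struct { const char *text; int token; } Keyword;

static const Keyword keywords[] = {
    { "and", 1 }, { "array", 2 }, { "begin", 3 }, { "constant", 4 },
    { "else", 5 }, { "elsif", 6 }, { "end", 7 }, { "exit", 8 },
    /* ...and so on for the rest of the reserved words... */
};

static int kw_cmp(const void *a, const void *b)
{
    return strcmp(((const Keyword *)a)->text,
                  ((const Keyword *)b)->text);
}

/* Return the keyword's token number, or 0 for a plain identifier. */
int keyword_lookup(const char *text)
{
    Keyword key;
    const Keyword *hit;

    key.text = text;
    key.token = 0;
    hit = bsearch(&key, keywords,
                  sizeof keywords / sizeof keywords[0],
                  sizeof keywords[0], kw_cmp);
    return hit ? hit->token : 0;
}

The scanner would match everything that looks like an identifier, call keyword_lookup() on the text, and pass back the keyword's token number when it gets a hit.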

Note that you can directly specify keywords in the grammar, but if you do, you cannot use SORCERER to identify tokens. Since I'm heading that way eventually, let's spell 'em out...

Here's the full list:

#token PROGRAM   "program"

#token BEGIN     "begin"

#token END       "end"

#token VAR       "var"

#token CONSTANT  "constant"

#token TYPE      "type"

#token RECORD    "record"

#token ARRAY     "array"

#token OF        "of"

#token PROCEDURE "procedure"

#token PUT       "put"

#token GET       "get"

#token SKIPLINE  "skipLine"

#token NEWLINE   "newLine"

#token EXIT      "exit"

#token WHEN      "when"

#token RETURN    "return"

#token IF        "if"

#token THEN      "then"

#token ELSE      "else"

#token ELSIF     "elsif"

#token WHILE     "while"

#token LOOP      "loop"

#token AND       "and"

#token OR        "or"

#token INTEGER   "Integer"

#token BOOLEAN   "Boolean"


What would happen with the following token definition (not in XL, but present in Pascal)?

#token FILE "file"


The problem is that FILE is a standard declaration in stdio.h, so the generated token id collides with it -- a "re-defined" situation. I like to avoid this by adding "_" or "_kw" to the end of token ids that cause me grief, as shown below. Keep this in mind when you are writing a parser, even if you use yacc... Things got especially bad when I tried building a compiler and using Microsoft's MFC at the same time. Whew! Nasty conflicts!
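For example, a rename like this sidesteps the collision (illustrative only -- "file" is not an XL keyword, and the FILE_KW name is my own):

#token FILE_KW "file"  // no longer collides with stdio.h's FILE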

Also, I have added pre-defined things like type names (INTEGER, for example) and functions (PUT, for example) as keywords. This will make things easier for our small compiler, but it's not terribly realistic. These really should be defined in the symbol table as "predefined functions" and matched using identifier rules, but we're taking the easy way out for now.

We'll use these identifiers in the grammar where keywords would appear.


Operators

XL defines several operators, and we'll define them as well... Note that several have a "\" in them to escape DLG regular expression metacharacters. Be careful! If you don't escape something, you'll get some weird DLG errors later.

#token DOT        "."

#token BECOMES    ":="

#token COLON      ":"

#token SEMI       ";"

#token COMMA      ","

#token EQUALS     "="

#token LBRACKET   "\["

#token RBRACKET   "\]"

#token DOTDOT     ".."

#token LPAREN     "\("

#token RPAREN     "\)"

#token NOT_EQUALS "/="

#token LT         "<"

#token LTE        "<="

#token GT         ">"

#token GTE        ">="

#token PLUS       "\+"

#token MINUS      "\-"

#token TIMES      "\*"

#token DIV        "/"

#token MOD        "mod"

#token NOT        "not"


Identifiers

Finally, we talk about identifiers. Basically, an identifier is a letter followed by any sequence of letters and digits -- whatever is left after the keywords are matched. So we get a regular expression like:

#token IDENT "[a-zA-Z] [a-zA-Z0-9]*"


Very simple. Note that there is no action code here, especially not any that would perform symbol table lookup. I'm very adamant in pushing my view that scanner-to-parser communication should be a one-way street. This is especially important in multiple-lookahead parsers: if your parser decides when to push and pop scope (a likely scenario), and your scanner is reading two or more tokens ahead, you may be looking at a token in the scanner before its proper scope has been pushed or popped. In addition, things like semantic predicates let you make more pointed decisions about how to parse identifiers in the grammar, rather than changing the token type in the scanner.


Putting the Lexical Elements Together

What do we have so far? Let's look at it in one lump, shall we?

#lexclass START // Not really necessary, but

                // good commentary nonetheless



// White Space

#token "[\ \t\r]"     <<skip();>>

#token "\n"           <<skip(); newline();>>



// Comments

#token "// ~[\n]* \n" <<skip(); newline();>>



// Literals

#token INTLIT  "[0-9]+"

#token CHARLIT "\' ~[] \'"

#token         "\"" << // start STRINGLIT

                        skip();

                        mode(STRING);

                    >>



// Keywords

#token PROGRAM   "program"

#token BEGIN     "begin"

#token END       "end"

#token VAR       "var"

#token CONSTANT  "constant"

#token TYPE      "type"

#token RECORD    "record"

#token ARRAY     "array"

#token OF        "of"

#token PROCEDURE "procedure"

#token PUT       "put"

#token GET       "get"

#token SKIPLINE  "skipLine"

#token NEWLINE   "newLine"

#token EXIT      "exit"

#token WHEN      "when"

#token RETURN    "return"

#token IF        "if"

#token THEN      "then"

#token ELSE      "else"

#token ELSIF     "elsif"

#token WHILE     "while"

#token LOOP      "loop"

#token AND       "and"

#token OR        "or"

#token INTEGER   "Integer"

#token BOOLEAN   "Boolean"



// Operators

#token DOT        "."

#token BECOMES    ":="

#token COLON      ":"

#token SEMI       ";"

#token COMMA      ","

#token EQUALS     "="

#token LBRACKET   "\["

#token RBRACKET   "\]"

#token DOTDOT     ".."

#token LPAREN     "\("

#token RPAREN     "\)"

#token NOT_EQUALS "/="

#token LT         "<"

#token LTE        "<="

#token GT         ">"

#token GTE        ">="

#token PLUS       "\+"

#token MINUS      "\-"

#token TIMES      "\*"

#token DIV        "/"

#token MOD        "mod"

#token NOT        "not"



// Identifiers

#token IDENT "[a-zA-Z] [a-zA-Z0-9]*"



// String Literal Processing

// Separate Scanner class!

#lexclass STRING

#token           "\"\"" <<

                            more();

                            replchar('\"');

                        >>

#token BADSTRING "\n"   <<

                            replchar('\0');

                            newline();

                            mode(START);

                            /* error message */

                        >>

#token STRINGLIT "\""   <<

                            replchar('\0');

                            mode(START);

                        >>

#token           "~[]"  <<more();>>



#tokclass STRING_LITERAL {STRINGLIT BADSTRING}


Whoops! How'd that #tokclass sneak in there? I thought it would be nice to show how #tokclass is used while we are here. Remember that I mentioned that we want the parser to try to treat the incomplete string as a normal string? Chances are that if someone forgot the closing quote, it could cause other big problems, but we should at least try to keep going. So, in the grammar, we'll use STRING_LITERAL, which could be either STRINGLIT or BADSTRING. It's kind of shorthand for a rule that says

string_literal

  : STRINGLIT

  | BADSTRING

  ;


except that in actuality it is more efficient, because a set membership test is used instead of a function call for the subrule (see the sketch below). But I digress...
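To make "set membership test" concrete, here is a rough C sketch of the kind of check the generated parser performs -- the token numbers and set layout are illustrative, not what ANTLR actually emits:

#include <stdio.h>

/* Illustrative token numbers -- ANTLR assigns the real values. */
enum { STRINGLIT = 40, BADSTRING = 41, IDENT = 42 };

/* A #tokclass becomes a precomputed bitset over token numbers. */
static unsigned char STRING_LITERAL_set[256 / 8];

static void set_add(unsigned char *set, int tok)
{
    set[tok / 8] |= (unsigned char)(1u << (tok % 8));
}

static int set_member(const unsigned char *set, int tok)
{
    return (set[tok / 8] >> (tok % 8)) & 1;
}

int main(void)
{
    set_add(STRING_LITERAL_set, STRINGLIT);
    set_add(STRING_LITERAL_set, BADSTRING);

    /* Matching STRING_LITERAL is one mask-and-test, not a call
       into a string_literal() subrule. */
    printf("%d\n", set_member(STRING_LITERAL_set, STRINGLIT)); /* prints 1 */
    printf("%d\n", set_member(STRING_LITERAL_set, IDENT));     /* prints 0 */
    return 0;
}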

Let's move on to the grammar.