Statements

There are seven statements in XL (I told you it was a small language): assignment, exit, procedure call, return, if-then, loop and I/O.

statement

  : assignmentStatement

  | exitStatement

  | procedureCallStatement

  | returnStatement

  | ifStatement

  | loopStatement

  | ioStatement

  ;

           

Assignment Statement

The assignment statement is pretty much like Pascal's assignment statement, except that there are no type conversions allowed.

assignmentStatement

  : variableReference BECOMES expression SEMI

  ;

           

Very simple, but the pieces (variable and expression) are somewhat complex -- we'll look at them after all the statements are done. We'll look at variableReference later...


Exit Statement

The exit statement tells a loop it's done. It looks like

exitStatement

  : EXIT WHEN expression SEMI

  ;

           

Procedure Call Statement

A procedure call takes the form

procedureCallStatement

  : IDENT {actualParameters} SEMI

  ;



actualParameters

  : LPAREN { expression (COMMA expression)* } RPAREN

  ;

           

I extended XL just a bit for this. The original XL compiler I wrote had

actualParameters

  : LPAREN expression (COMMA expression)* RPAREN

  ;

           

which basically means that if you have parens, you must have at least one expression in them. I thought it would be nice to allow myfunc() as a valid procedure call.


Return Statement

The XL return statement tells a procedure or function to return immediately. Keeping in mind that I did not implement functions (yet, at least), return cannot take an expression argument. Therefore, it's just

returnStatement

  : RETURN SEMI

  ;

           

If Statement

The XL "if-then-elsif-else-end if" statement is a fairly simple if type statement. This is because it is defined as a closed if statement; that is, there is a definite "I'm done" part to the if statement. Those of you who are familiar with the "dangling else" conflict present in the C language may realize that if you close the if statement, the conflict disappears. First, I'll present the rule for XL, then I'll talk a bit about how to do the if statement for C.

XL's if statement looks like:

ifStatement

  : IF expression THEN

      ( statement )*

      ( ELSIF expression THEN

        ( statement )*

      )*

      { ELSE ( statement )* }

    END IF SEMI

  ;

           

Which is the most straightforward way to write the rule. However, you may notice some redundancy in the rule -- there are two separate places where we'll need to deal with expression evaluation. Not a big deal now, but we'd need to duplicate code later. After staring at the rule for a bit, we might try splitting out the "expression THEN possible ELSE" stuff into a separate rule, and call it recursively if we get an ELSIF. In other words:

ifStatement

  : IF ifPart END IF SEMI

  ;



ifPart

  : expression THEN

    ( statement )*

    { ELSIF ifPart

    | ELSE (statement)*

    }

  ;

           

Which is much less redundant.

Earlier, I threatened to look at the "dangling else" problem. In PCCTS, this is an easy problem to solve. Basically, the problem is that in a statement like

if (x)

    if (y)

        myfunc();

    else

        yourfunc();

           

It isn't terribly apparent (to the compiler at least) which if-condition owns the "else" clause. The compiler can't tell if the inner if is an "if-without-an-else" inside an "if-with-an-else," or vice-versa.

On the other hand, the language designers decided that the else will be owned by its most recent if that can possibly own it. In this case, the else would clearly belong to the (y) condition. However, the compiler doesn't know this.

In a yacc grammar, an if-statement for C might be written

ifStatement

  : IF LPAREN expression RPAREN statementList

  | IF LPAREN expression RPAREN statementList

      ELSE statementList

  ;

           

which causes yacc to report a "shift/reduce" conflict -- yacc can't tell if you intended to have an "if-without-else" (the first alternative) be nested inside an "if-with-else" (the second alternative.) It doesn't know if it should reduce the first alternative when it sees the ELSE, or whether it should keep looking (shift the ELSE.) Fortunately, yacc's default is to shift the else, basically grouping the ELSE with its closest IF, which is exactly what we want.

To get yacc to stop reporting this (and explicitly specify which alternative we intend) requires the use of either some very messy (and hard-to-follow) rules, or explicit precedence assignment. Neither is pretty, or very readable for someone else.

In PCCTS, the C if statement would look like

ifStatement

  : IF LPAREN expression RPAREN (statement)*

    { ELSE (statement)* }

  ;

           

which, of course, causes ANTLR to report a conflict. To get rid of this conflict in PCCTS, we can use a syntactic predicate (aka "guess" mode) which basically says "try this -- if it works, use it. Otherwise, try the next alt."

Applying a nice syntactic predicate to the ifStatement rule above yields

ifStatement

  : IF LPAREN expression RPAREN (statement)*

    { (ELSE)? ELSE (statement)*}

  ;

           

which can a bit easier to read if you break the {...} subrule into a "this-or-nothing" subrule.

ifStatement

  : IF LPAREN expression RPAREN (statement)*

    ( (ELSE)? ELSE (statement)*

    | ( ) // nothing

    )

  ;

           

The predicate says "try matching an ELSE. If you can, then use the first alt. Otherwise, use the second alt (the "nothing" alternative.) Note that we will be using the more compact {...} form for our compiler.


Loop Statement

The loop statement in XL is pretty simple. (Anyone sick of the word "simple" yet? But it's true, really it is.) It's a set of statements enclosed by loop and end loop, with an optional while clause in front of it. It looks like

loopStatement

  : {WHILE expression} LOOP

        ( statement )*

    END LOOP SEMI

  ;

           

Not too bad, eh? Well, just to confuse matters a bit, what would happen if XL allowed an END statement, like in BASIC? (BASIC's END statement exits the program immediately.) Come to think of it, what the heck, let's add one!

endStatement

  : END SEMI

  ;

           

Now why would I be mentioning this here? Let's look at some XL sample code:

program sponge =

    var x : integer := 1;



    begin

        while x < 10 loop

            x := x + 1;

            end;

        end loop

    end sponge.

           

Kind of contrived (and useless, I know), and simple, until you look closer...

What happens when the parser sees the end inside the loop? Let's look at the loopStatement rule again (repeated here 'cause you're a programmer which means you're too lazy to turn back a page):

loopStatement

  : {WHILE expression} LOOP

        ( statement )*

    END LOOP SEMI

  ;

           

Assuming k=1 lookahead, we have a problem: how does the parser determine that the end in loop is an endStatement and not the end that is the first part of end loop? The answer is simple: with k=1 lookahead, it can't. It will always try to make it an endStatement, and will get a syntax error on loop. (Try this in the final recognizer, just for kicks.) You'll also get a conflict warning in ANTLR. (Note that you'll have the same conflict with end if in ifStatement -- our solution should, and will, handle both situations.)

So how do we solve this? Two possibilities: use k=2 lookahead, or use a syntactic predicate.

To use k=2 lookahead, you'd just need to specify -k 2 on the ANTLR command line. This will make the ANTLR run take a bit more time, but should only affect checks that require two-token lookahead.

In our grammar, we only have one place (this one) that causes a conflict. (Remember that our ifStatement is closed with an end if, so there's no dangling else conflict.) So let's fix the problem in that spot using syntactic predicates. (I also want to cover syntactic predicates a bit, so I won't go with the easier -k 2 fix... In real life, I think it might make a slightly faster parser to use -k 2 for this case. Syntactic predicates should probably be reserved for cases when you don't know how much lookaead you'll really need, or the known lookahead is lengthy.)

We can tell ANTLR exactly how to distinguish an endStatement from the end of a loop. How? Simple (for me at least, 'cause I already know...):

statement

  : (END SEMI)? endStatement

  | assignmentStatement

    . . .

  ;

           

Now what the heck does that mean? Basically, it says "try to match END, then SEMI." If it works, then we really want the endStatement. If we didn't get a match, try the next alternative, and so on...

There's one problem that I have with the above statement. I don't like spelling out END SEMI in a rule other than endStatement. It just seems like bad isolation of rules. I'm a bit picky at times, though. What I'd rather do is say to myself, hmmm, END SEMI is an endStatement, so why not put that inside the ( )? -- like this:

statement

  : (endStatement)? endStatement

  | assignmentStatement

    . . .

  ;

           

(Note: This will be a bit less efficient than spelling it out -- the "guess" that ANTLR will make will need to make a function call to endStatement to check for END SEMI rather than just directly check for END SEMI in the statement rule. However, as long as you don't do this too much, the penalty for a function call is minimal on most machines nowadays. Just keep it to a minimum!)

Does that look redundant to you? It does to me. Fortunately, ANTLR provides a shortcut: if the contents of the syntactic predicate are the same as the thing it's predicting, you only need to specify the predicate:

statement

  : (endStatement)?

  | assignmentStatement

    . . .

  ;

           

Much better now, isn't it. Let's keep this in XL. I encourage you to play with the recognizer grammar a bit to explore this conflict and its solution.

What if endStatement were a bit more involved, such as

endStatement

  : END SEMI expression SEMI

  ;

           

(Kinda useless, but so are most examples.) All we need to predict it are the END and SEMI. If we keep the predicate as it is, (endStatement)?, the "guess" would keep going through expression (which could be pretty long) and the next SEMI. That would be mighty wasteful. So, in a case where we want to limit the lookahead, it's probably better to sacrifice readability a bit and limit the lookahead to just END SEMI:

statement

  : (END SEMI)? endStatement

  | assignmentStatement

    . . .

  ;

           

(Look familiar?) You may ask "why not just put the (END SEMI)? in the endStatement rule itself?" Unfortunately, ANTLR complains about a rule like

endStatement

  : (END SEMI)? END SEMI expression SEMI

  ;

           

basically telling you "why bother with a predicate -- there's only one alternative!" I think I'd prefer to be able to specify it like this, and have ANTLR hoist the predicate up into the statement rule (and possibly optimize it out of the endStatement rule altogether.) But let's work with the great tool we currently have, shall we?

So, we end up with two new rules:

loopStatement

  : {WHILE expression} LOOP

        (statement)*

    END LOOP SEMI

  ;



endStatement

  : END SEMI

  ;

           

and we modify the statement rule to look like

statement

  : (endStatement)?

  | assignmentStatement

  | exitStatement

  | procedureCallStatement

  | returnStatement

  | ifStatement

  | loopStatement

  | ioStatement 

  ;

           

But is the order important in the statement rule? Of course, but it's not so obvious until you look at what the generated code will say. The following isn't actual generated code, but pseudocode that will show what happens in the statement rule:

statement

{

    if next token is END and guess(endStatement) worked

        do endStatement

    else if next token is IDENT

        do assignmentStatement

    else if next token is EXIT

        do exitStatement

    else if next token is IDENT

        do procedureCallStatement

    else if next token is RETURN

        do returnStatement

    else if next token is IF

        do ifStatement

    else if next token is (WHILE or LOOP)

        do loopStatement

    else if next token is (PUT or GET)

        do ioStatement (defined later)

}

           

A few things to notice here:

Our goal when parsing is to make things as fast as possible. Therefore, based on the above information, we should

If possible, determine which statements are more likely to be used. Do some statistical sampling of user code (if possible) and order the rules accordingly. In this case, I think it's less likely that the user would use an END statement, and on top of that, it's more complex, so let's put it at the bottom of the list.

We also need to disambiguate assigmentStatement and procedureCallStatement. Recall that they look like:

assignmentStatement

  : variable BECOMES expression SEMI

  ;



procedureCallStatement

  : IDENT {actualParameters} SEMI

  ;

           

So, how do we tell them apart? First, let's look at the FIRST sets of each (the sets that contain all tokens that can be the first terminal in a rule.) Impossible until we define variableReference, so let's do that now (sorry about the jumping around here..)

variableReference

  : IDENT

    ( LBRACKET expression RBRACKET

    | DOT IDENT

    )*

  ;

           

which says a variable reference is an IDENT followed by any number of array or field dereferences.

Without going into a long discussion about FIRST sets, what is the FIRST set of assignmentStatement? By inspection, it's whatever the FIRST set is of variableReference, which is {IDENT}. Again, by inspection, the FIRST set of procedureCallStatement is also {IDENT}. There, the conflict becomes very clear. On one token alone, IDENT, the parser cannot determine which rule is proper. Looking at the pseudocode above, it becomes obvious that the parser will, in fact, always choose whichever rule is specified first with a FIRST set of {IDENT}, which is bad. So, we need to add a predicate to one of the rules. But which one? Lets look at what would be needed for each. What makes each unique? assignmentStatement could start out with

IDENT BECOMES . . . // just ident := expression

IDENT LBRACKET . . .     // starting with array reference

IDENT DOT . . .          // starting with field reference

           

and procedureCallStatement could start out with (after looking at actualParameters)

IDENT SEMI    // No parameters (or parens)

IDENT LPAREN . . . // parameters (or just parens)

           

Obviously, they are unique with two tokens of lookahead. So, we're back to the decision of "do we bump lookahead to k=2, or use predicates." k=2 would resolve the conflict, but my gut says we should use a syntactic predicate here. But which one?

The predicates would look like:

statement

  : . . .

  |(IDENT (BECOMES|LBRACKET|DOT))? assignmentStatement

  |(IDENT (SEMI|LPAREN))? procedureCallStatement

           

Well, the predicate for procedureCallStatement is actually a bit simpler. We probably should use it and put its alternative ahead of the one for assignmentStatement.\ If your research proves that assignmentStatements are found in code more often than procedureCallStatements, you may want to reconsider, though.

So what do we end up with? Using our basic rule of "keep the simple stuff near the top" and adding syntactic predicates for endStatement and procedureCallStatement, we get (without taking into account statistical frequency of each statement):

statement

  : exitStatement

  | returnStatement

  | ifStatement

  | loopStatement

  | ioStatement

  | (IDENT (LPAREN|SEMI))? procedureCallStatement

  | assignmentStatement

  | (endStatement)?

  ;

           

Now, on to our final statement...


I/O Statement

The I/O statement is really a procedure call to four predefined functions, get, put, skipLine and newLine. These really should be handled just like any other procedure call, but I'm putting them here just because that's how I did it the first time around, and because I don't want to mess with special pre-defined identifiers in the symbol table later...

ioStatement

  : PUT LPAREN expression RPAREN SEMI

  | GET LPAREN variable RPAREN SEMI

  | SKIPLINE {LPAREN RPAREN} SEMI

  | NEWLINE {LPAREN RPAREN} SEMI

  ;

           

Note that I was nice and allow the user to put "()" after skipLine and newLine, just like we did for procedure calls in general.

Easy enough. Now onto the sticky stuff: expressions.