Statements
There are seven statements in XL (I told you it was a small language): assignment, exit, procedure call, return, if-then, loop and I/O.
statement
: assignmentStatement
| exitStatement
| procedureCallStatement
| returnStatement
| ifStatement
| loopStatement
| ioStatement
;
Assignment Statement
The assignment statement is pretty much like Pascal's assignment statement, except that there are no type conversions allowed.
assignmentStatement
: variableReference BECOMES expression SEMI
;
Very simple, but the pieces (variable and expression) are somewhat complex -- we'll look at them after all the statements are done. We'll look at variableReference later...
Exit Statement
The exit statement tells a loop it's done. It looks like
exitStatement
: EXIT WHEN expression SEMI
;
Procedure Call Statement
A procedure call takes the form
procedureCallStatement
: IDENT {actualParameters} SEMI
;
actualParameters
: LPAREN { expression (COMMA expression)* } RPAREN
;
I extended XL just a bit for this. The original XL compiler I wrote had
actualParameters
: LPAREN expression (COMMA expression)* RPAREN
;
which basically means that if you have parens, you must have at least one expression in them. I thought it would be nice to allow myfunc() as a valid procedure call.
Return Statement
The XL return statement tells a procedure or function to return immediately. Keeping in mind that I did not implement functions (yet, at least), return cannot take an expression argument. Therefore, it's just
returnStatement
: RETURN SEMI
;
If Statement
The XL "if-then-elsif-else-end if" statement is a fairly simple if type statement. This is because it is defined as a closed if statement; that is, there is a definite "I'm done" part to the if statement. Those of you who are familiar with the "dangling else" conflict present in the C language may realize that if you close the if statement, the conflict disappears. First, I'll present the rule for XL, then I'll talk a bit about how to do the if statement for C.
XL's if statement looks like:
ifStatement
: IF expression THEN
( statement )*
( ELSIF expression THEN
( statement )*
)*
{ ELSE ( statement )* }
END IF SEMI
;
Which is the most straightforward way to write the rule. However, you may notice some redundancy in the rule -- there are two separate places where we'll need to deal with expression evaluation. Not a big deal now, but we'd need to duplicate code later. After staring at the rule for a bit, we might try splitting out the "expression THEN possible ELSE" stuff into a separate rule, and call it recursively if we get an ELSIF. In other words:
ifStatement
: IF ifPart END IF SEMI
;
ifPart
: expression THEN
( statement )*
{ ELSIF ifPart
| ELSE (statement)*
}
;
Which is much less redundant.
Earlier, I threatened to look at the "dangling else" problem. In PCCTS, this is an easy problem to solve. Basically, the problem is that in a statement like
if (x)
if (y)
myfunc();
else
yourfunc();
It isn't terribly apparent (to the compiler at least) which if-condition owns the "else" clause. The compiler can't tell if the inner if is an "if-without-an-else" inside an "if-with-an-else," or vice-versa.
On the other hand, the language designers decided that the else will be owned by its most recent if that can possibly own it. In this case, the else would clearly belong to the (y) condition. However, the compiler doesn't know this.
In a yacc grammar, an if-statement for C might be written
ifStatement
: IF LPAREN expression RPAREN statementList
| IF LPAREN expression RPAREN statementList
ELSE statementList
;
which causes yacc to report a "shift/reduce" conflict -- yacc can't tell if you intended to have an "if-without-else" (the first alternative) be nested inside an "if-with-else" (the second alternative.) It doesn't know if it should reduce the first alternative when it sees the ELSE, or whether it should keep looking (shift the ELSE.) Fortunately, yacc's default is to shift the else, basically grouping the ELSE with its closest IF, which is exactly what we want.
To get yacc to stop reporting this (and explicitly specify which alternative we intend) requires the use of either some very messy (and hard-to-follow) rules, or explicit precedence assignment. Neither is pretty, or very readable for someone else.
In PCCTS, the C if statement would look like
ifStatement
: IF LPAREN expression RPAREN (statement)*
{ ELSE (statement)* }
;
which, of course, causes ANTLR to report a conflict. To get rid of this conflict in PCCTS, we can use a syntactic predicate (aka "guess" mode) which basically says "try this -- if it works, use it. Otherwise, try the next alt."
Applying a nice syntactic predicate to the ifStatement rule above yields
ifStatement
: IF LPAREN expression RPAREN (statement)*
{ (ELSE)? ELSE (statement)*}
;
which can a bit easier to read if you break the {...} subrule into a "this-or-nothing" subrule.
ifStatement
: IF LPAREN expression RPAREN (statement)*
( (ELSE)? ELSE (statement)*
| ( ) // nothing
)
;
The predicate says "try matching an ELSE. If you can, then use the first alt. Otherwise, use the second alt (the "nothing" alternative.) Note that we will be using the more compact {...} form for our compiler.
Loop Statement
The loop statement in XL is pretty simple. (Anyone sick of the word "simple" yet? But it's true, really it is.) It's a set of statements enclosed by loop and end loop, with an optional while clause in front of it. It looks like
loopStatement
: {WHILE expression} LOOP
( statement )*
END LOOP SEMI
;
Not too bad, eh? Well, just to confuse matters a bit, what would happen if XL allowed an END statement, like in BASIC? (BASIC's END statement exits the program immediately.) Come to think of it, what the heck, let's add one!
endStatement
: END SEMI
;
Now why would I be mentioning this here? Let's look at some XL sample code:
program sponge =
var x : integer := 1;
begin
while x < 10 loop
x := x + 1;
end;
end loop
end sponge.
Kind of contrived (and useless, I know), and simple, until you look closer...
What happens when the parser sees the end inside the loop? Let's look at the loopStatement rule again (repeated here 'cause you're a programmer which means you're too lazy to turn back a page):
loopStatement
: {WHILE expression} LOOP
( statement )*
END LOOP SEMI
;
Assuming k=1 lookahead, we have a problem: how does the parser determine that the end in loop is an endStatement and not the end that is the first part of end loop? The answer is simple: with k=1 lookahead, it
can't. It will always try to make it an endStatement, and will get a syntax error on loop. (Try this in the final recognizer, just for kicks.) You'll also get a conflict warning in ANTLR. (Note that you'll have the same conflict with end
if in ifStatement -- our solution should, and will, handle both situations.)
So how do we solve this? Two possibilities: use k=2 lookahead, or use a syntactic predicate.
To use k=2 lookahead, you'd just need to specify -k 2 on the ANTLR command line. This will make the ANTLR run take a bit more time, but should only affect checks that require two-token lookahead.
In our grammar, we only have one place (this one) that causes a conflict. (Remember that our ifStatement is closed with an end if, so there's no dangling else conflict.) So let's fix the problem in that spot using
syntactic predicates. (I also want to cover syntactic predicates a bit, so I won't go with the easier -k 2 fix... In real life, I think it might make a slightly faster parser to use -k 2 for this case. Syntactic predicates should probably be reserved for
cases when you don't know how much lookaead you'll really need, or the known lookahead is lengthy.)
We can tell ANTLR exactly how to distinguish an endStatement from the end of a loop. How? Simple (for me at least, 'cause I already know...):
statement
: (END SEMI)? endStatement
| assignmentStatement
. . .
;
Now what the heck does that mean? Basically, it says "try to match END, then SEMI." If it works, then we really want the endStatement. If we didn't get a match, try the next alternative, and so on...
There's one problem that I have with the above statement. I don't like spelling out END SEMI in a rule other than endStatement. It just seems like bad isolation of rules. I'm a bit picky at times, though. What I'd rather do
is say to myself, hmmm, END SEMI is an endStatement, so why not put that inside the ( )? -- like this:
statement
: (endStatement)? endStatement
| assignmentStatement
. . .
;
(Note: This will be a bit less efficient than spelling it out -- the "guess" that ANTLR will make will need to make a function call to endStatement to check for END SEMI rather than just directly check for END
SEMI in the statement rule. However, as long as you don't do this too much, the penalty for a function call is minimal on most machines nowadays. Just keep it to a minimum!)
Does that look redundant to you? It does to me. Fortunately, ANTLR provides a shortcut: if the contents of the syntactic predicate are the same as the thing it's predicting, you only need to specify the predicate:
statement
: (endStatement)?
| assignmentStatement
. . .
;
Much better now, isn't it. Let's keep this in XL. I encourage you to play with the recognizer grammar a bit to explore this conflict and its solution.
What if endStatement were a bit more involved, such as
endStatement
: END SEMI expression SEMI
;
(Kinda useless, but so are most examples.) All we need to predict it are the END and SEMI. If we keep the predicate as it is, (endStatement)?, the "guess" would keep going through expression (which could be pretty long) and the next SEMI. That would be mighty wasteful. So, in a case where we want to limit the lookahead, it's probably better to sacrifice readability a bit and limit the lookahead to just END SEMI:
statement
: (END SEMI)? endStatement
| assignmentStatement
. . .
;
(Look familiar?) You may ask "why not just put the (END SEMI)? in the endStatement rule itself?" Unfortunately, ANTLR complains about a rule like
endStatement
: (END SEMI)? END SEMI expression SEMI
;
basically telling you "why bother with a predicate -- there's only one alternative!" I think I'd prefer to be able to specify it like this, and have ANTLR hoist the predicate up into the statement rule (and possibly optimize it out of the endStatement
rule altogether.) But let's work with the great tool we currently have, shall we?
So, we end up with two new rules:
loopStatement
: {WHILE expression} LOOP
(statement)*
END LOOP SEMI
;
endStatement
: END SEMI
;
and we modify the statement rule to look like
statement
: (endStatement)?
| assignmentStatement
| exitStatement
| procedureCallStatement
| returnStatement
| ifStatement
| loopStatement
| ioStatement
;
But is the order important in the statement rule? Of course, but it's not so obvious until you look at what the generated code will say. The following isn't actual generated code, but pseudocode that will show what happens in the statement
rule:
statement
{
if next token is END and guess(endStatement) worked
do endStatement
else if next token is IDENT
do assignmentStatement
else if next token is EXIT
do exitStatement
else if next token is IDENT
do procedureCallStatement
else if next token is RETURN
do returnStatement
else if next token is IF
do ifStatement
else if next token is (WHILE or LOOP)
do loopStatement
else if next token is (PUT or GET)
do ioStatement (defined later)
}
A few things to notice here:
- Most of the conditions in this code are simple, one-test comparisons
- The condition for
endStatementis much more complex - Oooops! I see a conflict between
assignmentStatementandprocedureCallStatement-- hold this thought for a minute... - Notice that
loopStatement's lookup condition had two parts to it: WHILE or LOOP -- this is because the WHILE section is optional, making the FIRST set forloopStatementbe {WHILE, LOOP} (ioStatementwill be defined in a bit as a GET or PUT call...) - All of the conditions are evaluated in the order that the rules are specified
- All of the conditions except
assignmentStatementandprocedureCallStatementhave a unique test condition; their FIRST sets are disjoint
Our goal when parsing is to make things as fast as possible. Therefore, based on the above information, we should
- Order the rules such that the most likely/most used rules come first -- that means less conditions to test
- Order the rules such that rules with nasty conditions come later. Note that if the nasty condition is required to disambiguate two alternatives, it needs to be in the first of the two alternatives
If possible, determine which statements are more likely to be used. Do some statistical sampling of user code (if possible) and order the rules accordingly. In this case, I think it's less likely that the user would use an END statement, and on top of that, it's more complex, so let's put it at the bottom of the list.
We also need to disambiguate assigmentStatement and procedureCallStatement. Recall that they look like:
assignmentStatement
: variable BECOMES expression SEMI
;
procedureCallStatement
: IDENT {actualParameters} SEMI
;
So, how do we tell them apart? First, let's look at the FIRST sets of each (the sets that contain all tokens that can be the first terminal in a rule.) Impossible until we define variableReference, so let's do that now (sorry about the
jumping around here..)
variableReference
: IDENT
( LBRACKET expression RBRACKET
| DOT IDENT
)*
;
which says a variable reference is an IDENT followed by any number of array or field dereferences.
Without going into a long discussion about FIRST sets, what is the FIRST set of assignmentStatement? By inspection, it's whatever the FIRST set is of variableReference, which is {IDENT}. Again, by inspection, the FIRST set of procedureCallStatement
is also {IDENT}. There, the conflict becomes very clear. On one token alone, IDENT, the parser cannot determine which rule is proper. Looking at the pseudocode above, it becomes obvious that the parser will, in fact, always choose whichever rule is
specified first with a FIRST set of {IDENT}, which is bad. So, we need to add a predicate to one of the rules. But which one? Lets look at what would be needed for each. What makes each unique? assignmentStatement could start out with
IDENT BECOMES . . . // just ident := expression
IDENT LBRACKET . . . // starting with array reference
IDENT DOT . . . // starting with field reference
and procedureCallStatement could start out with (after looking at actualParameters)
IDENT SEMI // No parameters (or parens)
IDENT LPAREN . . . // parameters (or just parens)
Obviously, they are unique with two tokens of lookahead. So, we're back to the decision of "do we bump lookahead to k=2, or use predicates." k=2 would resolve the conflict, but my gut says we should use a syntactic predicate here. But which one?
The predicates would look like:
statement
: . . .
|(IDENT (BECOMES|LBRACKET|DOT))? assignmentStatement
|(IDENT (SEMI|LPAREN))? procedureCallStatement
Well, the predicate for procedureCallStatement is actually a bit simpler. We probably should use it and put its alternative ahead of the one for assignmentStatement.\ If your research proves that assignmentStatements
are found in code more often than procedureCallStatements, you may want to reconsider, though.
So what do we end up with? Using our basic rule of "keep the simple stuff near the top" and adding syntactic predicates for endStatement and procedureCallStatement, we get (without taking into account statistical
frequency of each statement):
statement
: exitStatement
| returnStatement
| ifStatement
| loopStatement
| ioStatement
| (IDENT (LPAREN|SEMI))? procedureCallStatement
| assignmentStatement
| (endStatement)?
;
Now, on to our final statement...
I/O Statement
The I/O statement is really a procedure call to four predefined functions, get, put, skipLine and newLine. These really should be handled just like any other procedure call, but I'm putting them here just because that's how I did it the first time around, and because I don't want to mess with special pre-defined identifiers in the symbol table later...
ioStatement
: PUT LPAREN expression RPAREN SEMI
| GET LPAREN variable RPAREN SEMI
| SKIPLINE {LPAREN RPAREN} SEMI
| NEWLINE {LPAREN RPAREN} SEMI
;
Note that I was nice and allow the user to put "()" after skipLine and newLine, just like we did for procedure calls in general.
Easy enough. Now onto the sticky stuff: expressions.