Context-Sensitive Scanning in ANTLR

The other day I was working on a DSL in Xtext, and I really wanted to use a lot of keywords. This would make content-assist incredibly helpful and easy, as well as making model creation trivial.

The trouble is that as the number of keywords increases, the number of possible identifier names decreases.

Unless we start thinking context-sensitive.

Traditional approaches usually follow one of two paths: either reserve every keyword globally, so it can never be used as an identifier, or switch the lexer between start states (as tools like flex allow) based on coarse lexical context.

I thought about it for a while and came up with what I think is a very elegant and simple solution. I was starting to write a paper on the idea, but after a lot of web searching I found a few examples of pretty much what I'm thinking, albeit with different implementations.

The Basic Idea

The concept breaks down as follows: at any point in a parse, the parser knows exactly which token types can come next. So rather than the lexer deciding on its own what the next token is, the parser hands the lexer the list of token types that are valid in the current context, and the lexer returns the first one that matches the input (or nothing, if none match).

Poof. We now have context-sensitive scanning. But let's take the idea a little further...

NOTE: Doing this could make error correction and reporting trickier... which context should be used to help resynch when we have nested language contexts? (I think error reports would need to just display the next n characters of input rather than the name of an expected token, as we can't really tell which token is next if there's no match...)

That's the basic idea. Let's look into some examples and details.

Details

I haven't tried coding anything up, but here are some details of what I'm thinking for discussion.

Text Stream Input Buffer

The stream being used for input will be shared between one or more lexers.

Only one lexer at a time will access the input buffer.

The buffer position can be queried and reset. OTTOMH I'm thinking that the buffer will have a stack of remembered positions to allow for different lookahead under nested parser calls. I still need to think through lookahead, since that lookahead can actually be coming from multiple parsers!
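
Here's a minimal sketch of what such a buffer might look like, with a stack of remembered positions on top of plain query/reset. All of these names (TextBuffer, mark, rewind, and so on) are hypothetical -- this is just to make the discussion concrete.

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical shared input buffer: one instance is shared by every lexer,
// and the parser can remember/restore positions to re-lex under a new context.
public class TextBuffer {
    private final String text;  // the full input (could be a real stream)
    private int pos;            // current scan position
    private final Deque<Integer> marks = new ArrayDeque<>(); // remembered positions

    public TextBuffer(String text) { this.text = text; }

    public int pos() { return pos; }
    public void reset(int position) { pos = position; }

    public void mark() { marks.push(pos); }      // remember the current position
    public void release() { marks.pop(); }       // commit: keep the current position
    public void rewind() { pos = marks.pop(); }  // back out of a failed attempt

    public boolean atEnd() { return pos >= text.length(); }
    public char la(int i) { return text.charAt(pos + i - 1); } // lookahead char
    public char consume() { return text.charAt(pos++); }
}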

Lexers

The biggest change in this approach is that the lexer is no longer an autonomous entity that figures out which token comes next all on its own. Instead, the parser explicitly passes a list of the token types that are possible in the current context, and the lexer returns the first one from that list that matches. (This could make ambiguity reporting trickier -- perhaps when generating the lexer we can determine which lexer rules are ambiguous with each other, and then use this information when generating the parser to flag ambiguous token references?)

I think of this as "pull scanning" rather than the traditional "push scanning".

Figure 1: The lexer's nextToken() method

// kinda pseudo-codey
// basic idea: take list of possible token types from parser
//             return first that matches or null if none match
public Token nextToken(int... possibleTokenTypes) {
    Token token;
    while (true) {
        for (int tokenType : possibleTokenTypes) {
            switch (tokenType) {
                case INT: token = INT(); if (token != null) return token; break;
                case IDENT: token = IDENT(); if (token != null) return token; break;
                // ... and so on: one case per candidate token type
            }
        }
        // attempt to gobble whitespace if no token matched
        // hmmm... whitespace may be contextual
        //         gotta think on this part a bit, but basically we should
        //         try to gobble whitespace if we haven't seen a valid token
        if (WS() != null) continue;
        if (SL_COMMENT() != null) continue;

        return null; // if nothing matches
    }
}
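
Figure 1 calls per-token-type rule methods like INT() and IDENT() without defining them. Here's a minimal sketch of what one might look like, reusing the hypothetical TextBuffer from above and assuming a Token(type, text, start, end) constructor: mark the shared buffer, attempt the match, and rewind on failure so the next candidate can be tried from the same spot.

// Hypothetical token-rule method: try to match an INT at the current
// buffer position; rewind and return null on failure so the next
// candidate token type starts from the same place.
Token INT() {
    buffer.mark();
    int start = buffer.pos();
    StringBuilder digits = new StringBuilder();
    while (!buffer.atEnd() && Character.isDigit(buffer.la(1))) {
        digits.append(buffer.consume());
    }
    if (digits.length() == 0) {
        buffer.rewind();  // no digits seen; back out of the attempt
        return null;
    }
    buffer.release();     // keep the new buffer position
    return new Token(INT, digits.toString(), start, buffer.pos() - 1);
}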

After I thought about this, I did some research and saw that this approach has been used in the yapps parser generator (http://theory.stanford.edu/~amitp/yapps/yapps-doc/node5.html#SECTION00052000000000000000). So much for that SIGPLAN paper I was gonna write...

This is significantly more flexible than the traditional start states that tools like flex allow. The contextual state truly depends on the parser state and lookahead is not an issue (as the parser actually requests context-sensitive lookahead).

Parsers

Each parser would have a lookahead method, something like the following.

Figure 2: The parser's lookahead() method

// kinda pseudo-codey, like Figure 1
Token lookahead(int la, Lexer lexer, int... possibleTokens) {
    if (lt[la] != null) {
        if (Arrays.equals(lt[la].context, possibleTokens)) // order counts!
            return lt[la]; // already have a valid lookahead token
        // otherwise reset the buffer for the next attempt
        lt[la] = null;
        buffer.reset(lt[la-1].end+1);
        // need to reset to char after last accepted token
        // (we probably also need to invalidate lt[la+1..], since those
        //  tokens were lexed after the old lt[la])
        // note that different scanner states might have different whitespace
        // gobbling...
    }
    lt[la] = lexer.nextToken(possibleTokens);
    if (lt[la] != null)
        lt[la].context = possibleTokens; // remember the candidate list
    return lt[la];
}

void consume(Lexer lexer, int type) {
    Token lt1 = lookahead(1, lexer, type);
    if (lt1 == null || lt1.type != type)
        throw mismatch; // pseudo-code for a mismatched-token error
    // shift lts & return
}
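
For reference, here's the shape of Token that these methods assume; the field names are my guesses based on the code above. The context field records which candidate list produced the token, so lookahead() can tell whether a buffered token is still valid in the current context.

// Hypothetical Token class matching the fields used above.
public class Token {
    public final int type;        // token type (INT, IDENT, ...)
    public final String text;     // the matched text
    public final int start, end;  // character positions in the shared buffer
    public int[] context;         // candidate list this token was lexed under

    public Token(int type, String text, int start, int end) {
        this.type = type;
        this.text = text;
        this.start = start;
        this.end = end;
    }
}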

Now we can add contextual token requests to generated rules. Perhaps we have a language with a context-sensitive keyword "field" that describes a field in a bean, where "field" could also be an identifier (like the name of the type containing the field).

Figure 3: Simple DSL with context-sensitive keywords

type field { // type is keyword; field is ident
    field name;  // field is keyword; name is ident 
    field type;  // field is keyword; type is ident
}

This is really contrived, but my brain is a little tired... Note that I'll start this example with the keywords defined in a separate lexer. This isn't necessary, but it helps to demonstrate the possible use of multiple lexers.
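
For concreteness, the token definitions might split across the two lexers like this (hypothetical rules, written in the same notation as the grammar figures; EQUALS and STRING show up in the assignment example later):

// lexer1: just the keywords
TYPE   : 'type' ;
FIELD  : 'field' ;

// lexer2: everything else
IDENT  : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')* ;
LB     : '{' ;
RB     : '}' ;
SEMI   : ';' ;
EQUALS : '=' ;
STRING : '\'' ~'\''* '\'' ;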

Note that I haven't dug around in the generated code in a while, but hopefully these bits will help get the concept across so that if y'all like the idea you can carry it forward.

Figure 4: Definition of the type and field rules

type
  : TYPE IDENT LB field* RB ;

field : FIELD IDENT SEMI;

Figure 5: Oversimplified rule methods

public void type() {
    consume(lexer1, TYPE);
    // we will only attempt to lookup IDENT in this context
    //   no other lexer rules will be attempted
    consume(lexer2, IDENT);
    consume(lexer2, LB);
    while(matches) {
        field();
    }
    consume(lexer2, RB);
}
    
public void field() {
    consume(lexer1, FIELD);
    consume(lexer2, IDENT);
    consume(lexer2, SEMI);
}
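
The while(matches) placeholder in type() would presumably expand into a contextual lookahead test. A minimal sketch, reusing lookahead() from Figure 2: loop as long as the FIELD keyword can be lexed in this context.

while (lookahead(1, lexer1, FIELD) != null) {
    field();
}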

Something more interesting might have alts. Suppose we allowed assignments in that DSL:

Figure 6: Evil example of doom

type field { // type is keyword; field is ident
    field name;  // field is keyword; name is ident 
    field type;  // field is keyword; type is ident
    name = 'Scott'; // name is ident!
}

Now this gets trickier: deciding between a field definition and an assignment is contextual based on the second lookahead token (LA2). We might have grammar rules like:

Figure 7: Grammar rules for evilness

type
  : TYPE IDENT LB member* RB ;

member
  : field
  | assign
  ;

field : FIELD IDENT SEMI;
assign : IDENT EQUALS STRING SEMI;

Let's assume all tokens are in the same lexer (just to show multi-token-type calls to nextToken). We might have code like:

Figure 8: Generated rule methods for evil example

public void type() {
    consume(lexer, TYPE);
    consume(lexer, IDENT);
    consume(lexer, LB);
    while(matches) {
        member();
    }
    consume(lexer, RB);
}
    
public void member() {
    Token lt1 = lookahead(1, lexer, FIELD, IDENT);
    if (lt1 == null)
       error!
    if (lt1.type == FIELD) {
        // might be a field definition... check lt2
        Token lt2 = lookahead(2, lexer, IDENT);
        if (lt2 != null) {
            field();
            return;
        }
    }
    Token lt2 = lookahead(2, lexer, EQUALS);
    if (lt2 == null)
        error!
    assign();  
}

public void field() {
    consume(lexer, FIELD);
    consume(lexer, IDENT);
    consume(lexer, SEMI);
}
public void assign() {
    consume(lexer, IDENT);
    consume(lexer, EQUALS);
    consume(lexer, STRING);
    consume(lexer, SEMI);
}
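
To trace the name = 'Scott'; line from Figure 6 through member(): the first lookahead call hands the lexer the candidates FIELD and IDENT; FIELD fails to match "name", IDENT succeeds, so lt1.type is IDENT and the FIELD branch is skipped. The second lookahead then matches = as EQUALS, and we dispatch to assign().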

If we had multiple lexers, the code might be a little different:

Figure 9: Generated rule methods for evil example (multiple lexers)

public void type() {
    consume(lexer1, TYPE);
    consume(lexer2, IDENT);
    consume(lexer2, LB);
    while(matches) {
        member();
    }
    consume(lexer2, RB);
}
    
public void member() {
    Token lt1 = lookahead(1, lexer1, FIELD);
    if (lt1 != null) {
        // might be a field definition... check lt2
        Token lt2 = lookahead(2, lexer2, IDENT);
        if (lt2 != null) {
            field();
            return;
        }
    }
    lt1 = lookahead(1, lexer2, IDENT);
    if (lt1 == null)
        error!
    Token lt2 = lookahead(2, lexer2, EQUALS);
    if (lt2 == null)
        error!
    assign();  
}

public void field() {
    consume(lexer1, FIELD);
    consume(lexer2, IDENT);
    consume(lexer2, SEMI);
}
public void assign() {
    consume(lexer2, IDENT);
    consume(lexer2, EQUALS);
    consume(lexer2, STRING);
    consume(lexer2, SEMI);
}

Wrapup

Hopefully this gets the idea across. I think we could very easily allow context-sensitive keywords in ANTLR if we wanted to. In addition, this might actually improve the performance of lexers, since they would only attempt to match tokens that might be valid in the current parser context.


Similar Ideas

Here are a couple of similar ideas I found online. I came up with the idea independently, but I can't claim it's original. Sigh...

(It took quite a bit of digging with various search terms to find these...)