Scannerless Parsers and Compile-Time Evaluation

April 08, 2005, at 12:00 AM

I have an implementation of Tomita's GLR Algorithm, which is able to parse any context-free grammar, including ambiguous ones. Earley's Algorithm seems to be getting more attention lately, and I'm not sure why, since as far as I know it's strictly less efficient. What I am attempting to do is expand my framework to be a scannerless parser, which would allow rapid development of parsers for just about anything, in native Lisp.

Tomita's Algorithm, like the traditional yacc LALR(1) algorithm and its variants, precomputes a parse table which drives the actual parse. The difference is that the parse stack isn't really a stack at all, it's a DAG, which means that when you there are shift-reduce or reduce-reduce conflicts, the parser is able to nondeterministically try both branches. A backtracking recursive-descent parser can do that, too - but here there's no exponential blow-up.

A scannerless parser essentially combines the roles of lex and yacc. Of course, the reason lexical analyzers have typically been separate is that most programming-language grammars are not context-free. Happily, GLR lends itself handily to extensions which address this. Right now, I have implemented one such extension, reject productions, and I am thinking about whether I should also implement follow restrictions. Reject productions say "don't allow anything that matches this," and follow restrictions say "don't allow matches which are followed by this character."

I'm pretty much done with adding features now, and am working on packaging and polishing. I still consider the code alpha-quality, and I'm going to need to clean it up quite a bit before I put it up...

This brings us to the subject of compile-time evaluation. Since generating the parse table does take some time, of course it needs to be pregenerated and saved for the next invocation of the parser. Traditional parser generators do this by outputting a new source file which contains the entire runtime component of the parser, as well as the parse table. For example, your input file would be grammar.g; the generator would output grammar.cpp, which you would then compile into grammar.o.

Although I haven't entirely convinced myself that there's anything wrong with doing things that way, with Lisp's introspective capabilities, it seems as though it ought to be able to do better than this. Couldn't the parse table just be computed when you compile the file, and stored into the fasl?

I'm not sure exactly how to do that, though. It looks like eval-when doesn't help with that; it can compute the parse table at compile time, but I don't see any way to save it for later loading. The best I can think of is to define a macro which does the computation at macro-expansion time. Does anybody else have suggestions?

Here are some papers which explain more about this type of parser:

"An Efficient Augmented-Context-Free Parsing Algorithm" -- Tomita 1987, 16 pages

"Scannerless Generalized-LR Parsing" -- Visser 1997, 54 pages

"Disambiguation Filters for Scannerless Generalized LR Parsers" -- Brand, Scheerder, Vinju, and Visser 2002, 15 pages