Saturday, December 25, 2010

Infinite Recursion in Parser Generators

Well, I've stuck my foot in it again by doing something which doesn't make a lot of sense.

Skipping the details, I decided I needed to write a parser in PHP, and the language I'm designing is embedded in PHP - which has a complex syntax and . . . anyway, one thing led to another and I've ended up writing kind of a parser generator in PHP.

It's not really a parser generator - it's more like a programmable parser where the program is a grammar specification.

So, I broke out the Dragon book and started reading, built a programmable recursive descent parser framework object, a hand-coded parser for language grammars so I can program it, and a programmable lexical scanner - all in PHP, and it all works pretty well.

And then . . .

I couldn't solve my problem with it.

Why, you ask?

Well, the problem I have cannot be solved by a parse tree created from a right-recursive grammar - which is what the book says a recursive descent parser needs to process.

Why?

Because when a recursive descent parser hits a left recursive production (which is what I need for my problem) it goes into an infinitely deep recursion.

Why does it do that?

It's stupid.

It turns out that not only will simple productions like a : a TERMINAL ; create infinite recursions, but various well-hidden mutual recursions will as well.

So - having faith in the Book - I decided maybe I need something which handles left recursive grammars. So I read and read and thought and thought and - as usually happens - I got tired, went to bed, and woke up this morning with a realization:

"It's not the recursion dummy, it's because processing non-terminals don't eat tokens!!!!!"

If that doesn't mean much to you - that's OK. The rest of this post is a boring explanation of what's happening and how to fix it.

First of all - why isn't it obvious from the book? Because it's not in there, and here's why:
  1. The book defines a mathematical formalism to describe language structure and parsing
  2. Like good mathematicians, they then ignore the actual problem and get buried in the formalism. And then . . .
  3. They come up with ways to solve problems in the formalism using the programming techniques and computer constraints available at the time they were working
  4. The 2nd, 3rd, etc generation of 'students' become teachers and so they just teach the formalism in the computing context of the time of the original work
My Dragon book is copyright 1977. Torben Ægidius Mogensen's "Basics of Compiler Design" is copyright 2000 through 2010 [nicely written, by the way] and its syntax analysis is a rehash of the stuff in the Dragon book [to be fair, I didn't read it all, but this is true to within the margin of error inherent in a quick skim].

Believe it or not, things have changed.

The Apple II had only just appeared in 1977 (I got mine in 1979 or 1980) and it maxed out at a whopping 64 kilobytes of RAM [a kilobyte being 1,024 bytes]. The processor executed about one instruction every couple of microseconds. In other words, both memory and speed were very, very limited, so a lot of work went into algorithm design - at the expense of clarity and simplicity of code.

As a result, the compiler generators tended to avoid recursion ["function calls are expensive and take a lot of RAM"] in favor of memory- and speed-efficient algorithms, and the compiler generator section of the Dragon book is heavy into table-driven parsers using conventional, non-recursive, non-functional programming techniques.

And - finally getting to the point - they are so deep into the formalism and the computing environment that they never actually get around to "what causes infinite recursion in parsers".

Well, here's the answer: any algorithm which revisits the same non-terminal without consuming a terminal symbol will infinitely recurse.

Huh?

This highlights another problem in understanding compiler generation: the compiler-eze terminology stinks. It emphasizes the algorithms, not the problem we're trying to solve.

So, here's what the Parsing Problem is:

Given a string of characters, does it make sense in a specific, grammatical language?

OK - that's not specific enough to answer. So let's make it more concrete:

First we will define a bunch of things we will call words and symbols. A word will be a string of 'letters' without any spaces in it. In English we also allow hyphens, so 'cat-food' could be classified as a word. In PHP a word might be a reserved word - 'foreach' or 'if' - or something like that. Anyway, we decide how to find these things in a string of characters.

We're going to call the things we find 'tokens' and it's the job of the 'lexical analyzer' to eat the string of characters and spit out an ordered list of 'tokens'.
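
To make that concrete, here's a bare-bones sketch of a lexical analyzer in PHP. It isn't my programmable scanner - just the idea, under the simplifying assumption that a 'word' is a run of letters and any other non-space character is a one-character symbol:

<?php
// A toy lexical analyzer: eat a character string, spit out an ordered
// list of tokens.
function lex($input) {
    $tokens = array();
    // Runs of letters clump into words; any other non-space character
    // stands alone as a symbol.
    preg_match_all('/[A-Za-z]+|\S/', $input, $matches);
    foreach ($matches[0] as $text) {
        $type = ctype_alpha($text) ? 'word' : 'symbol';
        $tokens[] = array('type' => $type, 'text' => $text);
    }
    return $tokens;
}

print_r(lex('A AND C'));   // three 'word' tokens: 'A', 'AND', 'C'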

These tokens are what the Language Grammarians call 'terminals' or 'terminal symbols'.

I'd rather call them 'tokens' or 'words' because that puts the focus back on what they are in the language. The term 'terminal' puts the focus on the activity of the parser - which we haven't gotten to yet.

Now, you might try to build a grammar description using only 'tokens', but it would get pretty large pretty fast and it would be really limited.

So you need something else. You need things which represent parts of the language. For example, you might need something called a 'sentence' [starts with a capitalized word and ends in a terminating punctuation mark: . or ! or ?] and maybe a 'paragraph' and maybe . . . well you get the idea.

These things which represent parts of the language can be composed of tokens or other parts of languages. In fact, in order to be really useful, these parts need to be able to refer to themselves as part of their definition - that is 'be recursively defined'.

For example, let's say I have only four words: A, B, C, and D. I also have a couple of symbols, say AND and OR. That's my whole vocabulary.

Now let's say I want to construct sentences. I might say something like:
sentence : A AND B | A OR B | C | D ;

where I'm using ':' to mean 'defined as' and '|' as 'or it might be' and ';' for 'that's all folks'.

But this is kind of limiting. So let's say I want to build more sentences than I can list using only words and symbols.

word : A | B | C | D;
sentence : sentence AND sentence | sentence OR sentence | word ;

In compiler-eze, these parts of sentences are called 'non-terminals' - again, putting the emphasis on the process of parsing [the parser can't stop on a non-terminal] rather than on the structure of the language. I'm going to call them 'fragments'.
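
Since the whole point of my 'programmable parser' is that the program is a grammar specification, here's one hypothetical way the word/sentence grammar above might be written down as PHP data. It is not the format my framework actually uses - just a sketch to make the later examples concrete:

<?php
// Each fragment maps to a list of productions; each production is a list
// of fragments and tokens. Anything that isn't a key is a token.
$grammar = array(
    'word'     => array(array('A'), array('B'), array('C'), array('D')),
    'sentence' => array(
        array('sentence', 'AND', 'sentence'),
        array('sentence', 'OR',  'sentence'),
        array('word'),
    ),
);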

Now, there are two ways I can use a grammar:
  1. I can build sentences using it - which you do all the time: writing, speaking, creating programs, etc.
  2. I can transform strings of characters (or sounds) into sentences so I can understand them - this is called 'parsing'
Before we get to parsing, let's look at how we can use the grammar to create a sentence.

Let's say I want to build a sentence - but I really don't care what it means, only that it's grammatically correct.

I'll start with the fragment sentence. But this doesn't get me a string of characters. Grammars can only build sequences of 'fragments' and 'tokens'. Tokens are made up of sequences of characters - which is what I want - but 'fragments' aren't: they are made up of 'fragments' and 'tokens'.

So, in order to build a character string - or say something in the language - I have to get rid of all the 'fragments' so that I have a string of 'tokens' which I can (at least theoretically) feed to the un-lexical un-scanner which will produce a string of characters - which I can then print in a book.

So how do I proceed? (The arrow (->) means: take the sequence on the left, replace one of its 'fragments' with one of the alternatives from that 'fragment's' definition, and write the result on the right side of the arrow. Which is easier to do than to say.)

sentence -> sentence AND sentence -> word AND sentence -> A AND sentence -> A AND word -> A AND D

and now I'm done. I have 'produced' a sequence of 'tokens' [TERMINALS in compiler-eze]
which I can un-lexical analyze to produce a sequence of characters.

Now in compiler-eze, the alternatives on the right side of the definition of 'sentence' are called 'productions', because replacing a 'fragment' by one of them 'produces' something which is grammatically correct.
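
If you want to watch a grammar 'produce' sentences, here's a toy generator that walks the $grammar array sketched earlier (assuming it's still in scope). It's hypothetical helper code, not part of my parser, and it's depth-limited so the recursive 'sentence' alternatives can't expand forever:

// Keep replacing fragments with one of their productions until only
// tokens are left, then hand back the token list.
function generate($grammar, $symbol, $depth = 0) {
    if (!isset($grammar[$symbol])) {
        return array($symbol);              // a token: nothing left to replace
    }
    $productions = $grammar[$symbol];
    // Past a small depth, always take the last (non-recursive) alternative
    // so the expansion terminates.
    $production = ($depth > 3) ? end($productions)
                               : $productions[array_rand($productions)];
    $result = array();
    foreach ($production as $part) {
        $result = array_merge($result, generate($grammar, $part, $depth + 1));
    }
    return $result;
}

echo implode(' ', generate($grammar, 'sentence')), "\n";   // e.g. "A AND D"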


OK - this is pretty straightforward, if boring. So let's turn to 'parsing'. That is, given a string of characters, is it a grammatically correct sentence?

The mathematicians would say 'it's grammatically correct if (and only if) there is a sequence of replacement operations I can find using productions which will generate the sentence'. So - as they would have it - they have 'reduced' the problem of 'parsing' to finding a sequence of productions which will produce the sentence.

How do we do that? The Dragon book starts by analyzing algorithms, but let's take a different approach: let's look at what we do when 'parsing' a sentence somebody says or that we've just read.

What I think you do (or we do) is look over the sentence and divide it up into chunks which make some sort of sense. Like 'Joe ran through the forest'. Well, what's this about? 'Joe'. What did he do? 'ran'. Where did he do it? 'through the forest'. Stuff like that.

Let's formalize this procedure:

First we'll lexically analyze the sentence: for 'Joe ...' this amounts to classifying each word according to its possible uses:
  1. 'Joe' - is a noun and a name. It can be used as the subject or the object of a phrase
  2. 'ran' - is a verb. It can be used as a 'verb', as part of a predicate, as part of a compound verb, or as part of a phrase ['seen to run']
  3. etc
Then we start parsing by examining the first token: Joe. Some sentences start with a noun, so we put 'Joe' on the shelf and look at the next word to see if it fits with how sentences which start with nouns are constructed. etc.

The point is, we are scanning from left to right and trying sentence forms to see if they fit the noise we just heard or read. [Left to right, right to left, top to bottom - the direction doesn't matter so much as the fact that we're focusing on one word at a time in a consistent order.]

So, in parsing we have two scans going on:
  1. we are scanning the token stream
  2. we are also scanning across a production to see if these tokens fit into it
The 'parse' terminates when the token stream is exhausted and all the tokens have been stuffed into 'fragments' OR something won't fit into any fragment. This is controlled by the sequence of scans across productions. Each time we start scanning, we start with some 'fragment' definition and exhaustively try all of its productions to find a fit with the token stream - remember that we are scanning the stream left to right. So the only way to get into an infinite recursion is to find a production scan which does not terminate.

Scanning a production terminates on one of three ways:
  1. a segment of the token stream matches the entire production - then the production is accepted. Accepting means that we don't have to look at those tokens any more and we can make a record of the fragment we recognized. [in compiler-eze we then 'reduce' by replacing the production by its non-terminal in the non-terminal definition (again, emphasis on algorithm rather than process)]
  2. a token doesn't fit, in which case the production is rejected.
  3. the production can be empty - and so it's trivially satisfied. [I forgot this earlier and have to think some more about it. Golly! That's meat for another post on this topic]
So if - in our scan across the production - we never look at any 'tokens', we will never terminate the scan. How can this happen?

Here's an artificial example:

frag1 : frag2 | WORD1 ;
frag2 : frag1 | WORD2 ;

No matter what I scan, my production scan will first look for frag2 which will look for frag1 which will look for frag2 which will . . . and I will never examine a token, so I will never reach the end of the token stream.
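
Here's roughly what those two definitions turn into as naive recursive descent code (a sketch, not my framework). Notice that neither function examines a token before recursing:

<?php
// Calling frag1() immediately calls frag2(), which immediately calls
// frag1() again at the same position - until PHP runs out of stack.
function frag1($tokens, $pos) {
    if (frag2($tokens, $pos)) return true;                     // first alternative: frag2
    return isset($tokens[$pos]) && $tokens[$pos] == 'WORD1';   // second alternative: WORD1
}
function frag2($tokens, $pos) {
    if (frag1($tokens, $pos)) return true;                     // first alternative: frag1
    return isset($tokens[$pos]) && $tokens[$pos] == 'WORD2';   // second alternative: WORD2
}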

To go to a less artificial example, let's go back to my A, B, C, D language.

I'm given the sentence A AND C and I want to see if it can be produced by the grammar. I decide to 'run the grammar backward' to see if I can find a sequence of substitutions which works.

OK, I start by guessing it's a 'sentence', so I write down:

sentence

Now I say - 'what production might this be? Let's try the first one!', so I grab:

sentence AND sentence

Now you can look at the whole sentence and say 'Yep!!! It fits', but the computer will only look at what it's programmed to look at. So, let's say that I've programmed up a recursive descent parser, which works by defining a function for each 'fragment' which it calls when it sees its name in a production.

So my 'parser' will see 'sentence' and call the 'sentence' function which will then look at the first production and will see sentence and will call the 'sentence' function and . . .

And there you are - infinite recursion.
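
In code, that naive 'sentence' function looks something like this (just a sketch of the doomed approach, not my framework's code):

<?php
// The first production is 'sentence AND sentence', and its first part is
// 'sentence' itself - so sentence() calls sentence() before it has eaten
// a single token, and the call stack just keeps growing.
function sentence($tokens, $pos) {
    if (sentence($tokens, $pos)) {
        // . . . we never get this far, let alone to AND or the second sentence
    }
    return isset($tokens[$pos]) && in_array($tokens[$pos], array('A', 'B', 'C', 'D'));
}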

So we can't use a recursive descent parser. Right? Well, . . .

The recursion isn't caused by the parsing method, it's caused by any algorithm which attempts to match the same 'fragment' twice without recognizing and moving past a 'token'.

So infinite recursion in parsing results from designing an algorithm (any algorithm) which can cycle through a sequence of 'fragments' without ever recognizing (and using) a 'token'.

So, can I patch up a 'recursive descent parser' so that it handles 'left recursion' and other sources of infinite recursion?

Sure. I just have to keep track of my progress through the token stream and reject any production in which a 'fragment' occurs which I'm already in the (recursive) process of examining AND which is at the same place in the token stream as it was before. Again, this will be easier to code than to describe.
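
Here's the idea in miniature, using the $grammar-as-array sketch from earlier in the post. This isn't my framework's code (and it isn't the finished fix), but it shows the guard: remember which (fragment, position) pairs are already being examined and refuse to re-enter one.

<?php
// $active remembers which fragments are already being examined, and at
// which position in the token stream. Re-entering the same fragment at
// the same position means no token has been consumed - so reject instead
// of recursing forever.
function parse_fragment($grammar, $tokens, $name, $pos, &$active) {
    if (!isset($grammar[$name])) {
        // $name is a token: it matches only if it's the next token in the stream.
        return (isset($tokens[$pos]) && $tokens[$pos] == $name) ? $pos + 1 : false;
    }
    $key = $name . '@' . $pos;
    if (isset($active[$key])) {
        return false;                       // same fragment, same place: would never terminate
    }
    $active[$key] = true;
    $result = false;
    foreach ($grammar[$name] as $production) {
        $next = $pos;
        foreach ($production as $part) {
            $next = parse_fragment($grammar, $tokens, $part, $next, $active);
            if ($next === false) break;     // this production doesn't fit here
        }
        if ($next !== false) { $result = $next; break; }   // accepted: report how far we got
    }
    unset($active[$key]);
    return $result;
}

// The frag1/frag2 grammar from above now terminates instead of blowing the stack:
$g = array('frag1' => array(array('frag2'), array('WORD1')),
           'frag2' => array(array('frag1'), array('WORD2')));
$active = array();
var_dump(parse_fragment($g, array('WORD2'), 'frag1', 0, $active));   // int(1)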

I'll post a note when I've finished fixing this thing - in case you want to look at the code.

Mike


