[fonc] Earley Parsing Explained (incomplete first draft)

Discussion:

Loup Vaillant-David

2014-09-18 21:11:13 UTC

Hi,

After spending months banging my head over Earley Parsing, I have
decided to write a tutorial. Ian once said Earley parsing is simple
and easy to implement. I agree with "simple", but not with "easy".
The required background knowledge is not trivial.

This tutorial is an attempt to gather this knowledge in one place.
Nothing fancy, no deep math, no proof of correctness. Just a
(hopefully) intuitive explanation of the concepts, needed to implement
Earley parsing. The goal is to help competent programmers who know
little about parsing to write their own Earley parsing framework.

So far, I have done most of the recogniser. The following pages are
"done", and up for review:

http://loup-vaillant.fr/tutorials/earley-parsing/
http://loup-vaillant.fr/tutorials/earley-parsing/what-and-why
http://loup-vaillant.fr/tutorials/earley-parsing/chart-parsing
http://loup-vaillant.fr/tutorials/earley-parsing/recogniser

Questions and criticisms are most welcome. I'd like to know about any
factual inaccuracy, poor wording, confusing explanation… please don't
hesitate to question anything, even the structure of this tutorial.

Enjoy, and thanks,
Loup.

Julian Leviston

2014-09-18 23:28:20 UTC

Permalink

Post by Loup Vaillant-David
Hi,
After spending months banging my head over Earley Parsing, I have
decided to write a tutorial. Ian once said Earley parsing is simple
and easy to implement. I agree with "simple", but not with "easy".
The required background knowledge is not trivial.

Easy is pretty subjective! What's easy to my a 6 year old is not going to be easy for a 50 year old, and vice versa.
Perhaps you mean approachable to someone of average intellect and cultural heritage?

Post by Loup Vaillant-David
This tutorial is an attempt to gather this knowledge in one place.
Nothing fancy, no deep math, no proof of correctness. Just a
(hopefully) intuitive explanation of the concepts, needed to implement
Earley parsing. The goal is to help competent programmers who know
little about parsing to write their own Earley parsing framework.
So far, I have done most of the recogniser. The following pages are
http://loup-vaillant.fr/tutorials/earley-parsing/
http://loup-vaillant.fr/tutorials/earley-parsing/what-and-why
http://loup-vaillant.fr/tutorials/earley-parsing/chart-parsing
http://loup-vaillant.fr/tutorials/earley-parsing/recogniser
Questions and criticisms are most welcome. I'd like to know about any
factual inaccuracy, poor wording, confusing explanation please don't
hesitate to question anything, even the structure of this tutorial.
Enjoy, and thanks,
Loup.

I really like your writing style!

One criticism I'd have is there aren't any pointers to people who don't have programming language construction backgrounds. So, perhaps adding a pre-requisite reading list at the beginning aimed at bringing someone who has finished high school and nothing else "up to speed"...

This is perhaps my general criticism of most technical work. The assumed knowledge is high, and it shouldn't be. As more and more people get involved in computing, this means there can be less and less assumed backgrounds. I wrote about good code and by extension good writing a while back here: http://www.genericoverlords.com/good_coding

Julian

http://www.getcontented.com.au/ - You Need GetContented - Get Your Website Happy. :)

Josh Grams

2014-09-20 10:58:14 UTC

Permalink

How's that for coincidence? I had just finally (on the 18th) got
around to watching Ian's "Trap a Better Mouse" talk and starting
to try it myself, and then saw that you posted this. I've done
some parsing before, so you haven't (yet) covered any of the parts
that I had trouble understanding. I *think* I've figured them out.
Now I have to finish implementing it and see if I'm right.

Mainly the recognizer. Did you figure that out? I think the
important insight is that matches added by scanning and completion
operations represent a (partial) derivation step. So your left
back-pointer gives the history of the rule that was advanced, and
your right back-pointer gives a rule which was applied just before
the cursor (but only if it points to a complete match).

So then you should just be able to do a depth-first traversal of
that tree? Start with matches for the start symbol at the first
position in the final Earley state. For a left-most derivation,
follow the left back-pointer first.

If you want the prioritised-choice version, then you need to make
a choice each time you have multiple pairs of back-pointers. Look
at the right back-pointer of each and choose the one with the rule
earliest in the priority chain.

Yes? No?

I've been doing mine in Javascript and keeping an annotated
version as HTML -- sort of a literate program. So when I get it
finished, maybe I'll post that as well...we'll see how it turns
out.

--Josh

Josh Grams

2014-09-20 11:52:35 UTC

Permalink

Post by Josh Grams
Mainly the recognizer. Did you figure that out? I think the
important insight is that matches added by scanning and completion
operations represent a (partial) derivation step.

Gah. Not scanning. What am I saying? Only completion. The completion
operation represents a rule being applied all the places where it
presently fits. Yes? So if we save those pointers we can just hook them
up starting from the root(s) to get the parse forest?

Sorry, I've only spent three or four hours looking at Earley parsing:
I'm not all the way up to speed yet.

--Josh

Loup Vaillant-David

2014-09-20 12:27:04 UTC

Permalink

Post by Josh Grams
How's that for coincidence? I had just finally (on the 18th) got
around to watching Ian's "Trap a Better Mouse" talk and starting
to try it myself, and then saw that you posted this. I've done
some parsing before, so you haven't (yet) covered any of the parts
that I had trouble understanding. I *think* I've figured them out.
Now I have to finish implementing it and see if I'm right.

In any case, they should come soon. First I'll finish the recogniser
(I have to correct the empty rules bug using Aycock & Horsepool
(that's easy), and optimise right recursive rules using Leo (haven't
figured that out yet).

Post by Josh Grams
Mainly the recognizer. Did you figure that out? I think the
important insight is that matches added by scanning and completion
operations represent a (partial) derivation step.

Yes, that was a big "click" moment for me. Seeing things like
simplify the whole model, but it's not obvious. That's why I included
a page on chart parsing.

Post by Josh Grams
So your left back-pointer gives the history of the rule that was
advanced, and your right back-pointer gives a rule which was applied
just before the cursor (but only if it points to a complete match).

Actually, you don't need the back pointers. Plain Earley items are
enough. Even better, you don't need all the items. You only need the
completed ones.

Post by Josh Grams
So then you should just be able to do a depth-first traversal of
that tree? Start with matches for the start symbol at the first
position in the final Earley state. For a left-most derivation,
follow the left back-pointer first.

This business with back pointers is needlessly complicated, I found.
Plus, the way Earley items are organised, the search begins from the
end of the rules, not from their beginning. This causes problems with
disambiguation.

The easiest strategy is the following: first, ignore any Earley item
that's not completed. Second, flip the completed states.

Flipping a completed state is easy: For each item in S(i) of the form
A -> α • (j), remove the item and place A -> α • (i) in S(j). You
basically flip the start and end positions: that way, instead of
storing the partial parses where they end, you store them where they
begin.

Josh Grams

2014-09-20 15:57:58 UTC

Permalink

Post by Loup Vaillant-David
In any case, they should come soon. First I'll finish the
recogniser (I have to correct the empty rules bug using Aycock &
Horsepool (that's easy), and optimise right recursive rules using
Leo (haven't figured that out yet).

Cool. I'll look forward to it.

Post by Loup Vaillant-David
Actually, you don't need the back pointers. Plain Earley items are
enough. Even better, you don't need all the items. You only need the
completed ones.

Sure, it's just a classic space/time tradeoff. The recognizer
knows how it is deriving each match. So you can either save that
information as back pointers, or you can reconstruct the
connections later by searching.

Post by Loup Vaillant-David
This business with back pointers is needlessly complicated, I found.

Hmm...did you try drawing out where they point to? For instance,
with the example you gave back in June, if you start with the
top-level match and follow the pointers, you get this:

Loading Image...

You can see that you don't care about the contents of the left
back-pointer -- it's just reversing through the rule. But they
form the structure which puts the right sub-trees in order.

Actually, you shouldn't even need to check whether matches are
complete: if only the completion operation adds back pointers,
then you know that the right side is a complete match and the left
side isn't.

Post by Loup Vaillant-David
A -> ?? ??? (i)
A -> ?? ??? (j)
Sometimes, the same rule can span different lengths. That's typical
1. Just apply the longest match rule.
2. Dig both rules in parallel, search for the "topmost, first"
difference (whatever that means), and apply prioritised choice
there.

Oh, right. Now that you mention it, I *have* seen examples of
practical grammars with that sort of ambiguity. I don't know that
there's any good solution to that. IIRC it tends to be "real"
ambiguity where the user (the person writing the source, not the
person writing the grammar) could mean either thing. So I suspect
you have to just go with the longest match, unless you want to
provide for context-sensitive disambiguation or something...

That's my wild-ass guess, anyway. :)

--Josh

Loup Vaillant-David

2014-09-20 17:59:32 UTC

Permalink

Post by Josh Grams

Post by Loup Vaillant-David
Actually, you don't need the back pointers. Plain Earley items are
enough. Even better, you don't need all the items. You only need the
completed ones.

Sure, it's just a classic space/time tradeoff.

Not exactly. My current guess is that not using back pointer yields
simpler code. Reconstruction is simpler than back pointers
management. Performance (memory or CPU) is secondary.

Post by Josh Grams

Post by Loup Vaillant-David
This business with back pointers is needlessly complicated, I found.

Hmm...did you try drawing out where they point to?

I did. It gave me much insight about the structure of ambiguity. And
I eventually found the main problem with back pointers: they point
*back*.

By default, Earley recognisers store their Earley items in such a way
that reconstruction (or back pointer following) happens from right to
left. Which means the first ambiguities will be detected at the *end*
of the parse.

In parsing expression grammars, it's very intuitive. Imagine you
parse this rule:

A -> B C D

So you parse B, then C, then D (they are written in that order, after
all). If you detect an ambiguity in B (or anything below it),
prioritised choice will take care of it. Now imagine there are two
ways to parse the following input:

_B_ _C_ _D_ -- first possibility
/ \ / \ / \
a b c d e f -- input
\_______/ | \___/
B C D -- second possibility

The parsing expression grammar will first solve the ambiguity at the B
level. Once B is parsed, there is only one possibility. Now with
Earley items, since you start from the end, you will first deal with
D, then with C—

Aha! there's an ambiguity! So you solve it at C's level, leaving only
one possibility for B. So, instead of doing the prioritised choice at
B, where a user would expect it, you do it at C, quite possibly
yielding a different choice.

---

So I knew I had to do something to start from the beginning, instead of
from the end.

With the back pointers, it meant transforming the back pointers into
forward pointers. Say you have Three items, A, B and C. B has a back
pointer to A, and that back pointer has a down pointer to C. You need
to add a forward pointer to A that point to B. And that forward
pointer needs a down pointer that points to C. So you go from:

A <--+--- B
V
C
To

A ---+--> B
V
C

And of course, those pointers are not unique, so you need to maintain
a *list* of those… So I just threw up my arms, and tried another
approach. I thought for a moment that we *needed* those back
pointers. But then I have read that paper about chart parsing, and
realise you could do everything with completed items alone.

This has 2 advantages:

- Completed Earley items are simpler than the others, and very easy
to reverse. (Unlike the back pointer stuff.)
- No need to modify the recogniser to have your parser. They are
completely independent, making them easier to learn.

---

TL;DR: ignore Elizabeth Scott's "SPPF-style Parsing for Earley
Recognisers". It's not useful when you only want to select one parse
tree.

Loup.

Josh Grams

2014-09-21 20:51:32 UTC

Permalink

Post by Loup Vaillant-David
By default, Earley recognisers store their Earley items in such a way
that reconstruction (or back pointer following) happens from right to
left. Which means the first ambiguities will be detected at the *end*
of the parse.

But a depth-first traversal turns that around.

Post by Loup Vaillant-David
In parsing expression grammars, it's very intuitive. Imagine you
A -> B C D
So you parse B, then C, then D (they are written in that order, after
all). If you detect an ambiguity in B (or anything below it),
prioritised choice will take care of it. Now imagine there are two
_B_ _C_ _D_ -- first possibility
/ \ / \ / \
a b c d e f -- input
\_______/ | \___/
B C D -- second possibility

The left back pointers move back through the rule. So a pre-order
traversal will go like this (I'm using tildes because it was
annoying to copy/paste the bullet):

A -> B C D ~ -- at this step you apply A => B C D.
A -> B C ~ D
A -> B ~ C D
A -> ~ B C D
No more left pointers, so return back up to A -> B ~ C D.
Follow its right pointer to B ... ambiguous! So choose between:

B -> a b ~
B -> a b c ~

Then return back up to A -> B C ~ D and follow its right pointer
to deal with C, etc.

--Josh

Loup Vaillant-David

2014-09-22 01:20:02 UTC

Permalink

Post by Josh Grams
The left back pointers move back through the rule. So a pre-order
traversal will go like this (I'm using tildes because it was
A -> B C D ~ -- at this step you apply A => B C D.
A -> B C ~ D
A -> B ~ C D
A -> ~ B C D
No more left pointers, so return back up to A -> B ~ C D.

Not quite. Assume this correspondence:

A -> B C D ~ Item1 in S(start)
A -> B C ~ D Item2 and 2_bis in S(i) and S(j)
A -> B ~ C D Item3 and 3_bis in S(k) and S(l)
A -> ~ B C D Item4 in S(finish)

Looking at the left back pointers, ambiguities look like this:

+--- Item2 <------- item3 <-------+
| |
Item1 <--+ +--- item4
| |
+--- Item2_bis <--- item3_bis <---+

(Note that ambiguities always merge back.)

The only way to detect the ambiguity from Item1 to Item2, is to
construct all the relevant right pointers first.

+---> Item2 -------> item3 -------+
| |
Item1 ---+ +--> item4
| |
+---> Item2_bis ---> item3_bis ---+

That way, you can detect your ambiguity from the start. So far, so
good. But there's a catch. Sometimes, Sometimes, you'll get
something like this:

+--- Item2 <------- item3
|
Item1 <--+
|
+--- Item2_bis <--- item3_bis

item3 and item3_bis are not in the same state set (if they were, they
would be the same Earley item). So, no real ambiguity there. But if
you naively reverse the pointers:

+---> Item2 -------> item3
|
Item1 ---+
|
+---> Item2_bis ---> item3_bis

You will see an ambiguity that is not there. Actually you need *two*
Item1. One that ends where item3 is, S(k) and one that ends where
item3_bis ends, S(l). Like this:

Item1 --------> Item2 -------> item3

Item1_bis ---> Item2_bis ---> item3_bis

Reversing pointers is not enough. You have to reorganise everything.
Doing this reorganisation properly is not trivial when you have to
deal with all the Earley items, and all the back pointers. Limiting
yourself to completed states, is such a breeze that it's probably
worth the harder search that will follow.

Loup.

Josh Grams

2014-09-22 10:28:33 UTC

Permalink

Ahhh. Now I see. Thanks. So the back-pointers are only really
useful if you want the whole forest. As I think you said earlier
and I didn't listen. ;)

--Josh