[fonc] Left-most derivation with Earley Parsing

Loup Vaillant-David

2014-06-23 01:27:27 UTC

Hello,

I am currently trying to implement Earley Parsing. My ultimate goal
is to combine all the advantages of OMeta and Earley parsing:

- OMeta can handle some context-sensitive grammars.
- OMeta's prioritised choice have obvious semantics.
- Earley work on left-recursive grammars out of the box.
- Earley doesn't require explicit look-ahead.

One crucial aspect of this is the handling of ambiguity. Basically, I
want to chose a parse tree automatically, in an obvious way, while
letting the user have some control. For this, OMeta's prioritised
choice is beautiful.

And this is where I am stuck. I don't know how to turn my heap of
Earley states into a parse tree that makes sense. I could chose an
arbitrary parse tree of course, but that wouldn't be acceptable. I
want a *predictable* choice.

I *believe* I want to find the left-most derivations that is based on
prioritised choice. But I don't know how to get it.

---

Here is an example, so you can have a feel of my progress so far.
First, One sweet ambiguous grammar:

A -> A A
A -> 'b'

Second, an input: "bbb"

The derivation I want with this grammar is this one:

A => AA => AAA => bAA => bbA => bbb

Which yields the following parse tree:

A
/ \
A A
/ \ |
A A b
| |
b b

Note that there is another leftmost derivation:

A => AA => bA => bAA => bbA => bbb

Which yields a different parse tree:

A
/ \
A A
| / \
b A A
| |
b b

The difference between the two is how I derived the symbol 'A' in the
third step: I can chose 'AA', or I can chose 'b'. If I want to
respect prioritised choice, the first parse tree is the correct one.

Alas, I don't have direct access to those derivations, or to the
resulting parse trees. Instead, I have these Earley states:

=== 1 ===
1 (1) A -> • A A
2 (1) A -> • 'b'

=== 2 ===
1 (1) A -> 'b' • (1,2| b )
2 (1) A -> A • A (1,1|2,1)
3 (2) A -> • A A
4 (2) A -> • 'b'

=== 3 ===
1 (2) A -> 'b' • (2,4| b )
2 (1) A -> A A • (2,2|3,1)
3 (2) A -> A • A (2,3|3,1)
4 (1) A -> A • A (1,1|3,2)
5 (3) A -> • A A
6 (3) A -> • 'b'

=== 4 ===
1 (3) A -> 'b' • (3,6| b )
2 (2) A -> A A • (3,3|4,1)
3 (1) A -> A A • (3,4|4,1) (2,2|4,2)
4 (3) A -> A • A (3,5|4,1)
5 (2) A -> A • A (2,3|4,2)
6 (1) A -> A • A (1,1|4,3)
7 (4) A -> • A A
8 (4) A -> • 'b'

- The 1st column is just an index.
- The 2nd column is the starting point of the state.
- The 3rd column is the grammar rule (the dot indicates where we are).
- The 4th column is a list of back pointers the states that could have
triggered the creation of the current one (If there is more than
one set of pointers, this means ambiguity.)

I *think* I understand the tree structure behind those back pointers.
But I can't find the left-most derivation I want from there. I
basically have two problems:

- These are *back* pointers. Any ambiguity is revealed from the
*right* side of the rule. I don't want a rightmost derivation, I
want a leftmost.
- Some ambiguities are not readily solvable through prioritised
choice: sometimes, the Earley sets mentioned by the child
pointers don't differ by rule, only by how much input they
consumed.

I tried not working with the back pointers directly. It is possible
for instance to construct a forward chain from the backward one. Here
is what I got:

=== 1 ===
1 (1) A -> • A A (2,2|2,1) (3,4|3,2)
2 (1) A -> • 'b' (2,1| b )

=== 2 ===
1 (1) A -> 'b' •
2 (1) A -> A • A (3,2|3,1) (4,3|4,2)
3 (2) A -> • A A (3,3|3,1)
4 (2) A -> • 'b' (3,1| b )

=== 3 ===
1 (2) A -> 'b' •
2 (1) A -> A A •
3 (2) A -> A • A (4,2|4,1)
4 (1) A -> A • A (4,3|4,1)
5 (3) A -> • A A
6 (3) A -> • 'b' (4,1| b )

=== 4 ===
1 (3) A -> 'b' •
2 (2) A -> A A •
3 (1) A -> A A •
4 (3) A -> A • A
5 (2) A -> A • A
6 (1) A -> A • A
7 (4) A -> • A A
8 (4) A -> • 'b'

It doesn't make much sense. The child pointers point to end states,
not to start states. And there is also one too many ambiguity here.
So, I figured I may need something more like the SPPF format described
by Elizabeth Scott (2008). I don't fully understand this format
however, and the algorithms she described to generate it are still
impenetrable to me. Anyway, that wouldn't be enough for my leftmost
derivation.

Ian promised us something easy to implement in his "To Trap a Better
Mouse" talk¹, but the rabbit hole seems to be a bit deeper than that.

[1]:

---

Hence my question: how do we find a leftmost derivation with Earley
parsing? Is there a paper, or a tutorial somewhere?

Thanks,
Loup.