Masaru Tomita

An Efficient Context-free Parsing Algorithm for Natural Languages and its Application Degree Type: Ph.D. in Computer Science
Advisor(s): Jaime Carbonell
Graduated: May 1985

Abstract:

This thesis introduces an efficient context-free parsing algorithm and emphasizes its practical value in natural language processing. In the theoretical worst cast analysis, the parsing algorithm occasionally are very unlikely to appear in natural languages. As far as practical natural language processing is concerned, on the other hand, the parsing algorithm seems more efficient than any existing algorithms including Earley's algorithm. Experiments with several English grammars and sample sentences show that our algorithm is 5 to 10 times faster than Earley's standard algorithm.

The parsing algorithm can be viewed as an extended LR parsing algorithm which embodies the concept of a "graph-structured stack." Unlike the standard LR, the algorithm is capable of handling arbitrary non-cyclic context-free grammars including ambiguous grammars, with little loss of LR efficiency. In particular, if its grammar is "close" to LR, most of the LR parsing efficiency can be preserved. Natural language grammars are, fortunately, considerably "close" to LR, compared with other general context-free grammars.

The algorithm is an all-path parsing algorithm; it produces all possible parse trees a parse forest in an efficient representation called a "shared-packed forest." This thesis also shows that Earley's forest representation has a defect and his representation cannot be used in natural language processing.

The last chapters of the thesis suggest practical applications of the algorithm. A concept of left-to-right on-line parsing is introduced, taking advantage of the fact that our algorithm parses a sentence strictly from left to right. Several benefits of on-line parsing are described, and its application to user-friendly natural language interface is discussed. This thesis also proposes a technique to disambiguate a sentence out of the shared-packed forest representation by asking the user questions interactively. Finally, a personal/interactive machine translation system is suggested.