MA Thesis - Hebrew grammars

What's this?

During the course of my MA, I developed a method of extracting phrase-structure grammars from the Werkgroep Informatica (WI) Hebrew Bible, which has been produced since 1977 by Prof. Dr. Eep Talstra and his group at the Free University of Amsterdam.

The WI Hebrew Bible is the Masoretic text with a syntactic analysis. Dr. Talstra and his group have developed their own linguistic theory of Hebrew and have analyzed the Hebrew Bible accordingly.

At first sight, the analysis in the WI Hebrew Bible is somewhat coarse-grained. For example, it has only two levels of phrases rather than fully recursive structures. To make matters worse, the analysis is not a strict tree.

In my MA, I developed a method to transform the WI analysis into more traditional syntax trees. The syntax trees can be seen here.

A by-product of this transformation is the set of grammars described below. It stands to reason that once you have an analysis of a piece of text in tree form, you can build a context-free phrase-structure grammar from that analysis: every node with children yields one rule, with the node's label as the head and the labels of its children as the production. That is what I have done.
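To illustrate the general idea of reading a grammar off a tree, here is a minimal sketch in Python. It is not the code used in the thesis. The tree representation (nested `(label, children)` tuples), the function name `extract_rules`, and the toy tree with its labels are all assumptions made for this example and are not taken from the WI data.

```python
from collections import Counter

def extract_rules(tree, rules=None):
    """Collect one rule (head, production) per node that has children.

    A tree is a (label, children) tuple; a node with an empty child list
    is treated as a leaf category and contributes no rule of its own.
    """
    if rules is None:
        rules = Counter()
    label, children = tree
    if children:
        production = tuple(child[0] for child in children)
        rules[(label, production)] += 1
        for child in children:
            extract_rules(child, rules)
    return rules

# Hypothetical toy tree; the labels are illustrative only.
toy_tree = (
    "S",
    [
        ("PP", [("Prep", []), ("NP", [("N", [])])]),
        ("VP", [("V", [])]),
        ("NP", [("N", [])]),
    ],
)

rules = extract_rules(toy_tree)
for (head, production), count in sorted(rules.items()):
    print(f"{head} -> {' '.join(production)}   (seen {count}x)")
```

Running the same function over every tree of a text and merging the counters gives the grammar for that text, together with how often each rule was used.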

Grammars

Categories

I have three categories of grammars:

  1. Ones that distinguish phrases only by their phrase type.
  2. Ones that distinguish phrases by their phrase types and their functions.
  3. Ones that distinguish phrases by their phrase types and functions in the production part (right-hand side) of each rule, but by phrase type only in each rule head (left-hand side).
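The difference between the three categories can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: nodes are `(phrase_type, function, children)` tuples, the scheme names `"pt"`, `"pt_func"` and `"mixed"` are invented here, and the function labels `Pred`, `Subj` and `Obj` are hypothetical.

```python
def label(node, with_function):
    """Return 'PT' or 'PT-Func' for a (phrase_type, function, children) node."""
    phrase_type, function, _children = node
    if with_function and function:
        return f"{phrase_type}-{function}"
    return phrase_type

def rules_for(tree, scheme):
    """Extract the rule set of a tree under one of three labelling schemes.

    "pt"      : phrase type only, in head and production (category 1)
    "pt_func" : phrase type + function everywhere (category 2)
    "mixed"   : phrase type + function in the production,
                phrase type only in the head (category 3)
    """
    head_has_function = scheme == "pt_func"
    body_has_function = scheme in ("pt_func", "mixed")
    rules = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        children = node[2]
        if not children:
            continue  # leaf categories produce no rule
        head = label(node, head_has_function)
        production = tuple(label(child, body_has_function) for child in children)
        rules.add((head, production))
        stack.extend(children)
    return rules

# Hypothetical clause: a predicate followed by a subject and an object.
clause = (
    "Clause", None,
    [
        ("VP", "Pred", []),
        ("NP", "Subj", [("N", None, [])]),
        ("NP", "Obj", [("N", None, [])]),
    ],
)

pt_rules = rules_for(clause, "pt")
pt_func_rules = rules_for(clause, "pt_func")
mixed_rules = rules_for(clause, "mixed")
```

In this toy case the second scheme keeps `NP-Subj -> N` and `NP-Obj -> N` apart, while the third scheme merges them into a single `NP -> N` rule and still records the functions inside the clause-level production.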

Download

                 Only phrase type   PT + Func   PT + Func, PT in head
Genesis 1:1-3    download           download    download
Genesis 1        download           download    download
Genesis 1-50     download           download    download

Number of rules

If you plot the number of rules discovered as a function of how many words you have analyzed, a rather interesting pattern emerges.

It turns out that the curves approximate root functions. That is, if you plot the data on log-log axes, you get almost straight lines with a slope of less than 1.

This gives empirical evidence that the number of phrase-structure rules discovered after analyzing x words grows sublinearly with x: the more text you have already analyzed, the fewer new rules each additional word contributes.
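A straight line on log-log axes corresponds to a power law y = c * x^a, where the slope of the line is the exponent a. As a rough sketch of how such a slope can be estimated, the following Python code fits a least-squares line to the logarithms of the data. The function name `loglog_fit` is an assumption of this example, and the numbers are synthetic (generated from y = 3 * sqrt(x)); they are not the data from the thesis.

```python
import math

def loglog_fit(xs, ys):
    """Least-squares fit of log(y) = a * log(x) + log(c).

    Returns (a, c), i.e. the exponent and the scale of y = c * x**a.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((lx - mean_x) ** 2 for lx in log_x)
    s_xy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)

# Synthetic example: words analyzed vs. rules found, following 3 * sqrt(x).
words = [100, 400, 1600, 6400]
rule_counts = [3 * math.sqrt(w) for w in words]

exponent, scale = loglog_fit(words, rule_counts)
print(f"rules ~ {scale:.2f} * words^{exponent:.2f}")
```

An exponent below 1 is what the root-like shape of the curves amounts to; an exponent of exactly 0.5 would be a square-root function.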

The data used for the plots is available here, along with the gnuplot files (1, 2).