During the course of my MA, I developed a method of extracting phrase-structure grammars from the Werkgroep Informatica Hebrew Bible produced since 1977 by Prof. Dr. Eep Talstra and his group at the Free University of Amsterdam.
The WI Hebrew Bible is the Masoretic text with a syntactic analysis added. Dr. Talstra and his group have developed their own linguistic theory of Hebrew and have analyzed the Hebrew Bible accordingly.
At first sight, the analysis in the WI Hebrew Bible is somewhat coarse-grained. For example, there are only two levels of phrases rather than fully recursive structures. To make matters worse, the analysis is not a strict tree.
The grammars described below are a side product of my transformation of this analysis into more traditional syntax trees. It stands to reason that once you have an analysis of a piece of text in tree form, you can build a context-free phrase-structure grammar from that analysis. That is what I have done.
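To illustrate the general idea of reading a context-free grammar off a tree, here is a minimal sketch in Python. It is not my actual extraction program and it does not use the WI data format: the nested-tuple tree representation, the function name extract_rules, and the English toy sentence are all assumptions made for the example. Each internal node contributes one rule, with the node label on the left-hand side and the labels of its children on the right-hand side.

```python
from collections import Counter


def extract_rules(tree, rules=None):
    """Collect one rule (label -> child labels) per internal node of a tree.

    A tree is a tuple (label, child, child, ...); a leaf is a plain string.
    This representation is hypothetical, chosen only for the illustration.
    """
    if rules is None:
        rules = Counter()
    label, *children = tree
    # The right-hand side is the sequence of child labels (or the word itself).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules[(label, rhs)] += 1
    # Recurse into the non-leaf children.
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, rules)
    return rules


# Toy example tree (invented, not taken from the WI Hebrew Bible).
toy_tree = ("S",
            ("NP", ("Det", "the"), ("N", "king")),
            ("VP", ("V", "spoke")))

toy_rules = extract_rules(toy_tree)
for (lhs, rhs), count in sorted(toy_rules.items()):
    print(f"{lhs} -> {' '.join(rhs)}  (seen {count} time(s))")
```

Running the function over every tree in a corpus with the same Counter accumulates the full rule set together with rule frequencies. Whether rules that rewrite a category to a word are kept as grammar rules or moved to a lexicon is a design choice; the sketch simply keeps them.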
I have three categories of grammars:
Number of rules
If you plot the number of rules discovered as a function of how many words you have analyzed, a rather interesting pattern emerges.
It turns out that the resulting curves approximate root functions. That is, if you plot the data on doubly logarithmic paper, you get almost straight lines with a gradient less than 1.
This gives empirical evidence that the number of phrase-structure rules discovered after analyzing x words grows more slowly than x itself: each additional stretch of text contributes fewer new rules than the one before it.
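The gradient on doubly logarithmic paper can also be estimated numerically. If the number of rules y behaves like c * x^k, then log y = log c + k * log x, so an ordinary least-squares line fitted through the points (log x, log y) has slope k. The sketch below does this in plain Python. The function name loglog_slope is an assumption, and the data are synthetic, generated from an exact square-root curve purely to show the mechanics; they are not my measurements.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares fit of log10(y) = intercept + slope * log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data (assumption): rule counts following exactly 3 * sqrt(words).
word_counts = [100, 1_000, 10_000, 100_000]
rule_counts = [3 * math.sqrt(w) for w in word_counts]

slope, intercept = loglog_slope(word_counts, rule_counts)
print(f"gradient k = {slope:.3f}, constant c = {10 ** intercept:.3f}")
```

For real data the points will not lie exactly on a line, so the fitted gradient is only a summary of the trend. A value below 1 corresponds to the slow growth described above.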