An Introduction to the Link Grammar Parser


1. The logic and notation of link grammars
1.1. The basic idea
1.2. Word rules
1.3. Global rules
1.4. Link grammar in relation to other systems
2. Using the parser
2.1. Running version 4.0
2.2. Using other dictionaries
2.3. The API
3. General features of the parser
3.1. Connector subscripts
3.2. Macros
3.3. Word files
3.4. Word subscripts
3.5. The cost system
3.6. Batch mode
3.7. Commands and variables
4. Special features of the dictionary
4.1. Capitalization
4.2. Hyphenated expressions
4.3. Number expressions
4.4. Unknown words
4.5. Punctuation and affix-files
4.6. The sorting of unknown strings
4.7. The wall(s)
4.8. Idioms
5. Coordinating conjunctions
5.1. The handling of conjunctions
5.2. Uses of conjunctions
5.3. Subscripts
5.4. Some problems
6. Post-processing
6.1. The logic of post-processing
6.2. The domain structure
6.3. Groups
6.4. Post-processing rules
6.5. Post-processing and conjunctions
6.6. Link-type matching in post-processing
6.7. The "!bad" command
7. Speed and robustness features
7.1. The null-link system
7.2. The link-length limit
7.3. The post-processing limit
7.4. The timer
7.5. Panic mode
[Back to Link Grammar front page]

1. The Logic and Notation of Link Grammars

1.1. THE BASIC IDEA. Think of words as blocks with connectors coming out. There are different types of connectors; connectors may also point to the right or to the left. A left-pointing connector connects with a right-pointing connector of the same type on another word. The two connectors together form a "link". Right-pointing connectors are labeled "+", left-pointing connectors are labeled "-". Words have rules about how their connectors can be connected up, that is, rules about what would constitute a valid use of that word. A valid sentence is one in which all the words present are used in a way which is valid according to their rules, and which also satisfies certain global rules.

1.2. WORD RULES. A simple dictionary entry would look like this:

	blah: A+;

This means that if the word "blah" is used in a sentence, it must form an "A" link with another word; that is, there must be another word to the right of it with an "A-" connector. Otherwise the sentence is not valid. The expression following the colon is the "linking requirement" for the word.

A word may have more than one connector that has to be connected. This would be notated as

	blah: A+ & B+;

A word may have a rule that either one of two (or one of several) connectors can be used, but exactly one must be used. In the dictionary, we notate this as

	blah: A+ or B-;

This means that if the word can make either an "A" link to the right, or a "B" link to the left, its use in the sentence is valid; but it must make one or the other, and it can not make both.

These rules can be combined. For example, consider the following notation:

	blah: A+ or (B- & C+);

This means that the word must make either an "A" link to the right, or a "B" link to the left and a "C" link to the right. No other combination will be valid.

Such expressions can be nested without limit, such as

	blah: (A+ or B-) & ((C- & A+ & (D- or E-)) or F+);

Some connectors are optional; this is notated with curly brackets. For example:

	blah: A+ & {B+};

This means the word must make an "A" link to the right, and it can make a "B" link to the right but does not have to. Curly brackets can also be put around complex expressions, like

	blah: (A+ or B+) & {C- & (D+ or E-)};

An equivalent way of writing an optional expression like "{X-}" is "(X- or ())". This can be useful, since it allows a cost to be put on the no-link option (see Section 3.5).

A word can also make an indefinite number of links of the same type to other words. For this, we use the "multi-connector" symbol "@". For instance, the word below could make any number of F links to words to the right (but is not required to make any).

	blah: (A+ or B+) & {C- & (D+ or E-)} & {@F+};

(If a word has "@A+", with no curly brackets, it is required to make at least one A+ link to the right; any others are optional.)

The ordering of elements in the "connector expression" is important. What that dictates is the relative closeness of the words that are being connected to. The further to the left the connector name, the closer the connection must be. For example,

	blah: A+ & B+;

This means that "blah" must make an "A" link to the right and a "B" link to the right, and the word it makes the "A" link with must be closer than the word it makes the "B" link with.

This only pertains, however, to connections in the same direction. For connectors pointing in opposite directions, the ordering is irrelevant. Therefore

	blah: A+ & B-;

means exactly the same thing as

	blah: B- & A+;

For that matter,

	blah: A- & B+ & C+ & D-;

means exactly the same thing as

	blah: B+ & C+ & A- & D-;

For "or" expressions, such as "A+ or B+", the ordering of the elements is irrelevant.

A dictionary entry thus consists of a word, followed by a colon, followed by a connector expression, followed by a semi-colon. The dictionary consists of a series of such entries. Any number of words can be put in a list, separated by spaces; they will then all possess the linking requirement that follows:

	blah blee blay: A+;

A connector name must consist of one or more capital letters (any number may be used), followed by "+" or "-".

We should mention one concept here that plays an important role in the internal workings of the parser: the "disjunct". A disjunct is a set of connector types that constitutes a legal use of a word. The dictionary expression for any word can be represented as a set of disjuncts. If a word has the following expression:

	blah: {C-} & (A+ or B+);
then it has the following four disjuncts:
        C- A+ 
        C- B+

These disjuncts represent all the legal uses of the word "blah". Using C- and A+ is a legal use of the word; using A+ and B+ is not. Disjuncts play an important role in the internal workings of the parser.

1.3. GLOBAL RULES. As well as these "word rules", which are specified in the dictionary, there are two other global rules which control how words can be connected.

First of all, links can not cross. For example, the following way of connecting these four words (connecting "cat" to "dog" and "horse" to "fish") would be illegal. The parser simply will not find such linkages.

                 +---- | -----+     |
                 |     |      |     |
		cat   horse  dog  fish

This is the "crossing-links" (or "planarity") rule. Secondly, all the words in a sentence must be indirectly connected to each other. Therefore the following way of connecting these four words would be illegal (if it was the entire linkage).

                 +-----+     +----+
                 |     |     |    |
		cat  horse  dog  fish

This is the "connectivity" rule. A valid sentence is therefore one which can be linked up in a way that a) all the words are used in a way that satisfies their linking requirements; and b) the crossing- links and connectivity rules are not violated.

1.4. LINK GRAMMAR IN RELATION TO OTHER SYSTEMS. The structure assigned to a sentence by a link grammar is rather unlike any other grammatical system that we know of (although it is certainly related to dependency grammar). Rather than thinking in terms of syntactic functions (like subject or object) or constituents (like "verb phrase"), one must think in terms of relationships between pairs of words. In the sentence below, for example, there is an "S" ("subject") relation between "dog" and "has"; a "PP" (past-participle) relationship between "has" and "gone"; and a "D" (determiner) relation between "the" and "dog". (Ignore the lower-case letters for the moment; they will be explained below.)

       |     +---A--+-Ss-+-PP-+
       |     |      |    |    |
      the black.a dog.n has gone 

Where possible, we try to give link-types names that have mnemonic significance in this way.

It may be seen, however, that parts of speech, syntactic functions, and constituents may be recovered from a link structure rather easily. For example, whatever word is on the left end of an "S" link is the subject of a clause (or the head word of the subject phrase); whatever is on the right end is the finite verb; whatever is on the left-end of a D link is a determiner; etc.. Moreover, all nouns, verbs, and adjectives in the dictionary are subscripted (as ".n", ".v", or ".a"--see section 3.4), so in these cases the syntactic category of the word is made explicit.

The constituent structure of sentences, while not absolutely explicit, is also quite "close to the surface" in linkage structures. This requires some explanation. Imagine a linkage as a graph through which paths can be traced, similar to a street map. Constituents can be defined as sets of words which can be reached from certain links, tracing in a certain direction. For example, a verb phrase is everything reachable from an "S" link, tracing to the right--that is, not tracing through the left end of the S link itself. In the sentence below, "have changed during the past 3 1/2 years" is a verb phrase, since all these words can be reached from an S link (between "addresses" and "have") going to the right. "Would be an administrative nightmare" is _not_ part of this verb phrase since these words cannot be reached from the S link in question, except by tracing through the left end of the link ("addresses"). For noun phrases there are several possibilities. Anything that can be reached from an O link tracing right is an NP, (such as "an administrative nightmare") as is anything reachable from a J link tracing right ("the two million customers whose addresses have changed during the past 3 1/2 years").

  |                                                      +--------------------
  +--------------Wd--------------+                       +---------------Opn--
  |   +------------Ds------------+                       |             +----DD
  |   |       +----G----+---AN---+----Ss---+--TH-+---Ce--+----K---+    |   +--
  |   |       |         |        |         |     |       |        |    |   |  
///// a Commonwealth Edison spokesman.n said.v that tracking.g down.e the two 

-------------+                                            +-----------Jp---
---+         |                                            |    +-----DD----
NN-+---Dmc---+---Mr---+--Dmcw--+----Sp---+---PP--+---MVp--+    +--L--+   +N
   |         |        |        |         |       |        |    |     |   | 
million customers.n whose addresses.n have.v changed.v during the past.a 3 

---------------+                                         |
-------+       |      +------------Ost------------+      |
-+     |       |      |   +-----------Ds----------+      |
N+-Dmc-+       +--Ix--+   |         +------A------+      |
 |     |       |      |   |         |             |      |
1/2 years.n would.v be.v an administrative.a nightmare.n . 

With version 4.0, we have incorporated a system for deriving a traditional constituent representation of a sentence from a linkage. Click here for more information about this; you can experiment with this system at the online demo page.

2. Using the Parser

2.1. RUNNING VERSION 4.0. To run version 4.0 of the parser on a Unix platform, you must first construct your own copy of the directory "link-4.0" that is on our FTP site. You will find a "Makefile", a "README" file and a handful of subdirectories. Compile the program by typing "make" while in the parse directory.

To actually run the program along with the dictionary (and related files) that we provide, type "./parse". The parser will then give you a prompt:


Type in a sentence, and press RETURN. The parser will tell you whether that sentence has a valid linkage or linkages, given the dictionary it is using, and it will output the linkage that it finds, showing the words that are linked together and the type of link between them. The parser will then prompt you for another sentence. To exit the parser, type "quit".

For each sentence,the parser will do an exhaustive search for linkages; it will generate all valid linkages. It will begin by displaying the lowest-cost linkage it finds. Other linkages may then be seen, one at a time, by pressing RETURN. (The ordering of the output is determined by the cost system: see Section 3.5.) If the parser cannot find any complete linkages of a sentence, it may simply stop and say "No complete linkages found"; or it may try to parse the sentence, ignoring some of the words. This depends on whether "null-links" are in use; see section 7.1.

2.2. USING OTHER DICTIONARIES. The more general way of running the parser is by typing

         ./parse [dictionary] -pp [post-process file] -c
         [constituent-knowledge file] -a [affix-file]

where "dictionary" is the name of whatever dictionary you wish to use. See section 6 for a discussion of post-process files, and section 4.5 for a discussion of affix files. The constituent-knowledge file is the post-processing file used in deriving the constituents; click here for information about this. The default files (the ones used if none are specified) are 4.0.knowledge, 4.0.constituents. and 4.0.affix.

If you wish to use no post-process file, thus by-passing post-processing altogether, type "./parse [dictionary] -ppoff". You might wish to do this in using "tiny.dict", for example, since this dictionary is not designed to require any post-processing. You can specify no affix file by typing "-aoff", and no constituent file by typing "-coff".

2.3. THE API. We have developed an API for designing customized versions of the parser and incorporating them into applications. It is important to realize that version 4.0 of the parser is just one way of configuring the API. In the following sections, we discuss many features of version 4.0 of the parser. Some of these features can be modified by editing the dictionary, as we explain here. Other features can be modified by configuring the API differently. For more about the API, see our API documentation.

3. General Features of the Parser

3.1. CONNECTOR SUBSCRIPTS. In general, a connector may only link to another one with the same name, i.e., the same string of capital letters. However, there is another way of controlling how connectors may link to each other, using connector subscripts. A subscript is a lower-case letter following a connector-name, like "Ss+". An "Ss+" connector can connect with an unspecified "S-" connector, or an "Ss-" connector, but not with an "Sp-" connector.

Connector types may have multiple subscript characters, such as "Spa+". An "Spa+" can connect with an "S-", an "Sp-", or an "Spa-", but not with an "Ss-" or an "Ssa-" or an "Spb-".

An "*" subscript type is a "wildcard" that can connect with anything. Therefore, an "S*+" is exactly the same as an "S+". An "S*a+" can connect with an "S-", an "Ss-", an "Sp-", or an "Ssa-", but not with an "Ssb-".

3.2. MACROS. It is possible to define a single symbol as a longer connector expression, and then use that symbol to refer to the longer expression in the dictionary. To do this, simply choose a name for the longer expression, and surround it with angle brackets (<>). Then treat it like a word in the dictionary; list the name, then a colon, then the connector expression that it should stand for. For example, we define "<noun-main-s>" in the dictionary as follows:

	<noun-main-s>: (Ss+ & <CLAUSE>) or SIs- or Js- or Os- or 
		({[Bsj+]} & Xd- & Xc+ & MX-);

We then use this symbol in many other actual word definitions.

We use many of these macros in the dictionary, to reduce redundancy; there are many connector expressions that are used over and over in longer expressions. Here are a few common ones:

<noun-main-...>: the "main" connectors for nouns, used to link them to the rest of the sentence (as subject, object, etc).

<noun-sub-...>: the "sub" connectors for nouns, used to link them to modifiers like prepositional phrases and relative clauses.

<verb-...>: These macros are for verbs; they distinguish different forms of the same verb. That is, they contain connector types - like S-, PP-, etc. - that distinguish different forms of the same verb. <verb-s> is for singular verbs, <verb-pp> for past participles, <verb-sp,pp> for forms which are both simple past and past participle, etc..

<vc-...>: These macros are for verb complements; they stand for different complement expressions. Some verbs can connect to a direct object, using O+; some can connect to an infinitive verb, using TO+; and so on.

3.3. WORD-FILES. The most basic way to write the dictionary is to list all the words in a particular category, followed by a colon, followed by their connector expression. There is another way, however. One can put all the words in a category in a file, choose a name for the file, and put that filename in the dictionary in place of the list of words. When listed in the dictionary, the filename must be preceded by a slash (/).

Here are the word files that are in use at the moment:
words.n.1 singular countable (i.e. not mass) nouns
words.n.2.s plural nouns ending in "s"
words.n.2.x plural nouns not ending in "s"
words.n.3 mass nouns
words.n.4 nouns that may be mass or countable
words.n.p proper names that are also ordinary words when not capitalized (see Section 3.1 for explanation)
words.n.t nouns that can be used as titles, like "president"

(In the following verb files, the final number indicates the verb form. ".1" is for infinitive-plural forms, ".2" is for singular forms, ".3" is for simple-past / past-participle forms, ".4" is for present participles, ".5" is for gerunds. On intransitive verbs, the present participle and gerund expression are combined into a single dictionary entry.)
words.v.1.(1-4) intransitive verbs
words.v.1.p special two-word passives ("lied_to_", "paid_for")
words.v.2.(1-5) optionally transitive verbs
words.v.4.(1-5) transitive verbs
words.v.5.(1-4) intransitive verbs that may form two-word verbs with particles like "up" and "out"
words.v.6.(1-5) optionally transitive verbs that may form two-word verbs
words.v.8.(1-5) transitive verbs that may form two-word verbs
words.v.10.(1-4) verbs that may be used in quotation expressions, like "said" ("John is here, he said").
words.adj.1 ordinary adjectives, with no special complements
words.adj.2 ordinary comparative adjectives (e.g. "bigger")
words.adj.3 ordinary superlative adjectives (e.g. "biggest")
words.adv.1 ordinary manner adverbs ("quickly", "angrily")
words.adv.2 ordinary clausal adverbs ("fortunately")
words.adv.3 adverbs like "chemically"
words.y common year numbers ("1990", etc.)
words.s US state names and abbreviations

3.4. WORD SUBSCRIPTS. A single word can be given several different dictionary entries. To do this, the entries must be distinguished by giving the words different subscripts. Words may be followed by a subscript such as ".n". For example:

	run.n: A+ or B+...
	run.v: C+ or D+...

If a word is listed more than once with the same subscript, or if it listed once with a subscript and once without, the parser will generate a warning message and will ignore one of the entries.

The parser starts at the right end of every string of characters. Any sequence of letters to the right of the right-most period in the string will be considered the subscript. (Periods at the end of a string are simply considered part of the string; see Section 4.5 below.)

In searching for linkages, the parser will consider each entry for the word as a different word, and will generate all linkages found for all entries. The subscript is shown in the display, thus indicating which entry the parser chose for a particular linkage.

The main word subscripts we use are ".n" for nouns, ".v" for verbs, and ".a" for adjectives. All nouns, verbs, and adjectives are subscripted in this way. Certain other subscripts are used only when needed to distinguish two forms of the same word: ".e" for adverb ".p" for preposition, ".s" for singular, ".p" for plural, ".t" for title.

3.5. THE COST SYSTEM. We have a system for assigning a cost to a linkage. This allows the parser to express preferences among the linkages it finds. The cost system uses square brackets ("[" and "]"). If a connector, or a series of connectors, is surrounded by square brackets, it is assigned a cost. The amount of cost is equal to the number of square brackets on each side: [A+] will receive a cost of 1; [[A+]] will receive a cost of 2; etc.. The parser uses this cost as a criterion for deciding which linkage to output first; it outputs them in order of cost (i.e., lowest cost first).

At the moment, connectors with a cost of 0, 1 or 2 are considered in normal parsing; connectors with a cost of 3 are considered only in "panic mode". See Section 7.5.

Given several linkages of the same cost level, the parser has certain heuristics for choosing the best parse, i.e., the one to output first. It prefers the linkage in which the total length of the links is lowest; and in sentences with conjunctions, it prefers a linkage where the lengths of the conjoined word-lists are similar (see section 5). This information is indicated in the cost vector shown above the linkage:

  Unique linkage, cost vector = (UNUSED=1 DIS=0 AND=0 LEN=1)

"DIS" is the connector cost or disjunct cost for the linkage (the "[]" system explained above); "AND" is the difference in length between and-list elements; and "LEN" is the total length of all links in the sentence (minus the number of words--since the total link length is never less than the number of words). "UNUSED" indicates the number of null-links; see section 7.1.

3.6. BATCH-MODE. It is possible to make a file of sentences, and then run them through the parser all at once. Simply create a file, with one sentence on each line. When running the parser, type

	parse [dictionary name] < [filename]

When inputting a file of sentences, it is sometimes useful to use another feature known as "batch-mode". In this mode the user indicates in the input file which sentences should be rejected, and which should be accepted; the parser then notifies the user when its judgments differ. To say that the parser "rejects" a sentence means that it does not find any linkages without null-links. (Batch-mode does not allow null-links; the "!batch" command automatically turns null-links off. See section 7.1 for an explanation of null-links.) To prepare a file for use in batch-mode, put "!batch" at the top of the file. Then, for each sentence that should be rejected, precede the sentence with "*" in the file. If it should be accepted, precede it with nothing. After processing a batch file, the parser will then print the number of errors in the file: i.e., the number of sentences on which its judgments differ with the judgments indicated by the symbols.

The symbol ":" can also be used at the beginning of a line. This means "linkage should be found with non-zero cost". If a linkage is found with zero cost, or no linkage is found, an error will be reported. (An absence of symbol means "linkage should be found, either with cost or without".)

When running batch-mode, the parser will ordinarily output only the number of errors it makes. Thus it is primarily useful for checking sentences, to see if a particular dictionary produce the desired results (or to make sure that no earlier work has been broken by recent changes). If one begins the batch file with the command "!echo", the parser will also output the sentences as it parses them, as well as display information for the sentences on which its judgments disagree with the user's.

Our FTP directory contains a large batchfile entitled "4.0.batch". This file contains about 900 sentences which systematically test many features of the parser. Many sentences are incorrect, and are marked with "*". With version 4.0, this file should parse with 0 errors.

3.7. COMMANDS AND VARIABLES. It is possible to modify the running of the parser in various ways, while running it, by typing in certain commands. The basic commands can be seen by typing "!help". Others are listed under "!variables". Many of these are self-explanatory. For example, "!width" changes the width of the parser display. Other commands relate to speed and robstness features; see section 7.

A few commands deserve special mention. One useful command is "!![word]". This queries the parser for information about a particular word. The parser will output list any entries of the word, with their word subscripts, the word-files in which they appear, if any, and the number of disjuncts on each word. (A disjunct is a combination of connectors which constitutes a legal use of the word.) Multiple entries of a word will be listed with their word subscripts.

The "!verbosity" command controls the amount of information that is displayed. With "!verbosity=1" (the default), information such as the following is shown:

linkparser> the quick brown fox jumped over the lazy dog
++++Time                                          0.04 seconds (0.04 total)
Found 2 linkages (2 had no P.P. violations)
  Linkage 1, cost vector = (UNUSED=0 DIS=0 AND=0 LEN=18)

 +---------Ds---------+              +-------Js------+
 |     +-------A------+              |   +-----Ds----+
 |     |       +---A--+---Ss--+--MVp-+   |     +--A--+
 |     |       |      |       |      |   |     |     |
the quick.a brown.a fox.n jumped.v over the lazy.a dog.n

Press RETURN for the next linkage.

With "verbosity=0", no information is shown except for the graphic linkage display. With verbosity set at 2 or 3, information is shown about the individual stages of parsing the sentence. (Infomation is also shown about the constituent derivation process, if this is being done.) If one wants to suppress the graphic display as well, this can be done with the command "!graphics". (This can be useful it one wants to have only the constituent bracketing as output; in that case, type "!verbosity=0", "!graphics", and "!constituents=1 (or 2)".

4. Special Features of the Dictionary

4.1. CAPITALIZATION. The parser respects capitalization: that is, the use of upper- and lower-case letters. If a string is listed in the dictionary beginning with a capital letter, then a word that is inputted will only match it if it has the same capitalization. (The same with strings with capital letters in the middle, although this is probably of little use.) However, there are a few special cases here.

There is a general category in the dictionary called "CAPITALIZED_WORDS". This is the default category for words whose first character is capitalized. Any such word which is inputted which is not explicitly listed in another category will be assigned to this category. This is of course useful, since most capitalized words are names which are grammatically all the same.

A special situation occurs with words at the beginning of the sentence. If a sentence-initial word has an uncapitalized first letter, it is treated in the normal manner. If it is capitalized, the parser will first look to see if it is listed in the dictionary as a either a capitalized word or an uncapitalized word. If not, it will then assign it to the generic "CAPITALIZED_WORDS" category. (If the word is listed both as a capitalized word and an uncapitalized one, the parser will try to use it in both ways. Because there are certain words which are also common names, like "Will" and "Rob", we have created a special category for them, so that when they are used sentence-initially, they will be recognized as possible names.)

The situation at the beginning of the sentence also applies after a colon. Sometimes, after a colon, the following word is capitalized as if it was the beginning of a sentence; the parser recognizes this. So, for example, the following sentence is accepted: "The problem is this: The dog ran."

4.2. HYPHENATED EXPRESSIONS. The dictionary also contains a special category called "HYPHENATED_WORDS". If a string contains a hyphen, and it is not listed in the dictionary, the parser will assign it to the category "HYPHENATED_WORDS". This is, again, useful, since hyphenated words are used somewhat "productively", and it would be very difficult to list them all.

4.3. NUMBER EXPRESSIONS. The dictionary contains a category "NUMBERS". Any numerical expression -- that is, a string consisting entirely of numerical characters -- will be assigned to this category unless it is explicitly listed elsewhere in the dictionary. (The string may also contain a period, i.e. a decimal point, or a comma, as in "3,287". It may also contain a colon; thus time expressions like "4:30" are treated as generic numbers.)

4.4. UNKNOWN WORDS. The dictionary also permits a feature known as "unknown words". A category can be defined using the string "UNKNOWN-WORD.x", where x is any subscript. If a word beginning with a lower-case letter is typed in that is not recognized, it will be assigned to that category. The word is then displayed with a question-mark in brackets, like "blah" below:

	  |    +---D--+--Ss--+-Pp+  
	  |    |      |      |   |  
	///// the blah[?].n is here 

Several different unknown word categories may be generated, labeled with different subscripts: for example, corresponding to nouns, verbs, and adjectives and adverbs. (These are the four categories we use, labeled .n, .v, .a, and .e, respectively.) The parser will search for all linkages that can be found using each entry. If it only finds a linkage for the "noun" category, then the output will show the unknown word labeled ".n": in effect, the parser is then guessing that the word is a noun.

Version 4.0 of the parser has an new feature for handling unknown words, known as "morpho-guessing". This is a system for guessing the syntactic category of an unknown word (that is, a word not explicitly listed in the dictionary) based on its spelling. Words that end in "-s" are assumed to be plural nouns or singular verbs; these are assigned to a category listed as "S-WORDS" in the dictionary. Similarly, words ending in "-ed" are assumed past-tense (or passive) verbs; those ending in "-ing", present participles; those ending in "-ly", adjectives. This greatly improves the ability of the parser to handle sentences containing multiple unknown words. Words that have been treated in this way are marked with a "[!]". Click here for more information about this new feature.

4.5. PUNCTUATION AND AFFIX-FILES. The parser is capable of handling a variety of punctuation symbols. There are two issues to be discussed here. One is the listing of symbols in the dictionary; the other is the way they are "read" by the parser when they are used in sentences.

Punctuation symbols can be listed in the dictionary just like words, and given ordinary linkage expressions. The same is true for strings containing multiple punctuation symbols or a mixture of letters and punctuation. The problem here is that certain punctuation symbols are also used as the "syntax" of the dictionary: colons, semi-colons, ampersands, etc.. Our solution to this is as follows: when listing these special characters, or a string containing them, one must put them in quotation marks:

	";": A+ or B-;
	"+": C+ or D-;

(The special characters that must be treated this way are precisely those which are used in the dictionary in a "syntactic" way: "(", ")", "{", "}", "[", "]", "@", "%", "&", "*", "+", "-", "/", "<", ">".)

When punctuation symbols are used in sentences, they will be used in linkages according to the connector expressions listed in the dictionary, in the normal way. There is a difference, however. It may be noted that although many punctuation symbols are similar to words in the ways they are used, they are often not separated from preceding or following words by spaces. In order for these symbols to be recognized as separate units, then, they must be "stripped off": that is, a space must be inserted between the symbol and the adjacent word.

With version 4.0 we provide a new system for handling this "stripping off" process. Rather than being hard-wired into the system, it can now be easily modified and expanded by the user, using an "affix file" which is read in by the parser when the program is run. An affix file is similar in format to a link grammar dictionary, with lists of strings (they need not be individual symbols, but may be any length), followed by a connector type. Strings to be stripped off from the left are listed under the category "LSTRIP+"; those to be stripped off from the right are listed under the category "RSTRIP+". In the affix-file that we provide, for example, "$" is listed as a "LSTRIP+" symbol; "," is listed as a "RSTRIP+" symbol. (Click here for more information about affix files.)

In the default affix file we provide (4.0.affix), the "RSTRIP+" strings include these:

, . ! ? % ) : ; ' 's 're 'll 've 'd

The "LSTRIP+" strings include these:

$ (

Note that we also use the affix file for a few things besides punctuation. The "right-stripped" list includes strings like "'s", representing a possessive ("John's") or a contraction ("he's"); strings like "'re" (as in "you're") are also included, representing the fact that these should really be treated as independent words. A further point: the parser will perform this stripping process on a string repeatedly, if necessary, perhaps stripping off several characters. For example, it will convert the first sentence below into the second:

John, a professor (who got a raise of 5%), is here

John , a professor ( who got a raise of 5 % ) , is here

There is one difference between right-stripped and left-stripped characters. Right-stripped characters can also be used at the right end of a string defined in the dictionary; for example, one could actually define "Mrs." or "it's" as ordinary dictionary entries. Before stripping off right-stripped characters, the parser will make sure that the string is not present in the dictionary as it appears.

If punctuation symbols are used in the middle of strings, they will be treated just like any other symbols. (And if they are used in sentences in undefined strings, they will be treated as some kind of unknown string.) (Note that periods are special, however; if the string ends with a period followed by a letter, this will be interpreted as the word subscript, not part of the string. See section 3.4 above.)

One exceptional case is quotation marks. Quotation marks may not be defined in the dictionary; and they are simply ignored when they are used in sentences. This is sufficient to handle most uses of quotes; generally, the presence of quotes does not affect the well-formedness of sentences, and it is often only subtlely affects meaning. However there are a few constructions, such as the last pair of sentences below, which seem to be only correct when quotes are included.

"John is leaving," she said
John is leaving, she said
I am meeting with my "advisor" today
I am meeting with my advisor today
She said, "John is leaving".
?She said, John is leaving.

We are unable to control such usages at the moment.

The percent sign ("%") is used for comments. If a % appears at the beginning of a sentence inputted to the parser, that sentence will simply be ignored. This is useful if one is inputting a file of sentences, and one wants to include some comments in the file. A % at the beginning of a line can also be used for putting comments in a dictionary.

4.6. THE SORTING OF UNKNOWN STRINGS. Notice that the parser must make decisions about how to handle strings of characters which are not explicitly listed in the dictionary. The ordering of the decisions is important. At the moment, the parser proceeds as follows:

  1. If the string is the first string in a sentence, and its first letter is capitalized, then generate the lower-case-first-letter form of the string, and perform the following steps on both the capitalized and lower-case forms.
  2. Strip off any "strippable" strings from the left end of the string (as described above). Perform subsequent steps on both the stripped string and the remainder.
  3. Look up the string in the dictionary as is. If there is an exactly matching string in the dictionary, use that.
  4. Strip off any "strippable" punctuation string from the right end of the string (as described above). Repeat step 3 on both the punctuation string and the remaining string. If there is more than one such string, repeat steps 3 and 4 repeatedly.
  5. If the string begins with a capital letter, assign it to the category "CAPITALIZED-WORD".
  6. If the string contains a hyphen, assign to the category "HYPHENATED-WORDS".
  7. If the string consists entirely of numbers, assign it to the category "NUMBERS".
  8. If the string still has not been assigned, see if it matches any of the "guessed-word" categories (S-WORD, etc.); if so, assign it to this category.
  9. If the string still has not been assigned, assign it to any "UNKNOWN-WORD" categories.
  10. If none of these alternatives are available, the parser will say "the following words are not in the dictionary: [whatever]", and will give no other output for the sentence.

If at any point the parser finds more than one way of analyzing a string, it generates both of them, and tries parsing the sentence with both forms of the string. This might happen a) if there are multiple forms of the word in the dictionary (with different subscripts); b) if the string is capitalized and occurs at the beginning of the sentence, and both the capitalized and lower-case forms are listed in the dictionary; or c) if there is more than one UNKNOWN-WORD category.

Any of the generic word categories - UNKNOWN-WORD (any or all of them), S-WORD (or the other "guessed word" categories), CAPITALIZED-WORD, HYPHENATED-WORD, or NUMBERS - may be omitted from the dictionary. In this case, the parser will skip over the corresponding step in the procedure above. For example, if "HYPHENATED-WORD" is omitted from the dictionary, and an unknown hyphenated word is entered, it will be treated like an UNKNOWN-WORD. If no UNKNOWN-WORD categories are present, and a string is entered which fits none of the other categories, the parser will proceed to step 9 above.

4.7. THE WALL(S). It proved to be useful to imagine that there was a dummy word at the beginning of every sentence. We call this "the wall". The wall has a linking requirement like any other word; it is listed in the dictionary under "LEFT-WALL". If this entry is included in the dictionary, the wall will be automatically inserted at the beginning of every sentence. Because of the connectivity rule, it is then necessary for the wall to be linked to the rest of the sentence in order for the sentence to be valid.

There is also a "right-hand wall", which is similar to the original wall at the other hand of the sentence. This is only needed for certain punctuation phenomena. In most sentences, we use a special "RW" connector to simply connect the left hand wall to the right hand one. The right-wall's dictionary entry is "RIGHT-WALL". (Since the left-wall is much more important than the right-wall, we often refer to the left-wall simply as "the wall".)

In most sentences, the left-wall connects to the sentence with a "Wd" link, and the right-wall connects to the left-wall with "RW". When only these connectors on the walls are being used, they are not displayed in the linkage diagram. When other connectors on the walls are being used, instead or as well, the walls are shown. (For example, the left-wall is shown in questions and imperatives.) To make it so that the walls are _always_ shown, type "!walls".

Either wall can be deactivated by simply removing the "LEFT-WALL" or "RIGHT-WALL" entry from the dictionary.

4.8. IDIOMS. A string of words can be defined as a single dictionary entry. To do this, simply join the words together with underbars:

	a_la_mode: A+ or B-;

Most idioms can be interpreted either as a single "idiom" or as a string of words (for example, "in question"). In this case, the parser will find all linkages with both interpretations.

In reading idiomatic strings from the dictionary, the parser breaks them up into individual words and assigns them "dummy" link-types which simply link the words of the idiom together in series. These link-types are assigned four-letter names of the form ID[X][Y], where X and Y are arbitrary letters.

Idioms cannot be given subscripts; if "a_la_mode.a" is included in the dictionary, this will not be accepted. However, an idiom can be listed in the dictionary more than once, without subscripts.

5. Coordinating Conjunctions

Coordination constructions do not fit naturally into the framework of link grammars. We have devised a method for automatically transforming the given link grammar into another one that captures the desired phenomena. (It involves internally generating special links for use with conjunctions, which we call "fat links".) This system is hard-wired in, and cannot easily be modified by the user. However, it has proven to be effective in handling the vast majority of uses of conjunctions. Our discussion will focus on the word "and", although the ideas apply to the use of "or", "but", "either-or", "neither-nor", "both-and", and "not only - but".

5.1. THE HANDLING OF CONJUNCTIONS. We begin by proposing a simple definition of the use of "and" within the framework of link grammars. Then we'll mention a few problems with the definition, and suggest an improvement. The second definition is the one used in our system. It has drawbacks, but on balance it has proven to be remarkably effective.

Given a sequence S of words containing the word "and", a "well-formed 'and' list" L is a contiguous subsequence of words satisfying these conditions:

1. There exists a way to divide L into components (called "elements" of the well-formed "and" list) such that each element is separated from its neighboring elements by either a comma or the word "and" (or both). (The comma and the "and" are not part of the element.) The last pair of elements must be separated by "and" (or a comma followed by "and"). For example, in "The dog, cat, and mouse ran", "dog", "cat", and "mouse" are the elements of the well-formed "and" list "dog, cat, and mouse".

2. Each of the sequences of words obtained by replacing L (in S) by one of the elements of L is a sentence of the link grammar.

3. There is a way of choosing a linkage for each of these sentences such that the set of links outside of the "and" list elements are exactly the same in all of the sentences, and the connectors joining the sentence with its "and" list element are the same. In other words, if we cut the links connecting the element to the rest of the sentence, remove that element from the sentence, and replace it by one of the other elements, then the cut links can be connected to the element so as to create a valid linkage.

The sequence S is grammatical if each instance of "and" in it is part of a well-formed and list.

For example, consider the sentence "We ate popcorn and watched movies on TV for three days." The the phrase "ate popcorn and watched movies on TV" is a well-formed "and" list because it can be divided into elements "ate popcorn" and "watched movies on TV ", which satisfy all of the conditions above. The following two linkages show this. Note that in both cases the element is attached to the rest of the sentence with an "S" to the left and an "MV" to the right.

  +-Wd-+Sp+--Os--+                                  |     +--Dmc-+     
  |    |  |      |                                  |     |      |     
///// we ate popcorn.n                            for.p three days.n  

  +-Wd-+----------Sp----------+---Op---+--Mp-+Js+   |     +--Dmc-+  
  |    |                      |        |     |  |   |     |      |  
///// we                   watched movies.n on TV for.p three days.n

There is a major problem with this definition. It contains no requirement that the words of an element of an "and" list be connected to each other, nor be related in any way (other than being contiguous). This allows many clearly ungrammatical sentences to be accepted, and generates numerous spurious linkages of correct sentences. For example, it would imply that "I like the beer John and wine Harry drank" is a valid sentence.

We have two techniques to limit the set of sentences deemed grammatical by this rule. The first is to simply restrict the types of connectors that can connect the element of the "and" list to the rest of the sentence. The list of connectors allowed to do this is contained in the list "ANDABLE-CONNECTORS" in the dictionary. (If a connector type is included in this list, this means, in effect, that several of them may be joined to a connector of the opposite type. So, including "S-" on this list allows "John ran and skipped".) See Section 5.2 for further discussion of "andable connectors".

The second method is to restrict the definition of a well-formed "and" list. Say that a well-formed "and" list is a "strict and list" if it also satisfies the following condition: Each element must be connected to the rest of the sentence through exactly one of its words. (It may use many connectors.)

This is the system that we have implemented. This logic of dealing with conjunctions is reflected in the parser's output. A sentence with conjunctions is outputted showing the sentence split up into several sub-sentences:

          +--------S----------+-MV+  +-D-+
          |                   |   |  |   |
        John, Dave, and Fred ran in the park

               +---S----------+-MV+  +-D-+
               |              |   |  |   |
        John, Dave, and Fred ran in the park

                          +S--+-MV+  +-D-+
                          |   |   |  |   |
        John, Dave, and Fred ran in the park

Alternatively, the conjunctive linkages can be merged into one in the display. To use this mode, type "!union". For the above sentence, "!union" mode yields this:

        |      +-------Ss------+    +---Js--+
        |      |          +-Ss-+-MVp+  +-Ds-+
        |      |          |    |    |  |    |
      John , Dave , and Fred ran.v in the park.n 
In some cases, "!union" mode may result in "crossing" links - the one situation where this is possible:
       |    +-------Ss------+
       +-Ds-+         +--Ss-+
       |    |         |     |
      the cat.n and dog.n ran.v 

Of course, several conjunctions may occur in a sentence: "John and Fred ran and jumped". In such cases, sub-linkages will be generated for each combination of and-list elements: "John ran, "John jumped", "Fred ran", and "Fred jumped". Nested "and" structures are also allowed, like "The people and their sons and daughters were there". In sentences containing several conjunctions, a large number of sub-linkages may be generated for a single linkage. For this reason, it may be preferable to use the "!union" display.

Conjunctions are a frequent source of ambiguity. For example, in the sentence "Several big cats and dogs with sharp teeth chased me", "several" may or may not apply to "dogs" (as a plural noun, "dogs" does not require a determiner); "big" may or may not apply to dogs; and "with sharp teeth" may or may not apply to cats. Linkages for all of these possibilities will of course be generated.

5.2. USES OF CONJUNCTIONS. The implementation of conjunctions is "hard-wired" in, and cannot be easily modified. However, it covers the vast majority of uses of coordinating conjunctions. First of all, it allows a wide variety of connector types to be used with conjunctions. As mentioned above, this is controlled by the connectors listed in the "ANDABLE-CONNECTORS" list in the dictionary. It can be seen that most common connector types (both "+" and "-" forms) are included on the list, permitting a variety of conjunctive expressions. Here are examples of some of the more commonly used "andable-connectors":

John and Fred ran (S+)
John ran and jumped (S-)
I saw Sue and Mary (O-)
I saw and greeted Sue (O+)
She left with John and Fred (J-)
The dog I saw and chased was black (B-, S-)
It was big and black (P-)
The dog and the cat I saw were black (B+, R+)
She arrived and left on Tuesday (MV-)
She did it quickly and efficiently (MV-)
The dog and cat ran (D-)
I told her that I was coming and that you would be late (TH-)
What did you tell her and what did she say (Wq-)
I left and she followed (Wd-)
Sue, a teacher and scholar, is here (MX-, Xc+, Xd-)

Some connector types are not included on the "andable" list; the corresponding conjunctive usages are therefore not permitted. Many of these usages are not exactly ungrammatical, but simply never occur, like PP+: "?They have and we have gone", G- ("?Fred Smith and Jones are here" [meaning "Fred Smith and Fred Jones are here]"), ND+ ("?She left three days or weeks ago"), and EA+ ("?She is very or somewhat competent"). Others truly seem ungrammatical, like Q+: "*Would or could you go?".

As well as "and", the system also handles the conjunctions "or", "but", "either-or", "neither-nor", "both-and", and "not only - but":

 +-----Sp----+--Os-+           +--Op---+
 |           |     |           |       |
we          ate popcorn or  watched movies
we either   ate popcorn or  watched movies
we neither  ate popcorn nor watched movies
we both     ate popcorn and watched movies
we          ate popcorn but watched movies           
we not only ate popcorn but watched movies 

All the words involved in conjunctive constructions--"and", "but", "or", "both", "nor", "either", "neither", "not", and "only"--must be included in the dictionary. (If such a word is removed, its conjunctive use will be disabled.) However, such words may also be given ordinary linkage expressions, and in fact are. These ordinary usages are considered along with their conjunctive usages. If no ordinary linkage expression is desired for a word, simply give it a linkage expression containing a dummy connector of some kind that will never be used, like "NO+".

A few usages of coordinating conjunctions are handled using ordinary link logic (this is why there are ordinary connector expressions for these words in the dictionary). There is some overlap between the special ("fat-link") handling of conjunctions and the ordinary handling, so that some sentences receive multiple parses. For example, ordinary clauses conjoined together will receive two parses: "John ran and Fred walked". See the entries in the Guide-To-Links on "W" and "CC" for discussion these ordinary usages of conjunctions.

5.3. SUBSCRIPTS. How should subscripts (on the connector names) be dealt with? When two or more connectors with different subscripts are combined with "and", they may only connect to a connector that may connect to all of them. For example, consider the following dictionary:

a:            Ds+; 
the:          D+;
those:        Dm+;
cats dogs:    Dm-;
cat dog:      Ds-;

Among the determiners above only "the" can grammatically be allowed to modify the "and" list "cats and dog". This is because the only connector which matches Dm- and Ds- is D+, not Ds+ or Dm+. This is the solution we implement.

There is an exception to be handled here, however. The system we've described so far would accept "the dog and cat runs", while rejecting "the dog and cat run". Both of these judgements are wrong because in English whenever two singular subjects are "anded" together they become plural. We have incorporated this exception: "Ss+" connectors, when "anded" together, may connect to an "Sp-", but not to an "Ss-".

5.4. SOME PROBLEMS. There are a few problems to be discussed. Some of these are handled by the current system; others are not. One problem is sentences like the following:

I gave Bob a doll and Mary a gun.
This is a problem Moscow created and failed to solve.

The former will be rejected since in "I gave Bob a doll", "gave" is linked to both "Bob" and to "doll". Thus, "Bob a doll" cannot be an element of a strict "and" list. In the second sentence, "Moscow" needs to connect to "failed" and "problem" must connect to "solve", so "problems to solve" cannot be an element of a strict "and" list. This phenomenon does occur (although rarely) in formal English, so we would like to solve it. The problem remains in our current system.

Another problem arises with embedded clauses. Consider the following linkage of the sentence "I think John and Dave ran".

           |   |     |            |
           I think John and Dave ran

                             |    |
           I think John and Dave ran

This linkage is a combination of the following two sentences "I think John ran" and "Dave ran". This linkage should clearly be rejected. (Actually, this linkage would not be found anyway in the current version, but this is a simple demonstration of the problem.) Intuitively, the problem with this linkage is that the same "S" link (the one between "and" and "ran") is being used to mean something that "I think" ("John ran") and also something that is just a fact ("Dave ran"). We have devised a system for detecting such patterns, using post-processing (see section 6). As mentioned above, we handle conjunction sentences by expanding the original sentence into several subsentences. We then compute the domain structure of the resulting linkage of each sentence. Finally, the original linkage is deemed incorrect if the nesting structure of a pair of links descending from the same link ("e.g." the "S" links in the two sentences above) do not have the same domain ancestry (are contained in the same set of domains). Linkages which are considered incorrect in this way have the message "inconsistent domains" at the top of the display.

Another problem concerns the different kinds of conjunctions. Right now, our system does not distinguish between the various kinds of conjunctions allowed; any of them may be used with any "andable" connector. However, there appear to be different constraints on different conjunctions. This results in some false positives:

I saw John and Fred
*I saw John but Fred
The dog or cat ran
*The either dog or cat ran

A few other smaller problems should be mentioned. Sometimes adverbs are used with conjunctions:

He talked to Steve and, apparently, Fred
*He talked to, apparently, Fred

As the second sentence shows, it is the conjunction that makes the first adverbial use valid. We have no good way of handling this construction. Secondly, there are some special uses of punctuation with conjunctions. Sometimes, a comma is inserted both before and after the final element in an "and" list (ex.1). And sometimes semi-colons may be used instead of commas, particularly when the and-list elements themselves contain commas (ex. 2).

1. John, and Steve, are coming
2. John; my advisor, Steve; and several other people are coming

Thus our system still needs some work in the area of conjunctions.

6. Post-Processing

6.1. THE LOGIC OF POST-PROCESSING. Besides conjunctions, there are certain phenomena in English which the parser is incapable of dealing with in its basic form. To solve these problems, we developed a post-processing system, based on a concept we call "domains". A domain is a subset of the links that make up a sentence. After a linkage has been found, the post-processing mechanism goes through the linkage and divides the sentence up into domains based on the kind of links that are present in the sentence. It then further divides the links into "groups": sets of links which share a particular domain membership. It then applies rules which may declare the linkage invalid based on the combinations of links present in a given group.

The information about the the domains to be formed and the rules to be applied must be included in a "post-process knowledge file". The file we provide with version 4.0 is called "4.0.knowledge". This is the default post-process file, and will be used if no other file is specified. As explained earlier, a post-process file can be specified on the command line as follows: parse [dictionary] -pp [post-process file] If you type "-ppoff" rather than "-pp [post-process file]", then no post-process file will be used and no post-processing will be done.

6.2. THE DOMAIN STRUCTURE. A domain is started by a certain type of link; we call this the "root link" of the domain. The word on the left end of this link is the "root word". Different types of links start different types of domains. To understand how domains are formed, imagine a linkage as a kind of street map, through which paths can be traced. In most cases, the domain of a link consists of all the links in the sentence that can be reached from that link, without tracing through the root word. (Another way of saying this is: the domain contains everything reachable from the right end of the root link, without tracing through the root link itself.) For example, assume that the C link in the following sentence begins a domain. This domain will include the Sp link and the I, but not the Ss and the O.

         +-Ss+O-+   +Sp(e)+I(e)+        
         |   |  |   |     |    | 
	He told me they would  go

(The letters in parentheses indicate that the Sp and I link are in an e-type domain.) Very roughly speaking, domains tend to include the links to the right of that link. However, there are many exceptions. In the following sentence, the C link starts a domain; this includes the Ss and O links, but not the Xc, CO, or Sp, even though the Sp is entirely to the right of the C.

          +-------Xc--------+  |
          +-C-+Ss(s)+O(s)+  |  +Sp+
          |   |     |    |  |  |  |
	After he   saw  us  ,  we left

In the case below, the domain started by the C link actually extends back to the left of the root link, to contain the B link as well. This follows naturally from the way domains are defined. The B link can be reached from the right end of the C link, without tracing left through the C. In this case, note that the B link extends to the left of the root word.

         |   +---I---+         |
         |   +SI+    +-C--+S(e)+
         |   |  |    |    |    |
	Who do you think you  saw

If one wants a certain link type to start a domain, it must be included in the list "DOMAIN_STARTER_LINKS" in the file "post-process.knowledge". It must also be included in the following list, ("STARTING_LINK_TYPE_TABLE"), along with the name of the domain-type that it starts.

As we saw above, it is possible for domains to be traced back to the left of the root word; and from there, they might be extended indefinitely. This tends to be prevented, however, by what we call "restricted links". If a link is restricted, this means that if a domain is being traced through it to the left, and if the link extends to the left of the root word of the domain (the left end of the root link), the domain will be traced no further through the restricted link. So, for example, Bsw is a restricted link. This means that, in the sentence above, the "e" domain started by the C will contain the Bsw; but if there are any links which can be traced from the left end of the Bsw (i.e., another link coming out of the word "who"), they will not be included in the domain.

There are a few other complications in the way domains are generated.

1. The root link of a domain may or may not be included in the domain it starts. Root links will not be included, unless they are listed in a special list in post-process.c, "DOMAIN_CONTAINS_LINKS".

2. Some domain-types are "bounded domains". This means that they are not allowed to extend to the left of their root link at all (even with a "restricted link"). If they do extend in this way in a linkage, the linkage will be declared invalid. (See "C" in the Guide to Links for an explanation of how this is used.) Domain types in this category must be listed under "BOUNDED_RULES".

3. As well as ordinary domains, there are two special kinds. One kind is "urfl" domains. As well as including everything that can be reached from the right end of the root link, these domains include everything that can be reached from the left end of the root link, tracing to the right, underneath the root link (but not over it), hence the name -- "Under Root From Left". In the example below, the TOo link starts an "urfl" domain; as well as including the "I" link, as a normal domain would, this domain also includes the O.

      +-S-+O(x)+   +I(x)+
      |   |    |   |    | 
      I asked him to   go

The final kind of domain is "urfl-only". These include ONLY links that can be reached from the left of the root link, tracing to the right underneath the root link. In the case below, the Ss*g link starts an "urfl-only" domain.

         +---O(d)---+    |
         +     +D(d)+    +-O+  
         |     |    |    |  |  
      playing the piano is fun 

"Urfl" and "urfl-only" domains are defined by the link-types that start them. Any link that is added to the "URFL_DOMAIN_STARTER_LNKS" list will start an "urfl" domain; any link that is added to the "URFL_ONLY_DOMAIN_STARTER_LINKS" will start an "urfl-only" domain.

4. Finally, certain link types are simply ignored by post-processing. No domains may be traced through them, nor can they start domains. These are listed under "IGNORE_THESE_LINKS".

Domains may be nested; a link may therefore be in several domains at once. The domain membership of a given link can be shown in the following way:

  +--W-+-S+-P-+-MV-+-C+-S-+-C+-S+--P--+       |
  |    |  |   |    |  |   |  |  |     |       |
///// he got mad when I said I was leaving  \\\\\

               /////          RW      <---RW---->  RW        RIGHT-WALL
 (m)           /////          Wd      <---Wd---->  Wd        he
 (m)           he             Ss      <---Ss---->  S         got
 (m)           got            Pa      <---Pa---->  Pa        mad
 (m)           mad            MV      <---MVs--->  MVs       when
 (m)           when           Cs      <---Cs---->  C         I
 (m) (s)       I              Sp*i    <---Sp*i-->  S         said
 (m) (s)       said           Ce      <---Ce---->  C         I
 (m) (s) (e)   I              Sp*i    <---Spii-->  S*i       was
 (m) (s) (e)   was            Pg      <---Pg---->  Pg        leaving

To have the parser produce a "link list" of this kind, type "!links". (In the structure shown here, the domain structure is strictly hierarchical; every domain that is partially inside another is completely inside that domain. There is no a priori reason why domain structure should always be strictly hierarchical; but we believe that, given the current grammar, it will almost always be strictly hierarchical, except in a few very obscure cases.)

6.3. GROUPS. The domain structure is really a means to creating a more useful kind of structure. This is the "group". A "group" of links is the set of links that have the same domain membership. In the above example, then, the "Spii" and "Pg" are part of the same group. The "Ce" and "Sp*i" are in another group. The Ce and Pg are not in the same group. Groups correspond roughly to subject-verb expressions - groups of links that are part of a clause, but not part of any dependent clauses within that clause. For example, in the above case, "He got mad" is one subject-verb expression; "I said ..." is another, "I was leaving" is a third.

6.4. POST-PROCESSING RULES. The domain structure thus divides a sentence into groups of links. This then allows us to enforce constraints on the link-types that are in a subject-verb expression. This is useful in cases where there are constraints on the combinations of links that can be present in a clause, but the links may be separated (i.e., they may not all connect to the same word), making the constraints difficult to enforce using link logic. We do this by using certain kinds of rules. One is the "contains_one" rule. This says that if a group contains a link of a certain type, it must contain a link of another type. Another kind of rule is the "contains_none" rule. This says that if a group contains one kind of link, it may not contain any of a certain kind of link.

In each case, we have a "triggering" link-type: a link that triggers the rule, and enforces a certain constraint. We also have a "criterion" link-type: a link that defines the constraint (whether it is "must contain X" or "may not contain X"). The triggering link for a rule must be contained in a line of the form

        X, Y Z, "Warning!",

where "X" is the triggering link-type, "Y" and "Z" are the criterion links for the rule, and "Warning" is a message that will be outputted when the rule is violated. (This warning will only be outputted when the incorrect linkage itself is outputted, and this only happens in "!bad" mode. It will also only be outtputed when the "!links" display is shown.)

If such a rule is listed under "CONTAINS_ONE_RULES", this means, "A group containing an X connector must contain either a Y or a Z". (The group may contain more than one from the set: a Y and a Z, 2 Y's, etc.) If it is listed under "CONTAINS_NONE_RULES", this means "a group containing an X link may not contain either a Y or a Z".

One can also define a symbol as a set of links:

        @Blah: Y Z

In this line, "@Blah" is defined as the set of links {X, Y}. "@Blah" can then be used in actual rules. Given this definition, the rule below will have the same effect as the rule stated above:

        X, Blah, "Warning",

where "Y" and "Z" are the criterion links for the rule.

A further kind of rule is used in post-processing, which does not relate to domain structure. If a certain link type is included in the list "MUST_FORM_A_CYCLE", this means that any links of that type may only be used in cycles. For example, "R" is on list. The linkage below satisfies this rule in the case of "R", because "R" is used in a cycle here.

  +----Wd---+----Bs---+     |     
  |    +-Ds-+--R-+Sp*i+     +-Pp-+
  |    |    |    |    |     |    |
///// the man.n I.p saw.v is.v here 

Certain link-types (relating to punctuation) proved to disrupt the domain structure; it seemed preferable for them to simply be ignored by post-processing. Thus we created a list in post-process.knowledge, "IGNORE_THESE_LINKS", which are ignored for in the creation of the domain structure (that is, domains are not traced through them), and also in the application of the post-processing rules. (They may be included in the "MUST_FORM_A_CYCLE" list, however, if desired.)

6.5. POST-PROCESSING AND CONJUNCTIONS. The interaction of post-processing and conjunctions is straightforward. As discussed earlier, any linkage with a conjunction can be regarded as several sub-linkages, and in fact is generally displayed that way. In post-processing, these sub-linkages are treated individually; domains are generated and rules are applied in each sub-linkage. A conjunctive linkage is only approved by post-processing if all of its sub-linkages are approved.

6.6. LINK-TYPE MATCHING IN POST-PROCESSING. Link-type matching in post-processing requires some explanation. In the dictionary, subscripts are used to create sub-categories of connector-types. "Ss+" will link with "Ss-" and "S-", but not "Sp-". The character "*" is used as a wild-card; it will match to any character. An unsubscripted connector name, like "S+", can thus be regarded as equivalent to "S***...+"

The post-processor also requires a system for matching connector types. While the linkage stage is looking at link-types on connector expressions, post-processing is looking at the resulting link-types that are formed when a linkage is complete. (If an "Ss+" has linked together with a "S*a-", what the post-processor sees is "Ssa".) It is then comparing them to link-names listed in "post-process.c" (as domain-starting links, as triggering links for rules, as criterion links for rules, etc.). The link-type matching system used in post-processing is similar to the linkage-level one, but it is a little different. As mentioned above, from the dictionary's point of "*" is a wild-card character that matches to any other character (see section 3.1). From the post-processor's point of view, however, "*" is just another character; "S*" in a linkage does _not_ match with an "Ss" in post-process.knowledge. Nor does an "S*" in post-process.knowledge match with an "Ss" in a linkage. However, there is also a wild-card character in post-processing; this is "#". An "S#" in post-process.knowledge does match with a "Ss" in the dictionary, as well as with an "S*".

As mentioned above, at the linkage stage, the parser treats any unsubscripted places in a link name as "*"'s. For the post-processor, however, any empty subscript places in a link name - either in a linkage or in a post-processing rule - are treated as "#'s". In short: at both the linkage stage and the post-processing stage, empty subscript places are treated as wild-cards that match to anything. For explicitly labeling a subscript place as a wild-card, however, one uses "*" at the linkage-stage and "#" in post-processing.

The specific uses of post-processing are fully explained in the "Guide to Links". For example, see "SF: filler-it"; "SI", and "MV: Comparatives".

6.7. THE "!BAD" COMMAND. For people editing the dictionary or post-processing, a useful display feature is "!bad". The usual running of the parser is that it will display the linkages in order of cost (lowest-cost ones first); additional linkages after the first one can be seen one at a time by typing RETURN. Only linkages passing post-processing are outputted. If the "!bad" variable is toggled, however, the parser will output _all_ of the linkages it finds, including those that fail post-processing. The invalid linkages will be outputted after the valid ones, one by one, as long as the user types RETURN. The parser will only output the linkages it finds at whatever stage it had gotten to when it found a valid linkage. For example, if it had gotten to null-link stage 2 before finding its first valid linkage, it will also output invalid linkages found at null-link stage 2 (see section 7.1 for explanation of null-links). There is no way of seeing invalid linkages found at earlier stages.

7. Speed and Robustness Features

The original version of the parser, as described in previous sections, did an exhaustive search for all correct linkages; if none were found, it produced nothing. However, this meant that the parser was not "robust": it could not do anything useful with a sentence unless it could parse the entire thing. It was also quite slow. We have developed several remedies to these problems.

7.1. THE NULL-LINK SYSTEM. An important feature of the parser is the "null-link" system. This effectively allows robust parsing: that is, it allows the parser to assign some structure to a sentence even when it cannot fully interpret it. Basically, if the parser cannot parse a sentence normally (that is, if it cannot find any valid linkages), it tries ignoring one word in the sentence. It finds all the linkages it can, ignoring just one word (some linkages may ignore one word, some may ignore another). This is "null link stage 1". Failing that, it then attempts to find linkages ignoring 2 words. This is "null link stage 2". Failing that, it will continue to increment the number of null links, until it finds some valid linkages; it will then output all the linkages found at this stage, and stop. There may be some cases where it cannot find a valid linkage unless it ignores _all_ the words in the sentence; in this case, it simply gets to "null link stage N" (where N is the number of words in the sentence), and then gives up.

In the graphic display, "null-linked" words are shown in brackets, with no links attached to them. In the sentence below, "gosh" and "," are null-linked:

             |       |        |        |
[gosh] [,] this sentence.n uses.v null-links 

In null-link parsing, the connectivity requirement is suspended (see Section 1.3). This means that disconnected "islands" may form. However, each island represents one added null link. That is, if a sentence can be parsed as three disconected islands (but with all the words otherwise connected with regular links), this will linkage will be found at null link stage 2.

Note that the null-link systems respects post-processing. It keeps incrementing the number of null-links until it finds linkages that pass post-processing. For example, if linkages are found at null-link stage 0, but they all fail post-processing, the parser will decide that no valid linkage has been found and will proceed to null-link stage 1. However, if some of the links in a linkage are not indirectly connected to the left-wall--as may occur in some "islands"--they may not be contained in any post-processing domain. Such links will not be subject to "contains_one" and "contains_none" rules, which apply only to links that are in a domain.

The null-link system can be turned on or off by typing the command "!null". The default is that null-links are on. If null-links are turned off, then, when ther parser is unable to find a complete linkage for a sentence, it will say "No complete linkages found", and prompt for the next sentence.

7.2. THE LINK-LENGTH LIMIT. In studying the parser's performance on very long sentences (on which it was often very slow), we discovered that it was often considering extremely long links even for link-types which are generally very short, such as "D" (connecting determiners to nouns). For this reason, we installed a "link-length-limit": links are only allowed to be a certain length, in terms of the number of words from end to end. However, not all link-types are always short; some are sometimes quite long, such as "S" and "MV". Thus we include a list in the dictionary, "UNLIMITED-CONNECTORS", for link-types which are to be exempted from this limit. The limit for short link-types can be adjusted while running the parser using the command "!short=[length]". The default value for this is 6.

If you wish to disable the length limit, so that all link-types may be any length, remove the "UNLIMITED-CONNECTORS" list from the dictionary. If you want all link-types to be _short_, so that none are exempted from the length limit, you must include the "UNLIMITED-CONNECTORS" list in the dictionary, with some bogus connector such as "NO+", but otherwise empty.

The interaction between the length limit and conjunctions is complex. In sentences containing a conjunction, it seemed sensible to relax the length limit. For example, D links are very rarely 7 words long (as in sentence 1 below), but they might be this long in a sentence containing a conjunction (sentence 2).

 |                           |
The big big big big big big dog ran (rejected)

The big big dog and big big cat ran (accepted)

The logical thing to do seemed to be to apply the length limit only to subsentences; that is, when the length of a link to one element in an "and"-list is calculated, the length of any intervening elements in the "and"-list is subtracted. We call this adjusted length the "effective length" of a link. Therefore sentence 1 above is rejected; sentence 2 is accepted. In null-link stages, however, this is not done; link length is simply equal to actual length. (For complex reasons, linkages with conjunctions which exceed the length limit may occasionally be accepted in null-link stages.)

7.3. THE POST-PROCESSING LIMIT. Since post-processing proved to be a major source of the slowness of the parser, we installed a "post-processing limit". This is simply a limit on the number of linkages that will be considered by post-processing. If the limit is set at 100 (this is the default), then only 100 linkages will be considered by post-processing, even if many more than that are generated; the others will just be discarded. This means, of course, that the "best" linkage (by the parser's heuristics, for example) may be discarded. However, the linkages to be considered by post-processing are selected randomly from all the generated ones, which means that at least one linkage is likely to be found which is fairly similar to the correct one. The number of sentences considered by post-processing can be controlled with the variable "!limit=[number]".

7.4. THE TIMER. The parser also has a timer. When the timer is set to a positive value, the parser will simply abort parsing of a sentence after that number of seconds has elapsed. (When the time is set to -1 -- its default -- it does nothing, and the parser will run indefinitely.) Set the timer while running the parser by typing "!timeout = [time value in seconds]".

7.5. PANIC MODE. Combining several of these features, we created a new mode of parsing called "panic mode", in which the parser can parse even very long sentences quickly, but with considerably reduced accuracy. Panic mode has the following features. 1) All link-types are considered "short", even those included in the UNLIMITED-CONNECTORS list. 2) Panic mode begins immediately with null-link parsing, skipping regular parsing (the main consequence of this is that the "effective length" of links is considered to be their actual length; see Section 7.3). 3) Whereas only connectors of cost 0, 1 or 2 are considered in regular parsing, all connectors are considered in panic mode. At the moment, the only connectors in the dictionary with cost greater than 2 (they have cost 3) are connectors on nouns which allow them to make no main connection at all:

dog man house: {@A-} & D- & (S+ or O- or J- or [[[()]]]);

In effect, this allows noun-phrases to be treated as complete entities, not connected to anything else. (Recall that in null-link parsing, the connectivity rule is suspended, so that disconnected "islands" may form.) In a typical "panic mode" linkage, then, various noun-phrases and clauses within the sentence will be fully parsed, but they will often not be connected to each other.

Panic mode is invoked whenever the timer runs out on regular parsing and no valid linkages have been found. For example, if the timer is set for 15 seconds, panic mode will begin after 15 seconds. Panic mode can be disabled (or turned back on again), while running the parser, by typing "!panic".

Davy Temperley
Last modified: Mon Mar 22 09:25:46 EST 1999