A User Manual for Link Grammars

Davy Temperley

CONTENTS

1. The logic and notation of link grammars
2. Other general features of the parser
3. Special features of the dictionary
4. Coordinating conjunctions
5. Post-processing

1. THE LOGIC AND NOTATION OF LINK GRAMMARS

THE BASIC IDEA. Think of words as blocks with connectors coming out. There are different types of connectors; connectors may also point to the right or to the left. A left-pointing connector connects with a right-pointing connector of the same type on another word. The two connectors together form a "link". Right-pointing connectors are labeled "+", left-pointing connectors are labeled "-". Words have rules about how their connectors can be connected up, that is, rules about what would constitute a valid use of that word. A valid sentence is one in which all the words present are used in a way which is valid according to their rules, and which also satisfies certain global rules.

WORD RULES. A simple dictionary entry would look like this:

    blah: A+;

This means that if the word "blah" is used in a sentence, it must form an "A" link with another word; that is, there must be another word to the right of it with an "A-" connector. Otherwise the sentence is not valid. The expression following the colon is the "linking requirement" for the word.

A word may have more than one connector that has to be connected. This would be notated as

    blah: A+ & B+;

A word may have a rule that either one of two (or one of several) connectors can be used, but exactly one must be used. In the dictionary, we notate this as

    blah: A+ or B-;

This means that the word can make either an "A" link to the right or a "B" link to the left; either use in a sentence is valid, but it must make one or the other, and it can not make both.

These rules can be combined. For example, consider the following notation:

    blah: A+ or (B- & C+);

This means that the word must make either an "A" link to the right, or a "B" link to the left and a "C" link to the right. No other combination will be valid. Such expressions can be nested without limit, such as

    blah: (A+ or B-) & ((C- & A+ & (D- or E-)) or F+);

Some connectors are optional; this is notated with curly brackets. For example:

    blah: A+ & {B+};

This means the word must make an "A" link to the right, and it can make a "B" link to the right but does not have to. Curly brackets can also be put around complex expressions, like

    blah: (A+ or B+) & {C- & (D+ or E-)};

A word can also make an indefinite number of links of the same type to other words. For this, we use the "multi-connector" symbol "@". For instance, the word below could make any number of F links to words to the right (but is not required to make any).

    blah: (A+ or B+) & {C- & (D+ or E-)} & {@F+};

(If a word has "@A+", with no curly brackets, it is required to make at least one A link to the right; any others are optional.)

The ordering of elements in the connector expression is important. What it dictates is the relative closeness of the words that are being connected to: the further to the left the connector name, the closer the connection must be. For example,

    blah: A+ & B+;

means that "blah" must make an "A" link to the right and a "B" link to the right, and the word it makes the "A" link with must be closer than the word it makes the "B" link with. This only pertains, however, to connections in the same direction. For connectors pointing in opposite directions, the ordering is irrelevant.
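To make the ordering rule concrete, here is a small invented example (the words and connector names are made up for illustration). Suppose the dictionary contains:

    blah: A+ & B+;
    blee: A-;
    bloo: B-;

Both of "blah"'s connectors point to the right, so their ordering matters. The sequence "blah blee bloo" can be linked up: "blah" makes its A link to "blee", the closer word, and its B link to "bloo", the farther word. The sequence "blah bloo blee" has no valid linkage, because the A link would then have to reach the farther word, contrary to the ordering of the connectors in "blah"'s expression.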
Thus, since ordering only matters for connectors pointing in the same direction,

    blah: A+ & B-;

means exactly the same thing as

    blah: A- & B+;

For that matter,

    blah: A- & B+ & C+ & D-;

means exactly the same thing as

    blah: B+ & C+ & A- & D-;

For "or" expressions, such as "A+ or B+", the ordering of the elements is irrelevant.

A dictionary entry thus consists of a word, followed by a colon, followed by a connector expression, followed by a semi-colon. The dictionary consists of a series of such entries. Any number of words can be put in a list, separated by spaces; they will then all possess the linking requirement that follows:

    blah blee blay: A+;

A connector name must consist of one or more capital letters (any number may be used), followed by "+" or "-".

GLOBAL RULES. As well as these "word rules", which are specified in the dictionary, there are a few other global rules which control how words can be connected. First of all, links can not cross. For example, the following way of connecting these four words (connecting "cat" to "dog" and "horse" to "fish") would be illegal; the parser simply will not find such linkages.

    +------------+
+---|------+     |
|   |      |     |
cat horse  dog   fish

This is the "crossing-links" (or "planarity") rule.

Secondly, all the words in a sentence must be directly or indirectly connected to each other. Therefore the following way of connecting these four words would be illegal (if it were the entire linkage).

+---+     +---+
|   |     |   |
cat horse dog fish

This is the "connectivity" rule.

A valid sentence is therefore one which can be linked up in a way that a) all the words are used in a way that satisfies their linking requirements; and b) the crossing-links and connectivity rules are not violated.

RUNNING THE PARSER. To run the parser, you must have all the parser files (all the ones in the ftp directory ending in ".c", ".o", and ".h", as well as "Makefile") in the same directory. You must also have a dictionary in the directory, containing the words that can be used and their connector expressions. Then run the parser by typing "parse [dictionary]" (whatever the name of your dictionary file is). The parser will then ask you to type in sentences. It will tell you if the sentence you type in has a valid linkage or linkages, given the dictionary it is using, and it will output the linkage that it finds, showing the words that are linked together and the type of link between them. So, in the following linkage, the parser has found a "D" link between "the" and "dog", and an "S" link between "dog" and "ran".

+-D-+-S-+
|   |   |
the dog ran

To exit the parser, type "!quit".

The parser will do an "exhaustive" search for linkages; it will generate all valid linkages. It will begin by displaying the lowest-cost linkage it finds. Other linkages may then be seen, one at a time, by typing RETURN. (The ordering of the output is determined by the cost system: see "THE COST SYSTEM".)

LINK GRAMMAR IN RELATION TO OTHER SYSTEMS. The structure assigned to a sentence by a link grammar is rather unlike that of any other grammatical system that we know of (although it is certainly related to dependency grammar). Rather than thinking in terms of syntactic functions (like subject or object) or constituents (like "verb phrase"), one must think in terms of relationships between pairs of words. In the sentence below, for example, there is an "S" ("subject") relation between "dog" and "has"; a "PP" (past-participle) relation between "has" and "gone"; and a "D" (determiner) relation between "the" and "dog".
+-----Ds----+
|   +---A---+-Ss--+PP-+
|   |       |     |   |
the black.a dog.n has gone

Where possible, we try to give link-types names that have mnemonic significance in this way. It may be seen, however, that parts of speech, syntactic functions, and constituents may be recovered from a link structure rather easily. For example, whatever word is on the left end of an "S" link is the subject of a clause (or the head word of the subject phrase); whatever is on the right end is the finite verb; whatever is on the left end of a D link is a determiner; etc. As for constituents, the verb phrase of a clause contains whatever words are on the right end of an S link (that is, whatever words can be traced through the link diagram by starting at the S link and tracing to the right); the subject phrase consists of the words on the left end of the S link (but not tracing through links such as "C" and "W" which lead out of the clause); and so on. Therefore, if it were desirable to identify traditional syntactic elements like constituents and syntactic functions from the linkage structure, we think it would be quite possible to do so.

If you wish to get a better understanding of this particular grammar, we suggest the following. Try inputting sentences to the parser that seem simple or basic to you. See what kind of links the parser assigns to them (probably things like S, D, O, MV, A, and P). Look these up in the "Guide to Links" file, and try to understand how they work. Then continue with more complex sentences. (A simple version of the grammar is explained in our "Link Grammar Tech Report", available over ftp. However, some aspects of this simplified grammar are now obsolete.)

2. OTHER GENERAL FEATURES OF THE PARSER

CONNECTOR SUBSCRIPTS. In general, a connector may only link to another one with the same name, i.e., the same string of capital letters. However, there is another way of controlling how connectors may link to each other, using connector subscripts. A subscript is a lower-case letter following a connector name, as in "Ss+". An "Ss+" connector can connect with an unspecified "S-" connector, or an "Ss-" connector, but not with an "Sp-" connector. Connector types may have multiple subscript characters, such as "Spa+". An "Spa+" can connect with an "S-", an "Sp-", or an "Spa-", but not with an "Ss-" or an "Ssa-" or an "Spb-". The "*" subscript is a "wildcard" that can connect with anything. Therefore, an "S*+" is exactly the same as an "S+". An "S*a+" can connect with an "S-", an "Ss-", an "Sp-", or an "Ssa-", but not with an "Ssb-".

MACROS. It is possible to define a single symbol as standing for a longer connector expression, and then use that symbol to refer to the longer expression in the dictionary. To do this, simply choose a name for the longer expression, and surround it with angle brackets (<>). Then treat it like a word in the dictionary: list the name, then a colon, then the connector expression that it should stand for. For example, we define "<noun-main-s>" in the dictionary as follows:

    <noun-main-s>: (Ss+ & <CLAUSE>) or SIs- or Js- or Os- or ({[Bsj+]} & Xd- & Xc+ & MX-);

We then use this symbol in many other actual word definitions. We use many of these macros in the dictionary, to reduce redundancy; there are many connector expressions that are used over and over in longer expressions. Here are a few common ones:

<noun-main-s>, <noun-main-p>, etc.: the "main" connectors for nouns, used to link them to the rest of the sentence (as subject, object, etc.).

<noun-sub-s>, <noun-sub-p>, etc.: the "sub" connectors for nouns, used to link them to modifiers like prepositional phrases and relative clauses.
<verb-s>, <verb-pp>, <verb-sp,pp>, etc.: These macros are for verbs; they distinguish different forms of the same verb. That is, they contain the connector types - like S-, PP-, etc. - that distinguish different forms of the same verb. <verb-s> is for singular verbs, <verb-pp> for past participles, <verb-sp,pp> for forms which are both simple past and past participle, etc.

<vc-...>: These macros are for verb complements; they stand for different complement expressions. Some verbs can connect to a direct object, using O+; some can connect to an infinitive verb, using TO+; and so on.

WORD-FILES. The most basic way to write the dictionary is to list all the words in a particular category, followed by a colon, followed by their connector expression. There is another way, however. One can put all the words in a category in a file, choose a name for the file, and put that filename in the dictionary in place of the list of words. When listed in the dictionary, the filename must be preceded by a slash (/). (An example is given below.) Here are the word files that are in use at the moment:

    words.n.1      singular countable (i.e. not mass) nouns
    words.n.2.s    plural nouns ending in "s"
    words.n.2.x    plural nouns not ending in "s"
    words.n.3      mass nouns
    words.n.4      nouns that may be mass or countable
    words.n.c.1    singular nouns that are sometimes capitalized
    words.n.c.2    plural nouns that are sometimes capitalized
    words.n.p      proper names that are also ordinary words when not
                   capitalized (see "CAPITALIZATION" for explanation)

(In the following verb files, the final number indicates the verb form. ".1" is for infinitive-plural forms, ".2" is for singular forms, ".3" is for simple-past / past-participle forms, ".4" is for present participles, ".5" is for gerunds. For intransitive verbs, the present-participle and gerund expressions are combined into a single dictionary entry.)

    words.v.1.(1-4)    intransitive verbs
    words.v.1.p        special two-word passives ("lied_to", "paid_for")
    words.v.2.(1-5)    optionally transitive verbs
    words.v.4.(1-5)    transitive verbs
    words.v.5.(1-4)    intransitive verbs that may form two-word verbs with
                       particles like "up" and "out"
    words.v.6.(1-5)    optionally transitive verbs that may form two-word verbs
    words.v.8.(1-5)    transitive verbs that may form two-word verbs
    words.v.10.(1-5)   verbs that may be used in quotation expressions, like
                       "said" ("John is here, he said")
    words.adj.1        ordinary adjectives, with no special complements
    words.adj.2        ordinary comparative adjectives (e.g. "bigger")
    words.adj.3        ordinary superlative adjectives (e.g. "biggest")
    words.adv.1        ordinary manner adverbs ("quickly", "angrily")
    words.adv.2        ordinary clausal adverbs ("fortunately")
    words.adv.3        adverbs like "chemically"
    words.y            common year numbers ("1990", etc.)
    words.s            US state names and abbreviations

WORD SUBSCRIPTS. A single word can be given several different dictionary entries. To do this, the entries must be distinguished by giving the words different subscripts. Words may be followed by a subscript such as ".n". For example:

    run.n: A+ or B+...
    run.v: C+ or D+...

The parser starts at the right end of every string of characters. Any sequence of letters to the right of the right-most period in the string will be considered the subscript. (Periods at the end of a string are simply considered part of the string; see "PUNCTUATION" below.) In searching for linkages, the parser will consider each entry for the word as a different word, and will generate all linkages found for all entries. The subscript is shown in the display, thus indicating which entry the parser chose for a particular linkage.
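To illustrate the word-file mechanism described above: rather than listing every singular countable noun explicitly, the dictionary can contain a single entry in which the file name takes the place of the word list (the connector expression here is invented for illustration; the real entry is much longer):

    /words.n.1: D- & (A+ or B-);

Every word listed in the file "words.n.1" then receives this linking requirement, exactly as if all the words in the file had been written out before the colon.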
The main word subscripts we use are ".n" for nouns, ".v" for verbs, ".a" for adjectives, and ".e" for adverbs.

THE COST SYSTEM. We have a system for assigning a cost to a linkage. This allows the parser to express preferences among the linkages it finds. The cost system uses square brackets ("[" and "]"). If a connector, or a series of connectors, is surrounded by square brackets, it is assigned a cost. The amount of cost is equal to the number of square brackets on each side: [A+] will receive a cost of 1; [[A+]] will receive a cost of 2; etc. The parser uses this cost as a criterion for deciding which linkage to output first; it outputs them in order of cost (i.e., lowest cost first). (Given several linkages of the same cost level, the parser has certain heuristics for choosing the best parse, i.e., the one to output first. It prefers the linkages in which the total length of the links is lowest; and, given a linkage with conjunctions, it prefers a linkage in which the lengths of the conjoined word-lists are similar.)

THE TWO-STAGE SYSTEM. Related to the cost system is a two-stage system. In working on expanding the parser's coverage, we noticed that there were certain constructions which came up now and then but which were relatively rare. These constructions could be added to the dictionary; but then the parser was spending a lot of time searching for rare constructions, even when perfectly common constructions could be found. It seemed sensible to devise a stage system, whereby the parser first considers only common constructions; if none can be found, it then considers less common ones. For this we use the cost system, explained above. If a connector or connector expression has a cost of 2 or more (i.e. if it is surrounded by 2 or more pairs of square brackets), it is considered a stage 2 connector, and is only considered at stage 2. The link index explains which uses of connectors are stage 2.

THE NULL-LINK SYSTEM. Another recent addition to the system is the "null-link" system. This effectively allows robust parsing: that is, it allows the parser to assign some structure to a sentence even when it cannot fully interpret it. Basically, if the parser cannot parse a sentence normally (that is, if it cannot find any valid linkages), it tries ignoring one word in the sentence. It finds all the linkages it can, ignoring just one word (some linkages may ignore one word, some may ignore another). This is "null link stage 1". Failing that, it then attempts to find linkages ignoring 2 words. This is "null link stage 2". Failing that, it will continue to increment the number of null links until it finds some valid linkages; it will then output all the linkages found at this stage, and stop. There may be some cases where it cannot find a valid linkage even if it ignores _all_ the words in the sentence; in this case, it simply gets to "null link stage N" (where N is the number of words in the sentence), and then gives up. (Note that the null-link system respects post-processing. It keeps incrementing the number of null links until it finds linkages that pass post-processing. For example, if linkages are found at null-link stage 0, but they all fail post-processing, the parser will decide that no valid linkage has been found and will proceed to null-link stage 1.) The null-link system can be turned on or off by typing the command "!null".

BATCH-MODE. It is possible to make a file of sentences, and run them through the parser all at once. Simply create a file, with one sentence on each line.
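For instance, a small batch file might contain the following two lines (this assumes a dictionary that accepts the first sentence but not the second; the "*" marking is explained just below):

    the dog ran
    *dog the ran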
When running the parser, type

    parse [dictionary name] -batch < [filename]

The user can indicate which sentences in a batch file should be rejected, and which should be accepted. (Batch-mode does not work with null-links; that is, it assumes that null-links have been turned off.) If a sentence should be rejected, precede the sentence with "*" in the file. If it should be accepted, precede it with nothing. If it should be accepted at stage 2 only, precede it with "-". After processing a batch file, the parser will print the number of errors in the file: i.e., the number of sentences on which its judgments differ from the judgments indicated by the symbols.

When running batch-mode, the parser will ordinarily output only the number of errors it makes. Thus it is primarily useful for checking sentences, to see if a particular dictionary produces the desired results (or to make sure that no earlier work has been broken by recent changes). If one begins the batch file with the command "!echo", the parser will also output the sentences as it parses them, as well as display information for the sentences on which its judgments disagree with the user's. (In "!echo" mode, the right and left walls will not be shown.)

COMMANDS AND VARIABLES. It is possible to modify the running of the parser in various ways, while running it, by typing in certain commands. The basic commands can be seen by typing "!help". Others are listed under "!variables". Many of these are self-explanatory. For example, "!display" changes the width of the parser display; "!null" turns the null-link system on or off. A few features deserve special mention. Words can be added to the dictionary while the parser is running: if you type "!blah=bleck", the word "blah" will be added to the dictionary under the same category as the word "bleck". If you make changes, and then want to save them to your dictionary files, type "!save" before typing the usual "!quit". If you do not wish to save them, exit by typing control-C. (Unfortunately, saved changes can only be made to word-files, not to the main dictionary.) One useful display feature is "!bad". Ordinarily the parser displays the linkages in order of cost (lowest-cost ones first); additional linkages after the first one can be seen one at a time by typing RETURN. If the "!bad" variable is toggled, the parser will instead simply output all the linkages it finds, including those that fail post-processing.

3. SPECIAL FEATURES OF THE DICTIONARY

CAPITALIZATION. The parser respects capitalization: that is, the use of upper- and lower-case letters. If a string is listed in the dictionary beginning with a capital letter, then a word that is inputted will only match it if it has the same capitalization. (The same goes for strings with capital letters in the middle, although this is probably of little use.) However, there are a few special cases here. There is a general category in the dictionary called "CAPITALIZED_WORDS". This is the default category for words whose first character is capitalized: any such word which is inputted, and which is not explicitly listed in another category, will be assigned to this category. This is of course useful, since most capitalized words are names which are grammatically all the same.

A special situation occurs with words at the beginning of the sentence. If a sentence-initial word has an uncapitalized first letter, it is treated in the normal manner.
If it is capitalized, the parser will first look to see if it is listed in the dictionary as either a capitalized word or an uncapitalized word. If not, it will then assign it to the generic "CAPITALIZED_WORDS" category. (If the word is listed both as a capitalized word and an uncapitalized one, the parser will try to use it in both ways. Because there are certain words which are also common names, like "Will" and "Rob", we have created a special category for them, so that when they are used sentence-initially, they will be recognized as possible names.)

HYPHENATED EXPRESSIONS. The dictionary also contains a special category called "HYPHENATED_WORDS". If a string contains a hyphen, and it is not listed in the dictionary, the parser will assign it to the category "HYPHENATED_WORDS". This is, again, useful, since hyphenated words are used somewhat "productively", and it would be very difficult to list them all.

NUMBER EXPRESSIONS. The dictionary contains a category "NUMBERS". Any numerical expression -- that is, a string consisting entirely of numerical characters -- will be assigned to this category unless it is explicitly listed elsewhere in the dictionary. (The string may also contain a period, i.e., a decimal point.)

THE UNKNOWN WORD. Recently we have installed a dictionary feature called the unknown word. A category can be defined using the string "UNKNOWN-WORD.x", where x is any subscript. If a word beginning with a lower-case letter is typed in that is not recognized, it will be assigned to that category. The word is then displayed with a question-mark in brackets, like "blah" below:

+----Wd---+
|     +-D-+---Ss----+Pp+
|     |   |         |  |
///// the blah[?].n is here

Several different unknown-word categories may be generated, labeled with different subscripts: for example, corresponding to nouns, verbs, adjectives, and adverbs. (These are the four categories we use, labeled .n, .v, .a, and .e, respectively.) The parser will search for all linkages that can be found using each entry. If it only finds a linkage for the "noun" category, then the output will show the unknown word labeled ".n": in effect, the parser is then guessing that the word is a noun.

THE SORTING OF UNKNOWN STRINGS. Notice that the parser must make decisions about how to sort unknown strings into these categories, and the ordering of the decisions is important. At the moment, the parser proceeds as follows:

1. Look up the string in the dictionary as is. If there is an exactly matching string in the dictionary, use that.
2. If the string contains a hyphen, assign it to the category "HYPHENATED_WORDS".
3. If the string begins with a capital letter, assign it to the category "CAPITALIZED_WORDS".
4. If the string consists entirely of numbers, assign it to the category "NUMBERS".
5. If the string fits none of these descriptions, assign it to the category "UNKNOWN-WORD".
6. If none of these alternatives are available, the parser will say: "the word "[whatever]" is not in the dictionary".

(Step 4 is mutually exclusive with steps 2 and 3; therefore their relative ordering is unimportant.) The category "UNKNOWN-WORD" can simply be removed from the dictionary. In this case, the parser will proceed as shown above, except that it will skip step 5.

THE WALL(S). It proved to be useful to imagine that there was a dummy word at the beginning of every sentence. We call this "the wall". The wall has a linking requirement like any other word; it is listed in the dictionary under "LEFT-WALL".
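For instance, such an entry might look schematically like this (the connector expression here is invented for illustration; the actual entry in our dictionary is more complicated):

    LEFT-WALL: W+ & {RW+};

Here the W connector would link the wall to the rest of the sentence, and the RW connector to the right-hand wall described below.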
If this entry is included in the dictionary, the wall will be automatically inserted at the beginning of every sentence. Because of the connectivity rule, it is then necessary for the wall to be linked to the rest of the sentence in order for the sentence to be valid.

Recently, we installed a "right-hand wall", which is similar to the original wall but at the other end of the sentence. This is only needed for certain punctuation phenomena. In most sentences, we use a special "RW" connector to simply connect the left-hand wall to the right-hand one. The right-wall's dictionary entry is "RIGHT-WALL". Either wall can be deactivated by simply removing the "LEFT-WALL" or "RIGHT-WALL" entry from the dictionary. (The right-wall can also be hidden from the display in cases where the usual "RW" link is being used; to do this, type "!walls".)

IDIOMS. A string of words can be defined as a single dictionary entry. To do this, simply join the words together with underbars:

    a_la_mode: A+ or B-;

Most idioms can be interpreted either as a single "idiom" or as a string of words (for example, "in question"). In this case, the parser will find all linkages with both interpretations.

PUNCTUATION. The parser is capable of handling a variety of punctuation symbols. There are two issues to be discussed here. One is the listing of symbols in the dictionary; the other is the way they are "read" by the parser when they are used in sentences.

Punctuation symbols can be listed in the dictionary just like words, and given ordinary linkage expressions. The same is true for strings containing multiple punctuation symbols or a mixture of letters and punctuation. The problem here is that certain punctuation symbols are also used in the "syntax" of the dictionary: colons, semi-colons, ampersands, etc. Our solution to this is as follows: when listing these special characters, or a string containing them, one must put them in quotation marks:

    ";": A+ or B-;
    "+": C+ or D-;

(The special characters that must be treated this way are precisely those which are used in the dictionary in a "syntactic" way: "(", ")", "{", "}", "[", "]", "@", "%", "&", "*", "+", "-", "/", "<", ">".)

When punctuation symbols are used in sentences, they will be used in linkages according to the connector expressions listed in the dictionary, in the normal way. There is a difference, however. It may be noted that although many punctuation symbols are similar to words in the ways they are used, they are often not separated from preceding or following words by spaces. To handle this, before the parser begins parsing the sentence, it goes through and inserts spaces before or after certain punctuation symbols (but only when the symbol occurs at the beginning or end of a string). It will then parse the sentence as normal; since the punctuation symbol now has spaces on either side, it will naturally be treated as an independent word. The symbols that are "stripped" in this way are as follows:

"Right-stripped" characters (when these occur at the right end of a string, the parser will insert a space to the left):

    , . ! ? % ) ' 's

"Left-stripped" characters (when these occur at the left end of a string, the parser will insert a space to the right):

    $ (

Note that the "right-stripped" list includes the string "'s". This is useful for possessives ("John's") and contractions ("he's", "it's"). A further point: the parser will perform this stripping process on a string repeatedly, if necessary, perhaps stripping off several characters.
For example, it will convert the first sentence below into the second:

    John, a professor (who got a raise of 5%), is here
    John , a professor ( who got a raise of 5 % ) , is here

There is one difference between right-stripped and left-stripped characters. Right-stripped characters can also be used at the right end of a string defined in the dictionary; for example, one could actually define "Mrs." or "it's" as ordinary dictionary entries. Before stripping off right-stripped characters, the parser will make sure that the string is not present in the dictionary as it appears. If any of these symbols are used in the middle of strings, they will be treated just like any other symbols. (And if they are used in sentences in undefined strings, they will be treated as some kind of unknown string.) (Note that periods are special, however; if a string ends with a period followed by one or more letters, those letters will be interpreted as the word subscript, not as part of the string. See "WORD SUBSCRIPTS" above.)

One exceptional case is quotation marks. Quotation marks may not be defined in the dictionary, and they are simply ignored when they are used in sentences. This is sufficient to handle most uses of quotes; generally, the presence of quotes does not affect the well-formedness of sentences, and it often affects meaning only subtly. However, there are a few constructions, such as the last pair of sentences below, which seem to be correct only when quotes are included.

    "John is leaving," she said
    John is leaving, she said

    I am meeting with my "advisor" today
    I am meeting with my advisor today

    She said, "John is leaving".
    ?She said, John is leaving.

We are unable to control such usages at the moment.

4. COORDINATING CONJUNCTIONS

Coordination constructions do not fit naturally into the framework of link grammars. We have devised a method for automatically transforming the given link grammar into another one that captures the desired phenomena. This system is hard-wired in, and cannot easily be modified by the user. However, it has proven to be effective in handling the vast majority of uses of conjunctions. Our discussion will focus on the word "and", although the ideas apply to the use of "or", "either-or", "neither-nor", "both-and", and "not only - but".

USES OF CONJUNCTIONS. We begin by proposing a simple definition of the use of "and" within the framework of link grammars. Then we'll mention a few problems with the definition, and suggest an improvement. The second definition is the one used in our system. It has drawbacks, but on balance it has proven to be remarkably effective.

Given a sequence S of words containing the word "and", a "well-formed and list" L is a contiguous subsequence of words satisfying these conditions:

1. There exists a way to divide L into components (called "elements" of the well-formed "and" list) such that each element is separated from its neighboring elements by either a comma or the word "and" (or both). (The comma and the "and" are not part of the element.) The last pair of elements must be separated by "and" (or a comma followed by "and"). For example, in "The dog, cat, and mouse ran", "dog", "cat", and "mouse" are the elements of the well-formed "and" list "dog, cat, and mouse".

2. Each of the sequences of words obtained by replacing L (in S) by one of the elements of L is a sentence of the link grammar.
3. There is a way of choosing a linkage for each of these sentences such that the set of links outside of the "and" list elements is exactly the same in all of the sentences, and the connectors joining the sentence with its "and" list element are the same.

In other words, if we cut the links connecting the element to the rest of the sentence, remove that element from the sentence, and replace it by one of the other elements, then the cut links can be connected to the new element so as to create a valid linkage. The sequence S is grammatical if each instance of "and" in it is part of a well-formed "and" list.

For example, consider the sentence "We ate popcorn and watched movies on TV for three days." The phrase "ate popcorn and watched movies on TV" is a well-formed "and" list because it can be divided into the elements "ate popcorn" and "watched movies on TV", which satisfy all of the conditions above. The following two linkages show this. Note that in both cases the element is attached to the rest of the sentence with an "S" to the left and an "MV" to the right.

         +-----MVp-----+----Jp-----+
+-Wd--+Sp+Os-+         |     +-Dmc-+
|     |  |   |         |     |     |
///// we ate popcorn.n for.p three days.n

         +---------MVp----------+----Jp-----+
+--Wd-+Sp+--Op---+---Mp---+Js+  |     +-Dmc-+
|     |  |       |        |  |  |     |     |
///// we watched movies.n on TV for.p three days.n

There is a major problem with this definition. It contains no requirement that the words of an element of an "and" list be connected to each other, nor be related in any way (other than being contiguous). This allows many clearly ungrammatical sentences to be accepted, and generates numerous spurious linkages of correct sentences. For example, it would imply that "I like the beer John and wine Harry drank" is a valid sentence.

We have two techniques to limit the set of sentences deemed grammatical by this rule. The first is to simply restrict the types of connectors that can connect the element of the "and" list to the rest of the sentence. The list of connectors allowed to do this is contained in the list "ANDABLE-CONNECTORS" in the dictionary. (If a connector type is included in this list, this means, in effect, that several of them may be joined to a connector of the opposite type. So, including "S-" on this list allows "John ran and skipped".) The second method is to restrict the definition of a well-formed "and" list. Say that a well-formed "and" list is a "strict and list" if it also satisfies the following condition: each element must be connected to the rest of the sentence through exactly one of its words. (It may use many connectors.) This is the system that we have implemented.

This logic of dealing with conjunctions is reflected in the parser's output. A sentence with conjunctions is outputted showing the sentence split up into several sub-sentences:

                         +--J---+
+---------S----------+-MV+  +-D-+
|                    |   |  |   |
John, Dave, and Fred ran in the park

                         +--J---+
      +------S-------+-MV+  +-D-+
      |              |   |  |   |
John, Dave, and Fred ran in the park

                         +--J---+
                +--S-+-MV+  +-D-+
                |    |   |  |   |
John, Dave, and Fred ran in the park

This implementation of conjunctions is "hard-wired" in, and cannot be easily modified. However, it covers the vast majority of uses of coordinating conjunctions. It allows a wide variety of connector types to be used with "and": nouns (whether as subject, object, prepositional object, etc.), determiners, verbs, adjectives and participles, prepositions, and adverbs.
It also covers nested "and" structures, like "The people and their sons and daughters were there." And as well as handling "and", the system also handles the conjunctions "or", "but", "either-or", "neither-nor", "both-and", and "not only - but".

SOME PROBLEMS AND EXCEPTIONS. There are a few problems and exceptions to be discussed. Some of these are handled by the current system; others are not. One problem is sentences like the following:

    I gave Bob a doll and Mary a gun.
    I know the problem Moscow created and failed to solve.

The former will be rejected since in "I gave Bob a doll", "gave" is linked both to "Bob" and to "doll". Thus, "Bob a doll" cannot be an element of a strict "and" list. In the second sentence, "Moscow" needs to connect to "failed" and "problem" must connect to "solve", so "failed to solve" cannot be an element of a strict "and" list. This phenomenon does occur (although rarely) in formal English, so we would like to solve it. The problem remains in our current system.

SUBSCRIPTS. How should subscripts (on the connector names) be dealt with? When two or more connectors with different subscripts are combined with "and", they may only connect to a connector that may connect to all of them. For example, consider the following dictionary:

    a            Ds+
    the          D+
    those        Dm+
    cats dogs    Dm-
    cat dog      Ds-

Among the determiners above, only "the" can grammatically be allowed to modify the "and" list "cats and dog". This is because the only connector which matches both Dm- and Ds- is D+, not Ds+ or Dm+. This is the solution we implement. There is an exception to be handled here, however. The system we've described so far would accept "the dog and cat runs", while rejecting "the dog and cat run". Both of these judgements are wrong, because in English whenever two singular subjects are "anded" together they become plural. We have incorporated this exception: "Ss+" connectors, when "anded" together, may connect to an "Sp-", but not to an "Ss-".

Another problem arises with embedded clauses. Consider the following linkage of the sentence "I think John and Dave ran".

+-S-+--C--+------S------+
|   |     |             |
I   think John and Dave ran

                   +--S-+
                   |    |
I   think John and Dave ran

This linkage is a combination of the following two sentences: "I think John ran" and "Dave ran". This linkage should clearly be rejected. Intuitively, the problem with this linkage is that the same "S" link (the one between "and" and "ran") is being used to mean something that "I think" ("John ran") and also something that is just a fact ("Dave ran"). We have devised a system for detecting such patterns, using post-processing. As mentioned above, we handle conjunction sentences by expanding the original sentence into several subsentences. We then compute the domain structure of the resulting linkage of each sentence. Finally, the original linkage is deemed incorrect if a pair of links descending from the same link (e.g. the "S" links in the two sentences above) do not have the same domain ancestry (that is, if they are not contained in the same set of domains). The sentence then receives an "and" structure violation.

A few other problems should be mentioned. Sometimes adverbs are used with conjunctions:

    He talked to Steve and, apparently, Fred
    *He talked to, apparently, Fred

As the second sentence shows, it is the conjunction that makes the first adverbial use valid. We have no good way of handling this construction. Secondly, there are some special uses of punctuation with conjunctions.
Sometimes, a comma is inserted both before and after the final element in an "and" list (ex. 1). And sometimes semi-colons may be used instead of commas, particularly when the and-list elements themselves contain commas (ex. 2).

    1. John, and Steve, are coming
    2. John; my advisor, Steve; and several other people are coming

Thus our system still needs some work in the area of conjunctions.

5. POST-PROCESSING

THE LOGIC OF POST-PROCESSING. Besides conjunctions, there are certain phenomena in English which the parser is incapable of dealing with in its basic form. To solve these problems, we developed a post-processing system, based on a concept we call "domains". A domain is a subset of the links that make up a sentence. After a linkage has been found, the post-processing mechanism goes through the linkage and divides the sentence up into domains, based on the kinds of links that are present in the sentence. It then further divides the links into "groups": sets of links which share a particular domain membership. It then applies rules which may declare the linkage invalid based on the combinations of links present in a given group.

THE DOMAIN STRUCTURE. A domain is started by a certain type of link; we call this the "root link". Different types of links start different types of domains. In most cases, the domain of a link consists of all the links in the sentence that can be reached from the right end of the root link, without tracing to the left through the root link itself. For example, assume that the C link in the following sentence begins a domain. This domain will include the Sp link and the I link, but not the Ss and the O.

   +---C---+
+Ss+-O--+  +Sp(e)+I(e)-+
|  |    |  |     |     |
He told me they  would go

(The letters in parentheses indicate that the Sp and I links are in an e-type domain.)

In general, this means that the domain of a link will include all the links to the right of the root link. But not always. In the example below, the C link starts a domain; but this only includes the Ss and O links, not the Xc, CO, or Sp.

+------------CO------------+
+----------Xc-----------+  |
+--C--+Ss(s)-+-O(s)-+   |  +-Sp+
|     |      |      |   |  |   |
After he     saw    us  ,  we  left

In the case below, the domain started by the C link actually extends back to the left of the root link, to contain the B link as well. (This follows naturally from the way domains are defined. The B link can be reached from the right end of the C link, without tracing back through the C.)

+-------Bsw(e)--------+
|   +---I--+          |
|   +SI+   +--C--+S(e)+
|   |  |   |     |    |
Who do you think you  saw

If one wants a certain link type to start a domain, it must be included in the list "domain_starter_links" in the file "post-process.c". It must also be included in the following list ("CDNP CDNP_array"), along with the name of the domain-type that it starts.

As we saw above, it is possible for domains to be traced back to the left of the root link. This tends to be prevented, however, by what we call "restricted links". If a link is restricted, this means that if a domain is being traced through it to the left, it will be traced no further than the restricted link. So, for example, Bsw is a restricted link. This means that, in the sentence above, the "e" domain started by the C will contain the Bsw; but if there are any links which can be traced from the left end of the Bsw (i.e., another link coming out of the word "who"), they will not be included in the domain.

There are a few other complications in the way domains are generated.
1. The root link of a domain may or may not be included in the domain it starts. Root links will not be included unless they are listed in a special list in post-process.c, "domain_contains_links".

2. Some domain-types are "bounded domains". This means that they are not allowed to extend to the left of their root link at all (even with a "restricted link"). If they do extend in this way in a linkage, the linkage will be declared invalid. (See "C" in the Guide to Links for an explanation of how this is used.)

3. As well as ordinary domains, there are two special kinds. One kind is "urfl" domains. As well as including everything that can be reached from the right end of the root link, these domains include everything that can be reached from the left end of the root link, tracing to the right, underneath the root link (but not over it). In the example below, the TOo link starts an "urfl" domain; as well as including the I link, as a normal domain would, this domain also includes the O.

  +---TOo---+
+S+O(x)-+   +I(x)+
| |     |   |    |
I asked him to   go

The final kind of domain is "urfl-only". These include ONLY links that can be reached from the left end of the root link, tracing to the right underneath the root link. In the case below, the S link starts an "urfl-only" domain.

+---------S---------+
+----O(d)-----+     |
|       +D(d)-+     +-O+
|       |     |     |  |
playing the   piano is fun

"Urfl" and "urfl-only" domains are defined by the link-types that start them. Any link that is added to the "urfl_domain_starter_links" list will start an "urfl" domain; any link that is added to the "urfl_only_domain_starter_links" list will start an "urfl-only" domain.

Domains may be nested; a link may therefore be in several domains at once. The domain membership of a given link can be shown in the following way:

+--------------------RW--------------------+
+--W--+-S+-P-+MV-+--C-+S+--C-+S+-P-+       |
|     |  |   |   |    | |    | |   |       |
///// he got mad when I said I was leaving RIGHT-WALL

            /////       RW      <---RW---->    RW     RIGHT-WALL
(m)         /////       Wd      <---Wd---->    Wd     he
(m)         he          Ss      <---Ss---->    S      got
(m)         got         Pa      <---Pa---->    Pa     mad
(m)         mad         MV      <---MVs--->    MVs    when
(m)         when        Cs      <---Cs---->    C      I
(m) (s)     I           Sp*i    <---Sp*i-->    S      said
(m) (s)     said        Ce      <---Ce---->    C      I
(m) (s) (e) I           Sp*i    <---Spii-->    S*i    was
(m) (s) (e) was         Pg      <---Pg---->    Pg     leaving

(In the structure shown here, the domain structure is strictly hierarchical; every domain that is partially inside another is completely inside that domain. There is no a priori reason why domain structure should always be strictly hierarchical; but we believe that, given the current grammar, it always will be strictly hierarchical, at least in sentences that are otherwise judged as acceptable by the parser.)

GROUPS. The domain structure is really a means of creating a more useful kind of structure: the "group". A "group" of links is a set of links that have the same domain membership. In the above example, then, the "Spii" and "Pg" links are part of the same group. The "Ce" and "Sp*i" are in another group. The Ce and Pg are not in the same group. Groups correspond roughly to subject-verb expressions - groups of links that are part of a clause, but not part of any subordinate or dependent clauses within that clause. For example, in the above case, "He got mad" is one subject-verb expression; "I said ..." is another; "I was leaving" is a third.

POST-PROCESSING RULES. The domain structure thus divides a sentence into groups of links. This then allows us to enforce constraints on the link-types that are in a subject-verb expression.
This is useful in cases where there are constraints on the combinations of links that can be present in a clause, but the links may be separated (i.e., they may not all connect to the same word), making the constraints difficult to enforce using link logic. We do this by using certain kinds of rules. One is the "contains_one" rule. This says that if a group contains a link of a certain type, it must contain exactly one link of another type. Another kind of rule is the "contains_none" rule. This says that if a group contains one kind of link, it may not contain any of a certain kind of link. In each case, we have a "triggering" link-type: a link that triggers the rule, and enforces a certain constraint. We also have a "criterion" link-type: a link that defines the constraint (whether it is "must contain X" or "may not contain X"). The triggering link for a rule must be contained in a line of the form

    PP(!contains_one("X", Blah),"Warning!");

where "X" is the triggering link-type, and "Warning!" is a message that will be outputted when the rule is violated. "Blah" is a symbol standing for the criterion link or links for the rule. These are then listed in a line of the form:

    static char * Blah[] = {"Y", "Z", NULL};

where "Y" and "Z" are the criterion links for the rule. A contains_one rule means "the group must contain exactly one link whose type is on this list". A contains_none rule means "the group may not contain any link whose type is on this list".

LINK-TYPE MATCHING IN POST-PROCESSING. Link-type matching in post-processing requires some explanation. In the dictionary, subscripts are used to create sub-categories of connector types. "Ss+" will link with "Ss-" and "S-", but not "Sp-". The character "*" is used as a wild-card; it will match any character. An unsubscripted connector name, like "S+", can thus be regarded as equivalent to "S***...+".

The post-processor also requires a system for matching connector types. While the linkage stage is looking at the connector types in connector expressions, post-processing is looking at the resulting link-types that are formed when a linkage is complete. (If an "Ss+" has linked together with an "S*a-", what the post-processor sees is "Ssa".) It is then comparing them to the link-names listed in "post-process.c" (as domain-starting links, as triggering links for rules, as criterion links for rules, etc.). The link-type matching system used in post-processing is similar to the linkage-level one, but it is a little different. As mentioned above, from the dictionary's point of view, any empty subscript places are considered to be filled with "*"s. Post-processing does the same thing: if it sees an "S" as a link-type, it regards this as "S***...". But, rather than regarding "*" as a wild-card, the post-processor regards it as just another character. From the post-processor's point of view, an "S" in a linkage does _not_ match with an "Ss" in post-process.c; the "S" is regarded as an "S*", and this is distinct from "Ss". Nor does an "S" in post-process.c match with an "Ss" in a linkage. However, there is also a wild-card character in post-processing; this is "#". An "S#" in post-process.c does match with an "Ss" in a linkage, as well as with an "S".

In short: in both the dictionary and post-process.c, an empty subscript place will be regarded as a "*". But in post-process.c, the "*" is _not_ a wild-card character; the "#" is the wild-card. The specific uses of post-processing are fully explained in the "Guide to Links".
For example, see "SF: filler-it"; "SI", and "MV: comparatives".