COMPASS, for "Comparativist's Assistant," is an algorithm that was developed by Donald Frantz (1970). The algorithm is based on the comparative method that linguists have long used for determining genetic relationship between languages and for reconstructing the proto-language from which related languages descended. The comparative method is based on the observation that the sounds of a language change over time in systematic and regular ways. COMPASS measures the frequency with which proposed sound correspondences recur as a means of evaluating the likelihood that forms entered as cognate in the word list database are in fact historically cognate.
Cognate forms are defined as those forms among related dialects or languages that have descended from the same parent form. The exact determination of cognates is possible only after an application of the comparative method. In the comparative method, forms thought to be cognate are compared to discover the regular ways in which the sounds of the languages involved correspond. If all the sound correspondences in a pair of forms can be shown to be regular, then they can safely be said to be cognate. The comparative method usually goes a step further to reconstruct the sound system of the parent language (by examining all the correspondence sets and their conditioning environments, if any) and its vocabulary (by noting the correspondences which occur in cognate sets).
The COMPASS algorithm is only an approximation of this process, in that it neither takes into account the environment of each sound nor attempts to derive reconstructions. The central assumption in the algorithm is that the regularity of a sound change is shown through its frequency of occurrence in the data.
For example, consider a hypothetical language, Proto-AB, which at some time split to form two daughter languages, A and B. Furthermore, suppose that the parent language contained the phone b, which was retained as b in language A but which became both b and p in language B as the result of a sound change: in language B, every word-initial b became voiceless.
Now consider that the two daughter languages contain the following cognate pairs that demonstrate the above process:
Note that in these data, the b/p correspondence occurs twice and the b/b correspondence occurs twice.
If the word lists from these two languages had several hundred cognate pairs, these two correspondences would occur many times. The comparative method works on the assumption that the more often a correspondence occurs in the data, the more likely it is to represent a regular sound change. When a correspondence occurs only a few times, it could still be a regular one with a very low frequency of occurrence. (This is a valid conclusion when there is not a more frequent correspondence involving one of the same phones in the same environment.) More likely, however, a correspondence that occurs only a few times does not reflect a regular sound change at all. It could be due to incorrectly transcribed data or to a chance irregular change, but the most usual explanation is a chance similarity of unrelated forms or the borrowing of a form from one language into the other. In either of these last two cases, the forms are not in fact cognate.
Returning to the example, suppose that an examination of larger word lists reveals that the b/p correspondence occurs 46 times, b/b occurs 72 times, and b/d occurs only once. One would conclude that the first two correspondences are regular, but that the third is questionable. Unless the pair of words exhibiting the b/d correspondence also demonstrates other correspondences that could not be due to chance or direct borrowing, that pair of words should no longer be considered cognate.
Note that since COMPASS does not take context into consideration in its frequency counts, low frequency correspondences may be hidden among the more frequent ones. For instance, in the above example the b/p correspondence is valid only in the word-initial position. If a b/p correspondence were to occur word-medially, the computer would not notice, and this highly suspect correspondence (along with its supporting cognate sets) would be lumped together with the valid ones.
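The tallying described above, and the way a context-free count can hide a suspect correspondence, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WORDSURV implementation: the function name tally_correspondences, the by_position option, and the three word pairs are all invented for this example, and each pair is assumed to be already aligned with one character per phone.

```python
from collections import Counter

def tally_correspondences(pairs, by_position=False):
    """Count sound correspondences across aligned word pairs.

    Assumption: both forms in a pair are pre-aligned and have the same
    length, with one character standing for one phone.
    """
    counts = Counter()
    for form_a, form_b in pairs:
        if len(form_a) != len(form_b):
            raise ValueError(f"forms not aligned: {form_a!r} / {form_b!r}")
        for i, (a, b) in enumerate(zip(form_a, form_b)):
            if by_position:
                # Hypothetical refinement: keep word-initial counts separate.
                where = "initial" if i == 0 else "non-initial"
                counts[(a, b, where)] += 1
            else:
                # Context-free count, as the text describes for COMPASS.
                counts[(a, b)] += 1
    return counts

# Invented pairs for illustration only (language A form, language B form).
pairs = [("bab", "pab"), ("bib", "pib"), ("abu", "apu")]

plain = tally_correspondences(pairs)
positional = tally_correspondences(pairs, by_position=True)

print(plain[("b", "p")])                      # all b/p occurrences together
print(positional[("b", "p", "initial")])      # word-initial b/p only
print(positional[("b", "p", "non-initial")])  # the suspect word-medial b/p
```

In the context-free tally the word-medial b/p in the third pair is simply added to the word-initial ones, which is the masking effect the paragraph above warns about.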
The COMPASS algorithm is a four-step process. The first step is what has just been described: the generation of a table of phoneme correspondences. For example, consider two word lists for languages L1 and L2. Given the following pair of cognate words, COMPASS would go through both, character by character, and compile the following tally of correspondences:
The second step of COMPASS is to assign a strength index to each of these correspondences. The resulting values are shown in the third column of figure 5.3. The strength index is a number ranging from +1.0 to -1.0 which represents the likelihood that the correspondence is the result of a regular sound change. A value of +1.0 represents maximum confidence that it is regular; a value of -1.0 represents maximum confidence that it is not. Values between the two extremes represent intermediate degrees of likelihood. The strength index is computed from the number of times the correspondence occurs by means of the following formula:
The upper threshold is the number of occurrences beyond which we are quite certain that the correspondence is regular. The bottom threshold is the number of occurrences below which we are quite certain that it is not. The lower threshold is the number of occurrences at which we think it quite unlikely that the correspondence is regular, but we are not so sure that we want to impose the full maximum negative penalty. The default settings for these three threshold values are:
With these settings in effect, a correspondence with 15 or more occurrences scores a maximum strength of 1. A correspondence with only 1 occurrence scores a maximum negative strength of -1. A correspondence with 2 occurrences scores a medium negative strength of -0.5. Correspondences with between 3 and 14 occurrences score a positive strength between 0 and 1 which grows proportionately with the number of occurrences. The strength indices in figure 5.3 were computed with the default threshold values.
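The behavior just described can be sketched as a small piecewise function. This is a hedged reconstruction from the prose alone, not the published formula: the default values (upper 15, lower 2, bottom 1) are inferred from the scores quoted above, the linear interpolation between the lower and upper thresholds is an assumption standing in for "grows proportionately," and the name strength_index is invented for this example.

```python
def strength_index(count, upper=15, lower=2, bottom=1):
    """Map an occurrence count to a strength between -1.0 and +1.0.

    Assumptions (inferred from the prose, not taken from the formula):
      * count >= upper   -> +1.0 (maximum confidence that it is regular)
      * count <= bottom  -> -1.0 (maximum confidence that it is not)
      * count <= lower   -> -0.5 (medium negative strength)
      * otherwise        -> linear growth from 0 toward 1
    """
    if count >= upper:
        return 1.0
    if count <= bottom:
        return -1.0
    if count <= lower:
        return -0.5
    return (count - lower) / (upper - lower)

for n in (1, 2, 3, 14, 15):
    print(n, round(strength_index(n), 3))
```

Under these assumed defaults, the counts from the earlier b/p, b/b, and b/d example (46, 72, and 1 occurrences) would score 1.0, 1.0, and -1.0 respectively.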
Besides allowing the user to set the threshold values, WORDSURV has a feature by which threshold values may be automatically adjusted for the number of word pairs compared. As the amount of data increases or decreases, it is reasonable to expect that threshold values should vary accordingly. The default threshold values are based on a list of 100 cognates (Frantz 1970). The automatic adjustment is a logarithmic proportion using base 10 logs. Thus, if the amount of data increases tenfold, the threshold values are doubled. If 200 cognates are compared, the thresholds are multiplied by 1.3; for 50, they are multiplied by 0.7. Rather than depending on WORDSURV to calculate adjusted thresholds, you can set customized threshold values for your situation directly. The commands for setting thresholds are described in section 5.5.3.
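The three multipliers quoted above (2 for ten times the data, 1.3 for 200 cognates, 0.7 for 50) are all consistent with a factor of 1 + log10(n / 100). The sketch below assumes that form; the function name threshold_multiplier is invented here, and whether WORDSURV rounds the adjusted thresholds, or how it handles very small lists where the factor approaches zero, is not stated in the text.

```python
import math

def threshold_multiplier(n_pairs, base_size=100):
    """Scaling factor for the thresholds, assuming 1 + log10(n / base_size).

    base_size=100 reflects the list of 100 cognates on which the default
    thresholds are said to be based.
    """
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    return 1 + math.log10(n_pairs / base_size)

for n in (50, 100, 200, 1000):
    print(n, round(threshold_multiplier(n), 1))
```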
The third step of COMPASS is to calculate the average correspondence strength for each word pair. This is done by adding the strength index for each correspondence in a given word pair, and then dividing by the number of correspondences in that pair. Like the strength index for correspondences, this value ranges from +1.0 to -1.0. When the average strength for a word pair is positive and high, we can continue to assume they are cognate with justified confidence. When the average strength is negative, there is not enough evidence to justify the claim that the words are cognate; the database should be updated to put the words in different cognate sets. Proposed cognates with low positive strength are borderline cases; further recurrence in more data or the discovery of a regular conditioning environment is needed to determine whether the correspondences involved are truly regular.
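The averaging in this third step can be sketched as follows. The function name pair_strength, the strength table, and the word pairs are invented for illustration; the forms are assumed to be aligned one character per phone, and every correspondence in a pair is assumed to have a strength already assigned.

```python
def pair_strength(form_a, form_b, strengths):
    """Average the strength indices of all correspondences in one word pair."""
    if len(form_a) != len(form_b) or not form_a:
        raise ValueError("forms must be non-empty and aligned")
    # Look up the strength of each aligned correspondence, then take the mean.
    values = [strengths[(a, b)] for a, b in zip(form_a, form_b)]
    return sum(values) / len(values)

# Invented strength table for illustration only.
strengths = {("b", "p"): 1.0, ("a", "a"): 1.0, ("b", "d"): -1.0}

print(pair_strength("ba", "pa", strengths))  # both correspondences strong
print(pair_strength("ba", "da", strengths))  # one strong, one rejected
```

In the second pair, a single maximally negative correspondence pulls an otherwise well-supported pair down to the borderline.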
It should be clear that when COMPASS is used, there is no real danger of being too liberal with initial cognate set assignments as the data are entered. In fact, when in doubt, the analyst should group forms into the same cognate set, since COMPASS rejects the decision if it is unwarranted. On the other hand, if questionable forms are separated as noncognate, COMPASS will never consider them, thus missing the chance to see if there is adequate recurrence of sound correspondences to call them cognate.
The fourth step of COMPASS is to produce a summary table showing the number of word pairs within given ranges of strength. An example is shown in figure 5.5.
This table shows that where the analyst considered 54 word pairs to be cognate, one has negative strength and is clearly considered questionable by COMPASS. Others appear to be borderline.
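A summary of this kind amounts to a histogram of the word-pair strengths. The sketch below is an illustration only: the range boundaries in EDGES, the function name summarize, and the sample strengths are assumptions made for this example and are not taken from the figure.

```python
from bisect import bisect_right

# Hypothetical range boundaries; the ranges WORDSURV displays may differ.
EDGES = [-1.0, 0.0, 0.25, 0.5, 0.75, 1.0]

def summarize(pair_strengths, edges=EDGES):
    """Count word pairs falling within each strength range."""
    counts = [0] * (len(edges) - 1)
    for s in pair_strengths:
        # A value equal to an edge goes into the range that starts there;
        # clamp so that -1.0 and +1.0 land in the first and last ranges.
        i = bisect_right(edges, s) - 1
        i = max(0, min(i, len(counts) - 1))
        counts[i] += 1
    return list(zip(zip(edges, edges[1:]), counts))

# Invented strengths for six word pairs.
sample = [-0.3, 0.1, 0.4, 0.8, 1.0, 1.0]
for (low, high), n in summarize(sample):
    print(f"{low:+.2f} to {high:+.2f}: {n}")
```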
The computed strength values depend very much on the threshold values. If the values in the summary display are skewed too much toward one end of the scale or the other, then the thresholds should be changed to counteract the skewing. The user will probably want to experiment with these values.
Wimbish, John S. 1989. WORDSURV: A Program for Analyzing Language Survey Word Lists, pages 67-74. Occasional Publications in Academic Computing, number 13. Dallas, TX: Summer Institute of Linguistics.