Word sense disambiguation

In computational linguistics, word sense disambiguation (WSD) is the process of identifying which sense of a word is used in any given sentence, when the word has a number of distinct senses.

For example, consider two examples of the distinct senses that exist for the word bass:

To a human, it is obvious that the first sentence is using the word bass, as in the former sense above and in the second sentence, the word bass is being used as in the latter sense below. Developing algorithms to replicate this human ability can often be a difficult task.

Difficulties

One problem with word sense disambiguation is deciding what the senses are. In cases like the word bass above, at least some senses are obviously different. In other cases, however, the different senses can be closely related (one meaning being a metaphorical or metonymic extension of another), and in such cases division of words into senses becomes much more difficult. Different dictionaries will provide different divisions of words into senses. One solution some researchers have used is to choose a particular dictionary, and just use its set of senses. Generally, however, research results using broad distinctions in senses have been much better than those using narrow, so most researchers ignore the fine-grained distinctions in their work.

Another problem is inter-judge variance. WSD systems are normally tested by having their results on a task compared against those of a human. However, humans do not agree on the task at hand — give a list of senses and sentences, and humans will not always agree on which word belongs in which sense. A computer cannot be expected to give better performance on such a task than a human (indeed, since the human serves as the standard, the computer being better than the human is incoherent), so the human performance serves as an upper bound. Human performance, however, is much better on coarse-grained than fine-grained distinctions, so this again is why research on coarse-grained distinctions is most useful.

Approaches

As in all natural language processing, there are two main approaches to WSD — deep approaches and shallow approaches.

Deep approaches presume access to a comprehensive body of world knowledge. Knowledge, such as "you can go fishing for a type of fish, but not for low frequency sounds" and "songs have low frequency sounds as parts, but not types of fish", is then used to determine in which sense the word is used. These approaches are not very successful in practice, mainly because such a body of knowledge does not exist in a computer-readable format, outside of very limited domains. However, if such knowledge did exist, then deep approaches would be much more accurate than the shallow approaches. Also, there is a long tradition in computational linguistics, of trying such approaches in terms of coded knowledge and in some cases, it is hard to say clearly whether the knowledge involved is linguistic or world knowledge. The first attempt was that by Margaret Masterman and her colleagues, at the Cambridge Language Research Unit in England, in the 1950s. This attempt used as data a punched-card version of Roget's Thesaurus and its numbered "heads", as an indicator of topics and looked for repetitions in text, using a set intersection algorithm. It was not very successful, as is described in some detail in (Wilks, Y. et al., 1996), but had strong relationships to later work, especially Yarowsky's machine learning optimisation of a thesaurus method in the 1990s.

Shallow approaches don't try to understand the text. They just consider the surrounding words, using information such as "if bass has words sea or fishing nearby, it probably is in the fish sense; if bass has the words music or song nearby, it is probably in the music sense." These rules can be automatically derived by the computer, using a training corpus of words tagged with their word senses. This approach, while theoretically not as powerful as deep approaches, gives superior results in practice, due to the computer's limited world knowledge. Though, it can be confused by sentences, like The dogs bark at the tree, which contains the word bark near both tree and dogs.

These approaches normally work by defining a window of N content words around each word to be disambiguated in the corpus, and statistically analyzing those N surrounding words. Two shallow approaches used to train and then disambiguate are Naive Bayes classifiers and decision trees. In recent research, kernel based methods such as support vector machines have shown superior performance in supervised learning. But over the last few years, there hasn't been any major improvement in performance of any of these methods.

It is instructive to compare the word sense disambiguation problem with the problem of part-of-speech tagging. Both involve disambiguating or tagging with words, be it with senses or parts of speech. However, algorithms used for one do not tend to work well for the other, mainly because the part of speech of a word is primarily determined by the immediately adjacent one to three words, whereas the sense of a word may be determined by words further away. The success rate for part-of-speech tagging algorithms is at present much higher than that for WSD, state-of-the art being around 95% accuracy or better, as compared to less than 75% accuracy in word sense disambiguation with supervised learning. These figures are typical for English, and may be very different from those for other languages.

Another aspect of word sense disambiguation that differentiates it from part-of-speech tagging is the availability of training data. While it is relatively easy to assign parts of speech to text, training people to tag senses is far more difficult ^[1]. While users can memorize all of the possible parts of speech a word can take, it is impossible for individuals to memorize all of the senses a word can take. Thus, many word sense disambiguation algorithms use semi-supervised learning, which allows both labeled and unlabeled data. The Yarowsky algorithm was an early example of such an algorithm.

Yarowsky’s unsupervised algorithm uses the ‘One sense per collocation’ and the ‘One sense per discourse’ properties of human languages for word sense disambiguation. From observation, words tend to exhibit only one sense in most given discourse and in a given collocation. The corpus is initially untagged.

The algorithm starts with a large corpus, in which it identifies examples of the given polysemous word, and stores all the relevant sentences as lines. For instance, Yarowsky uses the word ‘plant’ in his 1995 paper to demonstrate the algorithm. Assume that there are two possible senses of the word, the next step is to identify a small number of seed collocations representative of each sense, give each sense a label, i.e. sense A and B, then assign the appropriate label to all training examples containing the seed collocations. In this case, the words ‘life’ and ‘manufacturing’ are chosen as initial seed collocations for sense A and B respectively. The residual examples (85% - 98% according to Yarowsky) remain untagged.

The algorithm should initially choose seed collocations representative that will distinguish sense A and B accurately and productively. This can be done by selecting seed words from a dictionary’s entry for that sense. The collocations tend to have stronger effect if they are adjacent to the target word, the effect weakens with distance. According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word will be selected. The effect is much stronger for words in a predicate-argument relationship than for arbitrary associations at the same distance to the target word, and is much stronger for collocations with content words than with function words. Having said this, a collocation word can have several collocational relationships with the target word throughout the corpus. This could give the word different rankings or even different classifications. Alternatively, it can be done by identifying a single defining collocate for each class, and using for seeds only those contexts containing one of these defining words. A publicly available database called WordNet can be used as an automatic source for such defining terms. In addition, words that occur near the target word in great frequency can be selected as seed collocations representative. This approach is not fully automatic, a human judge must decide which word will be selected for each target word’s sense, the outputs will be reliable indicators of the senses.

A decision-list algorithm is then used to identify other reliable collocations. This training algorithm calculates the probability P(Sense | Collocation), and the decision list is ranked by the log-likelihood ratio:

A smoothing algorithm will then be used to avoid 0 values. The decision-list algorithm resolves many problems in a large set of non-independent evidence source by using only the most reliable piece of evidence rather than the whole matching collocation set.

The new resulting classifier will then be applied to the whole sample set. Add those examples in the residual that are tagged as A or B with probability above a reasonable threshold to the seed sets. Apply the decision-list algorithm and the above adding step iteratively. As more newly-learned collocations are added to the seed sets, the sense A or sense B set will grow, and the original residual will shrink. However, these collocations stay in the seed sets only if their probability of classification remains above the threshold, otherwise they are returned to the residual for later classification. At the end of each iteration, the ‘One sense per discourse’ property can be used to help preventing initially mistagged collocates and hence improving the purity of the seed sets.

In order to avoid strong collocates becoming indicators for the wrong class, the class-inclusion threshold needs to be randomly altered. For the same purpose, after intermediate convergence the algorithm will also need to increase the width of the context window.

The algorithm will continue to iterate until no more reliable collocations are found. The ‘One sense per discourse’ property can be used here for error correction. For a target word that has a binary sense partition, if the occurrences of the majority sense A exceed that of the minor sense B by a certain threshold, the minority ones will be relabeled as A. According to Yarowsky, for any sense to be clearly dominant, the occurrences of the target word should not be less than 4.

When the algorithm converges on a stable residual set, a final decision list of the target word is obtained. The most reliable collocations are at the top of the new list instead of the original seed words. The original untagged corpus is then tagged with sense labels and probabilities. The final decision list may now be applied to new data, the collocation with the highest rank in the list is used to classify the new data. For example, if the highest ranking collocation of the target word in the new data set is of sense A, then the target word is classified as sense A.

Word sense disambiguation

Contents

Difficulties

Approaches

See also

Notes

References

External links