Note: This page is historical.

Current pages about Yenta are here. Please look at those pages first.

Yenta is still under active development, but this particular page is not. If you're interested in current research papers about Yenta, or obtaining a copy of Yenta, please start here instead.

This page is one of many that were written in late 1994 and early 1995, and are being preserved here for historical purposes. If you're viewing this page, you probably found it via an old link or are interested in the history of how Yenta came to be. These pages have not been actively maintained since 1995, so you'll find all sorts of older descriptions which may not match the current system, citations to old papers and old results, and so forth.

Semantic-net clustering

Semantic-net clustering, exemplified by packages such as WordNet, embeds individual words in a semantic net: given any particular word, there are pointers to related words (e.g., synonyms, antonyms, hypernyms [a hypernym of "puppy" is "dog"], and so forth).

In most cases, it is also necessary to know each word's part of speech (for example, is the word being used as a noun or a verb?), since this helps constrain the otherwise enormous growth of the semantic net around any individual word. The Xerox part-of-speech tagger is an example of a package that supplies this information.

Given the local semantic net around any individual word, one can compare two words by attempting to traverse semantic links from one word to the other. Since there may be multiple paths between two words, the number of links in the shortest such path is taken as the measure of similarity: words joined by a very short path (e.g., one link) are considered similar.
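
As a rough illustration of the shortest-path measure, here is a minimal Python sketch using breadth-first search. The tiny net `TOY_NET` and the function name `link_distance` are hypothetical, invented for this example; a real semantic net such as WordNet is far larger, distinguishes link types, and would be restricted by part of speech, none of which is modelled here.

```python
from collections import deque

# Hypothetical toy semantic net (an assumption for illustration only):
# each word maps to the words it is directly linked to, in both directions.
TOY_NET = {
    "puppy": ["dog"],
    "dog": ["puppy", "animal", "hound"],
    "hound": ["dog"],
    "kitten": ["cat"],
    "cat": ["kitten", "animal"],
    "animal": ["dog", "cat"],
    "rock": [],
}


def link_distance(net, source, target):
    """Return the number of links on the shortest path, or None if no path exists."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, dist = queue.popleft()
        for neighbor in net.get(word, ()):
            if neighbor == target:
                # Breadth-first search reaches the target by a shortest path first.
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


print(link_distance(TOY_NET, "puppy", "dog"))     # one link: very similar
print(link_distance(TOY_NET, "puppy", "kitten"))  # a longer path: less similar
print(link_distance(TOY_NET, "puppy", "rock"))    # no path at all
```

Breadth-first search is used because every link counts equally, so the first time the target is reached the path is guaranteed to be a shortest one.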

To compare two chunks of text, then, one compares the semantic distances between pairs of words across the two chunks, computing an overall distance. (The details of this comparison are not yet specified.)
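
Since the details of the comparison are left unspecified, the following Python sketch shows only one possible way to combine word-to-word distances into an overall distance, and should not be read as the method Yenta uses. Everything in it is an assumption made for illustration: the toy net `TOY_NET`, the function names, the choice to match each word with its nearest word in the other chunk, the averaging, and the penalty value `unreachable` for words with no connecting path.

```python
from collections import deque

# Hypothetical toy semantic net (an assumption for illustration only).
TOY_NET = {
    "puppy": ["dog"],
    "dog": ["puppy", "animal", "hound"],
    "hound": ["dog"],
    "kitten": ["cat"],
    "cat": ["kitten", "animal"],
    "animal": ["dog", "cat"],
    "rock": [],
}


def link_distance(net, source, target):
    """Shortest number of links between two words, or None if unconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, dist = queue.popleft()
        for neighbor in net.get(word, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def directed_distance(net, chunk_a, chunk_b, unreachable):
    """Average, over words in chunk_a, of the distance to the nearest word in chunk_b."""
    total = 0
    for a in chunk_a:
        distances = (link_distance(net, a, b) for b in chunk_b)
        # Assumed policy: an unconnected pair counts as a fixed large distance.
        total += min(unreachable if d is None else d for d in distances)
    return total / len(chunk_a)


def chunk_distance(net, chunk_a, chunk_b, unreachable=10):
    """Symmetric overall distance: the mean of the two directed distances."""
    if not chunk_a or not chunk_b:
        raise ValueError("both chunks must contain at least one word")
    forward = directed_distance(net, chunk_a, chunk_b, unreachable)
    backward = directed_distance(net, chunk_b, chunk_a, unreachable)
    return (forward + backward) / 2


print(chunk_distance(TOY_NET, ["puppy", "cat"], ["dog", "kitten"]))
print(chunk_distance(TOY_NET, ["rock"], ["dog"]))
```

Averaging nearest-word distances in both directions keeps the result symmetric, so comparing chunk A with chunk B gives the same value as comparing B with A. Other aggregations (a sum, a maximum, or a weighting by word frequency) would fit the description above equally well.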


Lenny Foner
Last modified: Sun Dec 11 16:48:57 1994