Note: This page is historical.

Current pages about Yenta are here. Please look at those pages first.

Yenta is still under active development, but this particular page is not. If you're interested in current research papers about Yenta, or obtaining a copy of Yenta, please start here instead.

This page is one of many that were written in late 1994 and early 1995, and are being preserved here for historical purposes. If you're viewing this page, you probably found it via an old link or are interested in the history of how Yenta came to be. These pages have not been actively maintained since 1995, so you'll find all sorts of older descriptions which may not match the current system, citations to old papers and old results, and so forth.

Vector-space clustering

Vector-space clustering, exemplified by packages such as SMART, represents each chunk of text as a point in a multidimensional space with one dimension per word. Each word in the chunk is weighted by its "informativeness", so that words which help to differentiate one chunk from another (e.g., relatively rare words) count proportionately more than very common words, which appear in most documents.
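As a rough illustration of this kind of weighting, here is a minimal sketch in Python. It assumes a simple term-frequency times inverse-document-frequency (tf-idf) scheme and whitespace tokenization; this page does not say which weighting is actually used, and the function name `weight_chunks` and the sample chunks are hypothetical.

```python
import math
from collections import Counter

def weight_chunks(chunks):
    """Turn each text chunk into a sparse {word: weight} vector.

    Assumption: weight = term frequency * log(N / document frequency),
    one common way to express "informativeness"; other schemes exist.
    """
    tokenized = [chunk.lower().split() for chunk in chunks]
    n = len(tokenized)
    # Document frequency: the number of chunks in which each word appears.
    df = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        tf = Counter(words)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

# Hypothetical sample chunks, for illustration only.
chunks = ["the cat sat", "the dog sat", "the rare aardvark"]
vectors = weight_chunks(chunks)
```

In this sketch a word that appears in every chunk ("the") gets weight zero, while a word unique to one chunk ("cat") gets the largest weight, which is the behavior described above.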

Given this paradigm, we can take the Euclidean distance between the points representing two different chunks of text as a measure of their "similarity": the smaller the distance, the more similar the chunks.
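The distance computation itself can be sketched in a few lines of Python. This assumes each chunk has already been turned into a sparse dictionary mapping words to weights, with absent words treated as weight zero; the function name `euclidean_distance` and the sample vectors are hypothetical.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two sparse {word: weight} vectors.

    Words missing from a vector are treated as having weight 0.
    """
    words = set(u) | set(v)
    return math.sqrt(sum((u.get(w, 0.0) - v.get(w, 0.0)) ** 2 for w in words))

# Hypothetical weighted chunks, for illustration only.
a = {"cat": 1.0, "sat": 0.5}
b = {"dog": 1.0, "sat": 0.5}
c = {"cat": 1.0, "sat": 0.5}

d_ab = euclidean_distance(a, b)  # differ on "cat" and "dog"
d_ac = euclidean_distance(a, c)  # identical vectors
```

Identical chunks are at distance zero, and the distance grows as the chunks share fewer heavily weighted words.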

Lenny Foner

Last modified: Tue Dec 13 22:30:38 1994