CHAPTER 4 The Sample Application: Yenta 4.1 Introduction This chapter describes the sample application, named Yenta, that has been developed as a part of this research. The prior chapters are essential background for this discus- sion. Chapter 5 will evaluate the architecture and this sample application. In this chapter, we shall describe: Section 4.2 o The purpose of the application -- what problem does Yenta solve? Section 4.3 o Some sample scenarios -- why might Yenta be useful? Section 4.4 o Yenta's affordances -- what can users do with it? Section 4.5 o Political considerations -- why this application in particular? We shall then turn our attention to details of Yenta's implementation, and address: Section 4.6 o Yenta's implementation languages and internal organization Section 4.7 o How Yenta determines its user's interests Section 4.8 o How Yenta's security works 4.2 Yenta's purpose Yenta has two primary purposes Matchmaking o To serve as a distributed matchmaking system that can introduce users to each oth- er, or form coalitions and discussion groups into which users may send messages to groups of others who share their interests (see Section 4.4). Getting the word out o To raise public awareness for the political ideas about trustworthiness and protec- tion of personal privacy advanced elsewhere in this thesis (see Chapter 1 and Section 4.5). 4.3 Sample scenarios Before we examine exactly what Yenta can do, let us consider some sample scenarios. You write software, and you've having trouble with a particular tool. Somebody else just down the hall is using the same tool as part of what they're doing. But even though both of you talk every day, neither of you knows this -- after all, this tool is just a little part of your job, and you don't tell everybody you meet about every single thing you do all day. Yenta can tell you about this shared interest. You're a technical recruiter. You'd like to find companies looking for people to hire, and people who are looking to be hired for your existing clients. They need privacy and anonymity, so the people they're working for now don't know they're looking. You need to be able to show them a good reputation, backed up by satis- fied clients. Yenta is private and secure, and has a reputation system. Everybody's happy. You're a doctor doing some research on an rare condition. Another doctor is doing the same sorts of research, but you don't know about each other. Maybe you're an academic, but you don't have enough to publish yet. Or perhaps you're a clinician, and don't realize that you're looking at a small part of a much bigger public-health problem. Yenta can help bring the two of you together, along with others who are studying the same problem. You have an unusual interest, but you can't find anyone else who seems to share it. Maybe it's something embarrassing, that most people don't want to talk about publicly, so doing a web search hasn't turned up much. Yenta can help find others who share the interest, even if they don't publish about it. And it can keep the interest private, to only those who trust each other. What do these scenarios all have in common? Users who may or may not know each other, but who do not know that they share an interest in something. Also, some of them depend on the existence of the reputation system, or upon the pseudonymous nature of how Yenta users are identified to each other. 4.4 Affordances Let us now turn to Yenta's affordances, meaning exactly what functionality is made available to its users. This description is mostly from a user's standpoint -- here, we describe more about what the user finds available in Yenta's set of possible actions, and less about how Yenta manages to do them. 4.4.1 User interface Yenta communicates with its user by sending HTML to a particular network port, and instructing its user to connect to that port with a web browser. With the exception of the very first message from Yenta, in which it tells the user what URL to use, Yenta uses the user's web browser exclusively for its interactions. This has several major advantages: Portability o Supporting a graphical user interface is a tremendous amount of work, and is gen- erally extremely non-portable across different types of computers. HTML, howev- er, is extremely portable, provided that a lowest-common-denominator subset -- es- sentially, that which has been approved by various standards bodies -- is used. There are web browsers available for virtually all general-purpose computers in the world. Familiarity o By using HTML, Yenta can present its interface using a paradigm already well- known by millions of potential users. Configurability o If the user disagrees with some aspects of the UI -- in issues such as font size, screen background, and so forth -- it is generally possible to use the browser to change these, without having to support it directly in Yenta. Security o Yenta requires a high-security path from the user to Yenta itself, so that the user may type his or her passphrase without inordinate chance of it being eavesdropped. Common browsers support high-strength (128-bit session key) SSL connections, and Yenta uses cryptography exclusively when communicating with its user. We will have more to say about this in Section 4.8. 4.4.2 Yenta runs forever Once Yenta has been started, it effectively runs forever. It disconnects from the con- trolling shell, and becomes a background process. While the user can manually shut down Yenta from its user interface, this is discouraged, since it prevents other Yentas from communicating with the user's Yenta when the user is not attending it. This, in turn, means that Yenta will not perform as well as it could -- it will miss opportunities for clustering and for passing or receiving messages. (Future versions of Yenta may not require permanent network connections, and will be more suitable from intermit- tently-connected machines, such as the dial-up connections employed by most home computer users.) Checkpointing Yenta checkpoints its state to disk periodically, and when it is shut down. This means that a machine crash can only lose a small amount of data; how often these snapshots occur, and thus the maximum amount of unsaved state that might exist, is config- urable. For details about how this data is saved, and what precautions are taken to ensure both robustness and privacy, see Section 4.8. 4.4.3 Handles Users of Yenta are identified by two types of names. The Yenta-ID was described in Section 3.4.1, and is essentially a 160-bit random number. This number is the funda- mental way in which Yentas identify themselves to each other, and is both unspoofa- ble and unique, as described previously. YID's are precise, but cumbersome A Yenta-ID is an unfriendly way to name entities which people must interact with -- people are notoriously bad at remembering random 160-bit strings; they are very dif- ficult to type; and they are more unique than is required almost all of the time. Fur- thermore, users generally prefer some degree of personalization of their online identities, and being able to choose their own name is a fundamental aspect of this. Handles are nicknames chosen by users Hence, Yenta also makes available a handle, which each user may set as he or she pleases. Handles are not guaranteed to be unique across any particular set of Yentas -- indeed, since it is assumed that no Yenta knows of all other Yentas in the world, this seems impossible on its face. Handles provide a convenient shorthand when a user must refer to a particular other Yenta, and offer some degree of a chosen identity. Local nicknames for others Because handles are not guaranteed unique, users may also examine the Yenta-ID for a particular Yenta they communicate with, to avoid ambiguity. In addition, Yenta sup- ports the ability to make local nicknames for any other Yenta's handle. This means that, if the user Sally is talking to some other user whose handle is Joe, but finds that she does not want to use that handle -- either because she already knows two Joes, or because she simply doesn't like the name -- she may instruct her Yenta to refer to the Yenta known elsewhere as Joe by some other name, such as Fred. This causes no con- fusion to other Yentas, which only refer to each other by YID anyway, and is invisible to everyone but Sally, unless she happens to mention her local, private nickname for Joe to anyone else. 4.4.4 Determining user interests When Yenta first starts up, and periodically afterwards, it determines what the user is actually interested in. Without this determination, Yenta is useless -- it would have no basis for which clusters to join, what introductions to make, and so forth. Documents Yenta uses a collection of documents to determine what a user is interested in. A sin- gle document is generally either a single file -- if the file consists of plain text -- or a single email message -- if the file consists of several email messages grouped into a single file, as is popular with many mail-handling tools. Yenta can automatically determine, by analyzing the contents of the file, what sort of file it is, and whether it consists of a single document or several. The internal representation of a document is described in Section 4.7. Scanning a tree When Yenta starts up for the very first time, it asks the user for the root of a file tree. It then walks every file in that tree, rejecting those that appear to be binary files, and also rejecting portions of those files that appear uninteresting -- email signature lines, the stereotyped wording of header fields in email, HTML tags, PGP signatures, and so forth. It then clusters the resulting documents, as described in Section 4.7. Single files Once this initial clustering has taken place, users also have the option of directing Yenta's attention to a particular single file. This single file can be used to express a particular interest, perhaps obtained by the user importing a single document from elsewhere, and telling Yenta to give it disproportionate weight. In part, this makes it somewhat easier for users to express an interest in a particular subject so that they may find a group of experts. Rescanning periodically In addition, Yenta can be told to periodically resurvey the files it has already scanned. This allows it to pick up new interests as files are modified -- files of email are typical for this. Interests from documents (entire files, or particular email messages) which are older than a user-settable threshold can be dropped, so that if the user loses inter- est in a topic, Yenta will stop trying to cluster based on it. Giving Yenta feedback Once Yenta has determined the user's interests, and at any time afterward that the user chooses, the user can survey the listing of accumulated interests and tell Yenta which ones are actually useful, and which ones are not. This has two important advantages. First, interests which were incorrectly determined by Yenta -- such as a set of docu- ments which contain some stereotyped text in each one, and which hence were clus- tered together -- can be rejected. Second, not everything Yenta might find is equally important to the user. A common example is that of meetings: Most users working in white-collar environments in which meetings are scheduled by email will end up with a cluster containing words from messages such as room, date, time, schedule, meet- ing, and so on. For most people, just because someone else has meetings -- on any topic -- is no reason to suggest an introduction. An example from Yenta's user interface of a set of interests is presented in Figure 6. 4.4.5 Messaging Once Yenta has determined its user's interests, it engages in the clustering algorithm described in Chapter 2 to find other Yentas which share one or more of its user's inter- ests. As soon as Yenta finds itself some clusters of others, it allows the user to send messages. These messages may be of two types: o One-to-one. In this case, the user sends a message to a single other Yenta, which receives it and (usually) displays it to its user. o One-to-cluster. In this case, the user sends a message to all the other Yentas in one of the clusters of which this Yenta is a member. Examples from the Yenta UI may be found in Figure 7, Figure 8, and Figure 9. The implementation of how message-passing works is described in Chapter 2. Grouping and filtering Yenta users may group their messages by who has sent them, when they arrived, and so forth; the functionality resembles that of a typical mail-reading program. In addi- tion, they may establish filters, which control which messages from other Yentas will be shown. These filters can screen out messages which do (or do not) contain certain regular expressions in their contents. In addition, rules can be written which use the attestation system (see below) to determine whether or not to present a message based on the reputation of its sender. This allows users to avoid seeing spam without ever seeing even the very first message from the sender -- by instructing Yenta to ignore messages which do not meet some reputation criteria, spammers who fail to acquire a good-enough reputation become invisible to the given user. Of course, care must be taken in writing such rules, lest most other people be inadvertently lumped into the group of potential spammers. 4.4.6 Introductions If one user's Yenta determines that some other Yenta seems unusually close in inter- ests to one of its users clusters -- better the characteristics match within a user-settable threshold -- it can suggest an introduction. This suggestion takes the form of an auto- matically-generated message to both Yentas. The Yenta suggesting the introduction sends a message to the other Yenta saying, in effect, I think we should be introduced, and also, in effect, sends its user a message saying, I think you should introduce your- self to this other user. Users are free to accept or ignore such introductory messages, and may configure Yenta to increase or decrease the approximate frequency of their occurrence. Introductions serve the important purpose of getting users together who do not other- wise know of each other's existence. After all, if user A sends a message to B, then A must have known about B first. Similarly, if user A never sends any messages, even to a whole cluster, then A is effectively invisible to everyone else in the cluster. Introduc- tions serve as a way to suggest to such lurkers that they interact with particular other individuals. 4.4.7 Reputations Chapter 2.11 described the basic features of the attestation system, in which users may create strings of text describing themselves, and others may cryptographically sign these strings. Yenta supports the creation, display, and signing of attestations, and users may use these attestations to filter incoming messages based on who has signed the attestation or which strings appear in an attestation. The actual things that users say about themselves via this reputation system constitute a set of social mores. The final development of this set is unknown; it is very often the case that small initial perturbations can lead to large eventual changes in what are considered common customs, idioms, and the like [20][33][49][59][60][116]. The study of how Yenta's users actually use the reputation system could be very fruitful from a sociological standpoint. See Figure 10 for an example from Yenta's UI of how attestations are seen by the user. 4.4.8 Bookmarks It is often convenient to be able to mark a spot in the user interface with a bookmark, similarly to the way that one can bookmark a page at a static website. However, Yenta makes this more complicated than it might appear, because there may be more than one Yenta -- each belonging to a different user -- running on the same computer at the same time. As explained in Chapter 2.12, this means that each Yenta must use a differ- ent network port to communicate with its user -- but browser bookmarking systems include the port as part of the URL. Consider what happens when the user drops a bookmark on some page of a running Yenta. When that Yenta is later restarted -- after a machine crash, or because the user shut it down to start running a newer version -- there is no guarantee that it will acquire the same port. Any browser bookmarks will therefore be invalidated, pointing either at the wrong Yenta, or no Yenta at all. To avoid this, Yenta has its own, internal bookmarks, which may point at any page served by the user interface. Users can add or delete bookmarks, and may sort them either alphabetically by page title, or chronologically by when they dropped them. Since these bookmarks are kept internally by Yenta, the details of which port Yenta happens to be currently using for its HTTP server are irrelevant. 4.4.9 News Yenta occasionally has something to say to the user that is unrelated to anything the user has done recently, and is also not an incoming message. For example, someone may have recently signed one of the user's attestations, and their Yenta has just con- nected and passed it along. Or Yenta may have decided to rescan the user's docu- ments, based on instructions to do so periodically, and may wish to inform the user that this has taken place. In these cases, Yenta makes available a page of news. Each item on this page is a brief description of some event that has taken place. Users may review the items, and then tell Yenta to either keep each one or discard it. See Figure 11. 4.4.10 Help Yenta contains a large number of pages documenting its operation. Users may select such pages at any time. The help system understands which page the user was just viewing, and can sometimes offer a specific help topic that would be relevant to the page the user just came from. However, users can always see all available help topics at any time. See Figure 12. 4.4.11 Configuration Users may tune certain parameters in Yenta to make it more to their liking. For exam- ple, certain thresholds, or the details of what constitutes a file which Yenta should ignore during scanning, may not be correct for all users in all environments. Yenta allows users to adjust the values of these parameters. See Figure 13. 4.4.12 Other operations A few infrequently-used operations are gathered together on a single page; see Figure 14. These include, for example, allowing the user to change his or her pass- phrase. In addition, this is how the user can cleanly shut down Yenta by hand, for example if the host machine is about to be taken down. Failure to shut Yenta down in this circumstance means that any changes to its state -- such as incoming messages -- since the last automatic checkpoint will be lost. The very last page presented by Yenta's user interface in this case is shown in Figure 15. 4.5 Politics There are several reasons why this particular sample application was chosen to illus- trate the political goals of this research. First, by basing its assessment of user interests on users' own electronic mail, Yenta starts with a set of data that is already quite likely to be considered private by its users. Because Yenta thus deals with private information so heavily, a solution which does not make the usual compromises -- weak or no encryption, and a central server which collects everything -- was imperative. Without such a solution, user acceptance of Yenta would be slight. There is a great pent-up demand for the problem that Yenta attempts to solve -- namely, matchmaking people and finding interest groups. For example, at one time, Yenta was nothing but a set of proposals, some research papers on simulation results, and a vaporware description of what its implementation would probably look like. Nonetheless, the author received (and continues to receive) several hundred messages every year asking for a copy of the application. Even though, at the beginning, deployment of the application was stated to be quite some time away, response to this otherwise-unadvertised potential application was impressive. This combination of private information, an architectural solution, and great user demand means that the Yenta application can itself be an exemplar, which by its very existence advertises that it is possible to offer the service that it does without the tradi- tional compromises that users have come to expect. In addition, the matchmaking that Yenta does -- allowing people to communicate more easily -- is itself a social good, irrespective of its intended effect on later applications designed by others. Of course, this stance does not come without a price. For example, Yenta's use of strong cryptography means that the application itself, having been written inside the United States, may not legally be exported outside the United States and Canada [50][87]. This complicates Yenta's deployment -- it requires that the distribution site run a script that checks the location of the user requesting the download, and ensures that the user at least professes not to be interested in violating US export-control reg- ulations. Furthermore, it means that Yenta may not be mirrored by other sites, unless they arrange to do the same. 4.6 Implementation details Yenta is actually implemented as four major subsystems: o The cryptographic engine, SSLeay [186]. o The document feature extractor, Savant [146]. o The Scheme interpreter, SCM [89]. o The main functionality of the application. Safety via code reuse In general, the strategy applied was to reuse, not rewrite, those components that Yenta required and that were already freely available. Not only is this the expedient course of action, in the case of Yenta's cryptographic elements, it is also the safest -- crypto- graphic software required careful review, because even a good algorithm and design can be ruined by incorrect implementation. Hence, Yenta does not use its own low- level cryptographic infrastructure -- it uses code that others have carefully reviewed as much as it can. Local modifications to such code, while required to achieve the func- tionality Yenta requires, are made carefully. The resulting system is composed of approximately 240,000 lines of C, and 15,000 lines of Scheme. 4.6.1 The C code The first three of the subsystems above are implemented in C, and come from outside the Yenta project per se. SSLeay [186], which is also used in popular versions of the Apache [7] web server, was written in Australia over a span of many years, and has been vetted by many developers who use it in their own applications. Savant started out in life as the original Yenta document comparison engine. This engine originally used the SMART [188] document comparison engine from Cornell, and later was completely rewritten locally to include only the functionality required by Yenta -- SMART was too large, too buggy, and did not really do what we needed to do. This code then became the basis for the document indexing engine -- Savant -- which itself also a part of the Remembrance Agent [146], and was then handed back to Yenta -- in short, this code has been getting shared and rewritten between two research projects for years. Finally, SCM [89] was written by a guest of the MIT Artificial Intelligence laboratory, again over a period of years. Code reuse is hard We have made our own modifications to all three of these packages, rewriting or extending each one by 10-20% (in terms of lines of code) to make them exactly what Yenta requires. While Yenta's development would have been impossibly complex if all of these packages were to have been written from scratch, its requirements are suf- ficiently unusual that nothing was quite correct out-of-the-box. Savant, for example, required extensive changes so that it did not assume it could touch the disk whenever it wanted (as the Remembrance Agent assumes), and also had little support for the document-clustering that Yenta performs. SCM required major modifications to enable reliable networking, to hook it into the SSLeay crypto API, to not make assumptions about the environment in which it would be run, and to enable shipping a single binary, consistent of the entire application, on a wide variety of machine archi- tectures. Portability Yenta is designed to be easy to port. One of the modifications made to all three of these packages was to place each of them under the GNU autoconf/automake system [109], which allows extremely fast configuration of a C-based system on almost all UNIX hosts. This means that someone who wishes to build Yenta from scratch, in many cases, need type only ./configure; make to build the entire C side of the pack- age. 4.6.2 The Scheme code Most of the unique functionality of Yenta is written in Scheme. This was done for sev- eral reasons: o Scheme, like many Lisp-based languages, solves many traditional problems such as garbage-collection and exception-handling in a clean, elegant way. This is a much larger benefit than it first appears -- in a C program, every line of code is a potential coredump, segmentation violation, or memory leak. Yenta must be robust if its users are to take full advantage of it. One of the easiest ways to ensure this robustness is to write in a language which can correctly handle these details for the programmer. o Scheme is not only safe against crashes, but confers substantial safety against ma- licious attack. Approximately half of all crack attempts against operating systems and applications which are written in C consist of buffer-overrun attacks, in which a deliberately-too-large string is sent to some piece of code which fails to correctly check the size of the buffer for which the data is destined. In the most commonly- used environments, such as attack is over used to overwrite the program control stack and force the application to execute arbitrary code from elsewhere that has been embedded in the data. Scheme cannot fall victim to such an attack, because all such data structures are automatically checked by the interpreter for safety be- fore execution. o Scheme code is quite compact. An informal estimate of Yenta's code, and of similar other projects, indicates that 1 line of Scheme code typically takes the place of 10 or more lines of C code, when integrated over a large project. o Part of Yenta's purpose is pedagogical -- it exists to show how to write distributed, privacy-preserving applications. By writing a large portion of it in Scheme, its un- derlying principles can be more easily revealed without being hidden under a huge amount of otherwise necessary but verbose code. o The SCM implementation runs on a very large selection of platforms, including not only UNIX, but MacOS, MSDOS, Amiga, and others. This means that code written in Scheme is inherently quite portable, and simplifies the task of making Yenta run on a large variety of platforms. o Despite the fact that Scheme is an interpreted language, the SCM implementation used in Yenta has proven itself to be very fast. We have not observed that the user need wait for Yenta, at any point, because of any inefficiencies introduced via the user of an interpreted language. The actual Scheme code of Yenta is roughly divided into several subsystems: o The task scheduler. Yenta internally runs a dozen or more individual tasks. Each task handles one I/O stream, such as communicating with a single other Yenta, or with the user's web browser. In addition, various tasks run autonomously at various times to checkpoint Yenta's state to disk, dump statistics to the statistics-collection server, rescan the user's files for new interests, and so forth. Each task is non-pre- emptive, due to the nature of the SCM implementation -- it must explicitly yield to the next task -- and there is substantial support implemented to make it easy to write tasks in this manner. Some tasks have higher priorities than others -- for example, the user-interface task runs at very high priority, so the user is never left hanging, waiting for a page to load. This reassures the user that Yenta is, indeed, still func- tioning. Finally, tasks which get errors are handled -- this includes saving a back- trace of the task for debugging and sending it to the debugging-log server for later analysis by the implementors. Typically, a single dead task only momentarily inter- rupts communication with a single other Yenta, or disrupts a single browser page fetch, and does not permanently cripple the running Yenta. (Yenta tasks encounter errors only very rarely, and their incidence decreases as Yenta's code becomes more completely debugged. A system with zero bugs, of course, could be expected to never have a task get an error -- but even though Yenta is presumed not to be at that point yet, handling errors in this way makes it much less likely that Yenta fail com- pletely due to an error in one part of itself. This makes Yenta substantially more ro- bust than much existing software.) o The user interface. This code understands how to speak HTTP to a browser, includ- ing using SSLeay to encrypt the connection, and can produce correct HTML for each page to be shown to the user. Pages are written for the most part in plain HT- ML, but they may call out to Scheme code to generate part of the page -- thus, for example, a page may have a constant paragraph of text, and a dynamically-gener- ated table, whose contents are based on Yenta's current state. o The InterYenta protocol engine. This manages communications with other instanc- es of Yenta running elsewhere. o Interest-finding and clustering. Yenta must keep track of the user's interests, and must both communicate those interests to other Yentas, and allow the user to tweak them. o Major affordances. Yenta has a large number of various capabilities -- message origination and reception, attestation management, and so forth. Most of these af- fordances use the code described in previous bullets as infrastructure, and is there- fore relatively compact and easy to implement once the infrastructure is in place. 4.6.3 Dumping Yenta is built in two pieces. First, all of the C code is compiled and linked, yielding a highly-customized version of SCM that also incorporates the Savant and SSLeay libraries. Then, the binary is run, and all of the Scheme code and web pages are loaded into the Scheme heap. Once they have been loaded, Yenta is dumped. This pro- cess creates a single file which is a snapshot of the original C code, the contents of the heap, and a continuation which is the locus of control when the binary is restarted. Yenta is a single binary file This procedure means that Yenta may be shipped as a single binary, with no ancillary files of any sort. Users who download the binary may simply run it as-is, with no compilation or configuration steps. Making this process trivial was a high priority in Yenta's design, since even the vast majority of UNIX users would find it either incon- venient or impossible to actually compile an application from source. A very small percentage of those who might run Yenta actually would if they had to build it from scratch. Of course, since Yenta's source distribution is public (subject to export restrictions), anyone who wishes to build Yenta, either because they do not trust the binaries, or because they need a binary for some machine not already available, is free to do so. 4.6.4 Architectures Because ease of installation was a design priority, Yenta is distributed with precom- piled binaries for popular UNIX platforms. As of this writing, this includes Red Hat Linux 5.1, NetBSD 1.3.2, HPUX 9 and 10, SGI Irix 6.2, and Alpha OSF1. It is quite likely that Yenta will compile with no work on many other architectures, but these were the only ones routinely available to the author. 4.7 Determining user interests Given a collection of documents, as detailed in Section 4.4.4, how does Yenta actually determine the user's interests? 4.7.1 Producing word vectors The first step consists of turning each document into a weighted vector of keywords. Each keyword corresponds to some word that appears in the original document, with certain modifications [146]: Toss stopwords o Very common words (stopwords) are removed. Toss machine-generated phrases o Anything matching an exclusion regular expression is removed. This gets rid of HTML markup, PGP signature blocks, base64-encoded MIME documents, mes- sage header field keywords (e.g., anything to the left of the colon in an RFC822 email header field), and a large number of similar elements. Stem o The remaining words are stemmed [133] to remove suffixes. This causes words which have the same root, but are used as different parts of speech, to be more like- ly to match. Note that this step of the algorithm is English-specific; if Yenta was ever ported to some other language, the logic of this stemmer would have to be modified. Weight and vectorize The number of times each resulting word occurs in each document is then counted, and the result normalized by the total length of the document. This ensures that long documents do not disproportionally weight the results. The end result of this process is a word vector, which details, for each document, which interesting words occur in it. 4.7.2 Clustering The second step of the process produces clusters of documents which appear to be talking about similar topics. Each one of the clusters formed is potentially one of the user's interests, and is what is referred to more generally as a characteristic in Section 2.6. The algorithm which forms the clusters operates as follows. We pick a random start- ing vector, V, and then pick a second vector, W. We dot the two vectors together, which determines the similarity of one vector to another. If they match within a threshold, both vectors form the start of some cluster C. If not, we let W also be the start of a new cluster, and pick a third vector, X, dotting that against the two vectors we already have. Any close match joins its cluster; bad matches form their own clusters. After we have generated a few clusters, we stop attempting to generate more, and sim- ply dot the remaining vectors against vectors already in clusters. (For efficiency, we maintain a moving-average representation of each cluster's centroid; this means that testing a vector against a cluster requires dotting it against only one average vector, and not against each vector in the cluster.) When this terminates, we are left with a collection of clusters, and a collection of vec- tors which were not similar enough to any already-existing cluster to wind up in one. The next step is to investigate the fitness of each cluster -- after all, the moving aver- age centroid of any given cluster might have left behind the first few vectors to have been added. This can happen if we are unlucky in our choice of initial vector, and the centroid shifts a large amount due to later additions. Thus, we prune already-existing clusters by dotting each vector already in a given cluster against that cluster's centroid vector. Vectors which are no longer close enough are discarded again. We are now left with some pruned clusters and a pile of extra vectors. This latter pile is made up of vectors which never made it into a cluster in the first place, plus vectors that have been discarded from existing clusters. It is possible that some of these vec- tors are sufficiently alike that they could form a cluster of their own, so we start the clustering process again, using this pile of discards -- one of the vectors we start with may form the seed of a new cluster. After the initial cluster-formation step, we check each vector in the discard pile against all clusters we have generated, and keep any good matches. This algorithm iterates, controlled by thresholds at various points, until some propor- tion of vectors are in clusters, and enough iterations have run. We are left with clusters that have empirically-reasonable variance in terms of the vectors they include, and a pile of leftover vectors. The algorithm actually runs the forward (clustering) direction and the reverse (prun- ing) direction in parallel. This is analogous to the way bone is formed -- via cells called osteoblasts -- and destroyed -- via osteoclasts. Bone is piezoelectric, and gener- ates an electrostatic field when under mechanical stress. Osteoclasts are constantly tearing down bone, whereas osteoblasts produce more bone wherever this is a large electrostatic field. Hence, bone preferentially builds up wherever the stress is high- est -- hence reducing the stress again -- without building up in places where it is not needed. Yenta's document-clustering algorithm constantly tries to build up a cluster by adding any vector which is close to that cluster's centroid, while it simultaneously tries to tear down the cluster by removing any vector newly deemed unfit to remain. The initial vectorizing algorithm, which converts documents to vectors of keywords, runs in time and space that is approximately linear in the number of words in all doc- uments. The clustering algorithm is slightly more complicated. The forward direction runs in approximately linear time, due to its use of the moving-centroid approach. The reverse direction runs in approximately O(n2) time, since the total number of times any given vector might be chosen to compare against the centroid depends on the size of the cluster and how long this cluster has been around. However, since the number of clusters -- generally under a hundred, and often under twenty -- is much smaller than the typical number of documents -- which typically number in the thousands or more -- the overall behavior of the clustering algorithm is typically close to linear. The algorithm chosen here was simply generated ad-hoc. We shall have more to say about its performance in Chapter 5, but the overall lesson is that it seems to work well enough. Since Yenta makes no particular claims either to advance the state of infor- mation-retrieval research, nor of optimality across any particular dimension of docu- ment comparison, this is acceptable. 4.8 Security considerations We spoke at length about the security of the general architecture in Chapter 3. Here, we shall speak about a few wrinkles that Yenta introduces. 4.8.1 Encrypting connections Connections between Yentas, and connections from Yenta to the user's browser, are always encrypted. This is accomplished by running SSL [186] between each pair of communicating agents, using Diffie-Hellman key exchange for perfect forward secrecy -- session keys are discarded at the end of the connection -- and self-signed certificates to complicate man-in-the-middle attacks. Note that these self-signed certificates make it difficult to do a man-in-the-middle attack only between two Yentas (or a Yenta and a browser) that have previously com- municated. They are worthless if a man in the middle can be in the middle from the very start of the conversation, since there is no certifying authority, nor a web of trust, available to validate the cert. On the other hand, since we are in the case that we have never talked to the Yenta at the far end of the connection anyway, we might as well treat the man in the middle as just some other unknown Yenta we have never spoken with. The man in the middle can keep both ends from knowing the true YID of the endpoints, but it cannot otherwise cause much trouble -- for example, attestations are signed by other Yentas, not by the Yenta belonging to the user the attestation refers to. Indeed, were someone to set up a man-in-the-middle Yenta that successfully passes data in both directions to two other Yentas, the largest apparent problem surfaces if the middle Yenta vanishes -- at that point, neither endpoint knows how to talk to the other. 4.8.2 Protecting persistent state Yenta must save persistent state to disk. If it did not do this, it could not survive the crash of either Yenta or the host computer. There are two cases here: the user's charac- teristics, and everything else. Characteristics First, we have the user's characteristics, which were derived from the user's file and email. These are stored unencrypted, for two reasons: o The characteristics were originally derived from reading messages which arrived over cleartext channels, and are stored on the disk in the clear. The original repre- sentation of this data (files and email) is far more comprehensible to humans than the vectorized, stopworded, stemmed representation left on disk -- hence, leaving this data around on disk, assuming it is at least protected against other readers using filesystem protection bits, is no more of a privacy exposure than what the user was already doing. o The Savant library is unprepared for dealing with encrypted data. If we did not also have the case detailed in the above bullet, it would be worth fixing this. As it is, however, such effort would not improve Yenta's privacy. Keys, conversations, ... Even though the user's characteristics were derived from the user's mail -- presumed to already be sitting around on disk in the clear -- the stored conversations in which the user has participated were not formerly stored in the clear, and were carefully transmitted between agents using encrypted protocols. We should not presume to expose them once they have been stored on disk. Even worse, the user's private key -- the very basis of his or her identity -- is in the same file. Exposing this would be a disaster, since it could allow anyone to both eavesdrop and impersonate the user. The strategy used is to encrypt the data directly to disk, using IDEA in cipher-block- chaining (CBC) mode. It uses ePTOBs, aka encrypting Scheme port objects, which act like normal Scheme ports, but encrypt or decrypt along the way -- they use SSLeay for their underlying implementation. The question then is, how does Yenta store the key so the data may be decrypted later? What it does is to write out a small preamble, which consists of some bootstrapping data, and then the main data, which consists of the encrypted state. Both of these are written to the same file on disk. Yenta's actual persistent state is a variable-length string of bytes, called D. [We do not compute a MAC of D; perhaps we should if we can. This would provide some protec- tion against an attack that changes bit(s) of ciphertext (hence trashing the plaintext), but it would require somehow either precomputing a checksum, or computing one on the fly as data is written out. Both are somewhat inconvenient.] When Yenta first starts up, it asks the user for a passphrase, P. This passphrase does not change unless the user manually changes it. Yenta immediately computes the SHA-1 hash of the passphrase, PSHA, and throws away P. Saving state Each time Yenta needs to save state, it generates a new 128-bit session key, K, which is used for keying the cipher. It also generates a 64-bit verifier, V. Both of these are high- quality random numbers, drawn from the random pool. Finally, it generates an encrypted version of the session key, KP, using the first 128 bits of PSHA as the encryp- tion key and IDEA as the cipher. (Since we're encrypting 128 bits of random data, we need neither any block-chaining, nor any IV.) It then writes out the following data o To the preamble (key) portion of the file, in the clear: o The cleartext version of the browser cert o The encrypted version of the session key, KP. o To the main (data) portion of the file, encrypted on the fly (via an ePTOB keyed by K, the session key): o Two copies of the verifier, V, one immediately after the other; we shall call this V1V2. o The persistent state, D. Because data is encrypted on the fly, before it hits the disk, what we have really writ- ten to the main data portion of the file is really [V1V2]K and DK. Restoring state Yenta only reads its persistent state upon startup. The first thing it must do is to read the cleartext version of the browser cert from the keyfile. It requires this data so it can establish an SSL connection to the user's browser, without generating a brand-new certificate -- doing so would require that the user walk through all the cert-validation menus in the browser for every Yenta startup. Yenta then prompts the user for the passphrase, P, and computes PSHA, as above. It then reads the encrypted session key, KP, from the preamble, and decrypts it, using the first 128 bits of PSHA as the key. This regenerates the true session key, K. Now that K is known, Yenta continues reading, now in the encrypted portion of the file, and reads the first 128 bits from it, which should be V1V2 -- the two concatenated copies of V. If V1 does not match V2, then K must be incorrect. For K to be incorrect, we must have incorrectly decrypted KP, which implies that PSHA is wrong. The only way this could happen is if the user mistyped the passphrase, so we prompt again, and repeat. Assuming that the verifier matches, we now have a correct session key, so we supply that to the decrypting ePTOB and read the rest of the file, which converts DK back to D. Vulnerability analysis What vulnerabilities might exist in this approach? o Data is never left unencrypted anywhere on disk. o We assume that IDEA-CBC is secure up to brute-force keysearch. Nonetheless, we assume that we do not want to gratuitously enable a known-plaintext attack. [The ePTOB itself also includes a 64-bit IV before the encrypted data; this helps to foil known-plaintext attacks on the first block. This would otherwise be a very simple attack, since the contents of the first block are nearly constant for all Yentas.] o We assume good random numbers. o We are not secure against an attack that can read the contents of Yenta's address space. (This is true of the entire design: anyone who can read the address space can suck out PSHA, which is kept around indefinitely. This does not matter, though, be- cause such an attack could suck out the RSA keypair which defines the basis of the user's identity -- this is far worse, and is basically a complete compromise, allowing both eavesdropping and spoofing.) o A weak passphrase is vulnerable to dictionary attack, which will allow decrypting the session key and thus allow access to the plaintext of the private key. o It is possible that [V1V2]K could leak some information to a cryptanalyst. E.g., it is known that 4 bytes are repeated in the next block in a predictable place in the ci- phertext (since we use an IV but not variable padding). This does not appear to be an actual vulnerability, since V is not known plaintext. (Hashing the second copy might help even so, or might only add a constant factor to the attack; not clear.) It appears, as usual, that the primary vulnerabilities are (a) insecure process address space, and (b) the user picking a poor passphrase. Full disks There is one final consideration. What happens when the disk fills up? Yenta tries to be relatively careful about the integrity of the saved statefile. After all, if this file is corrupted, the user's private key goes with it, and hence all of the user's identity and reputations (via attestations signed by other users) as well. This is an intolerable loss. The most obvious defense is to write a temporary copy of the statefile, ensure that it is correct, and then atomically rename it over the old copy. This means that a crash in the middle of the write will not corrupt the existing statefile. But how do we know that the tempfile was, in fact, written correctly? SCM does not signal any errors in any of its stream-writing functions, because it fails to check the return values of any of the underlying C calls. This means that, if the disk fills up, the Scheme procedures write, display, and related functions will merrily attempt to fill the disk to full and bursting, and will continue dumping data overboard even after the disk is full, all without signalling any errors. This is an unfortunate, but hard to fix, implementation issue. Even if we check at the beginning and the end whether the disk is full (by writing a sacrificial file and seeing if we get the bytes back when we read it), consider what happens if the disk momentarily fills in the middle of saving state, then unfills. This could easily happen if something writes a tempfile at the wrong moment. In this case, SCM will silently throw away n bytes of intended output, while not detecting the fail- ure. Even rereading the file may fail to detect it, if the dropped bytes were inside a string constant. One possible solution is to call force-output after every single charac- ter, then stat the file and see if its length has incremented, or, alternatively, to write and read a sacrificial file after each character of real output. Either of these approaches is (a) extremely difficult to implement (since we write output in larger chunks, and through an encrypting stream as well), and (b) horribly inefficient, proba- bly slowing down checkpointing by at least two orders of magnitude if not more. To avoid this, we run a verification function over the data written, every time it is writ- ten. This function does the work of reading and checking the contents of the preamble against the running Yenta (e.g., encryption protocol version, the browser cert and browser private key, etc.), and then computes the SHA-1 hash of the entire encrypted portion of the file, e.g., of the data portion in the discussion above. This is then com- pared with an identical hash, computed seconds earlier when the data was written to disk. If anything is wrong with the preamble or if the hashes do not match, then some- thing is wrong with the data we just wrote; a single byte missing or even a single bit trashed will be evident. In this case, we do not rename our obviously-corrupt tempfile over the last success- fully-saved statefile. Instead, we delete it again, since it may be contributing to a disk- full condition and is bad in any event. In addition, we set a variable so the user inter- face knows that something is wrong, and can tell the user, who can presumably attempt to fix whatever is preventing us from successfully writing the statefile. Note that this gives us no protection over having the statefile trashed after we have checkpointed. If Yenta is still running, the damage will be undone at the next check- point, since the old file will simply be thrown away unread. However, if Yenta was not running when the file was trashed, Yenta will simply fail to be able to correctly read the entire thing. (Chances are overwhelming that any corruption of the file will yield garbage after decryption that read will complain about, and Yenta will be unable to finish loading its variables.) In this case, the user will have no choice but to restore the file from backup. This is the expected case anyway if files are being trashed at random in the filesystem. Note also that Yenta's support applications, which write plaintext statefiles and do not save state using encryption, do not do this checking. They save very little irrecover- able state in normal operation; the big exception is the statistics logger, which will simply have its data truncated, losing log entries that arrive while the disk is full and possibly leaving a corrupted last entry. This is not considered a serious problem. Fur- thermore, this state is not being saved in a statefile at all, but is being explicitly written to a a separate logfile. 4.8.3 Random numbers Yenta's security is dependent upon having good random numbers, since these num- bers determine the quality of its cryptographic keys. On machine architectures which have /dev/random, Yenta simply uses that -- it is designed to be good enough for most cryptographic applications, and tries hard to collect random state from all over the machine. Machines which lack /dev/random instead prompt the user, the very first time Yenta starts up, to enter a large number of keystrokes, and Yenta measures the interarrival time of these keystrokes. This is the same technique (and partially the same code) used by PGP. Yenta then maintains that random state by keeping a random pool, which is a collec- tion of random bits. Everything that uses bits from the pool, such as generating a key, keeps track of the number of bits used, and Yenta runs several tasks at a variety of time intervals which attempt to regenerate randomness in the pool by running a vari- ety of programs which sample many events happening on the system, as well as also using /dev/random, if available. This random-pool is saved when Yenta checkpoints its state, so newly-started Yentas have randomness. As long as Yenta can continue to gather randomness data from the machine faster than it is consumed to generate, e.g., session keys, its cryptographic quality should remain high. (If Yenta cannot do this, it warns the user; this is considered an implementation error.) 4.9 Summary In this chapter, we have described Yenta -- the sample application which demonstrates how the underlying architecture can be used in a real system, and which is intended to raise public awareness of the techniques developed in this research and the rationale for their development. We have presented several sample scenarios to motivate why Yenta is useful, and then described the various affordances provided by Yenta to its users. These include automatic determination of interests, messaging into groups of users who share interests or to particular individuals, automatic introductions, and a reputation system. We then delved into Yenta's implementation, describing the gen- eral structure of the code, what the major pieces are, and how they fit together. Finally, we discussed those security considerations which are specific to Yenta itself and not necessarily to the general architecture.