How the Fraudulator Works

Using a computer to compare two samples of prose directly with each other will give you a lot of detail on how they differ. But the detail will overwhelm any understanding of whether the two samples are related to each other. A human being, by contrast, can often look at a pair of samples and make a good guess about their provenance.

To make software capable of the same task, you must abstract out the detail so that you can analyze the deeper structure. A first step, for example, might be to monocase every word in the samples, so that "Apple" and "apple" count as the same word. The ultimate goal, though, is to strip away everything except the detail that is relevant to analyzing your prose samples.
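
As a minimal sketch of that kind of normalization in perl (the tokenizing rules here are illustrative, not necessarily the ones this site uses):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Reduce a prose sample to a list of normalized words: monocased,
    # stripped of punctuation, split on whitespace.
    sub normalize_words {
        my ($text) = @_;
        $text = lc $text;             # "Apple" and "apple" become the same
        $text =~ s/[^a-z'\s]/ /g;     # drop digits and punctuation
        return grep { length } split /\s+/, $text;
    }

    my @words = normalize_words("Apple, apple, and APPLES.");
    print "@words\n";                 # apple apple and apples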

I tried many different ways to hash these 419 messages before settling on the algorithm described here. Hashing by letter frequency, for example, initially looked promising but turned out to be a dead end. The relative brevity of each sample makes the problem difficult: the samples average about 475 words in length. On the plus side, their content is tightly constrained by their purpose -- there are only a few ways to tell this story, so each author's style tends to shine through. Vocabulary analysis turned out to be the most effective way to group the samples.

Vocabulary Analysis

The samples use a total of about 10,500 distinct words. As you'd expect, a histogram of the frequencies of these words shows a two-peak graph: many words occur fewer than ten times in the whole database, and many occur multiple times in every sample. We're most interested in the words which occur in some but not all of the samples, because the samples which share them have a high probability of sharing authorship. The word "demurrage", for example, occurs in only 42 members of the collection, so its presence indicates that those samples share a common ancestor if not a common author.

Word Vector Analysis

This site abstracts out the interesting words in each prose sample in the database, then uses a linear-algebra operation called the "dot product" to compare the input sample against each of my samples and produce a list ranked by similarity. Thanks to Michael Hirsch for his patient explanations of how dot products and vector analysis work. The first step of the algorithm is to build a database of every word, across all the letters, that occurs with an interesting frequency.
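
A rough sketch of that first step in perl might look like the following; the directory layout, the cutoff values, and the "wordlist.txt" file name are placeholders of my own, not the site's actual configuration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # For each word, count how many samples it appears in, then keep only
    # the words whose counts fall inside the "interesting" band.
    my @files = glob("samples/*.txt");                   # hypothetical layout
    my ($min_docs, $max_docs) = (2, int(0.8 * @files));  # assumed cutoffs

    my %doc_count;    # word => number of samples containing it
    for my $file (@files) {
        open my $fh, '<', $file or die "$file: $!";
        local $/;                        # slurp the whole letter
        # (punctuation handling omitted; see the normalization sketch above)
        my %seen = map { $_ => 1 }
                   grep { length } split /\s+/, lc <$fh>;
        $doc_count{$_}++ for keys %seen;
    }

    # Write the hash file of interesting words, one per line.
    open my $out, '>', 'wordlist.txt' or die $!;
    for my $word (sort keys %doc_count) {
        my $n = $doc_count{$word};
        print {$out} "$word\n" if $n >= $min_docs && $n <= $max_docs;
    }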

Once I have created this hash file, I walk through the database and create a 'word-frequency vector' for each sample. The vector is a list of numbers, one for each word in the hash file. If a word from the hash file occurs in the prose sample, the corresponding member of the list gets a positive number; if it does not, the corresponding member gets a '0'. I save the vectors along with their sample filenames in a file, one to a line.
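
Continuing the sketch above, the vector-building pass might look roughly like this; the flat value of 1 for a hit is a simplification (the scaling actually used is described under "Dot Products of Vectors" below), and the file names are again placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Load the global word list in a fixed order, so that every sample's
    # vector lines up word-for-word with every other sample's vector.
    open my $wl, '<', 'wordlist.txt' or die $!;
    chomp(my @wordlist = <$wl>);

    # Build one word-frequency vector per sample and save them,
    # one "filename: n1 n2 n3 ..." record per line.
    open my $out, '>', 'vectors.txt' or die $!;
    for my $file (glob "samples/*.txt") {
        open my $fh, '<', $file or die "$file: $!";
        local $/;
        my %in_sample = map { $_ => 1 }
                        grep { length } split /\s+/, lc <$fh>;

        # 1 if the hash word occurs in this letter, 0 if it does not.
        my @vector = map { $in_sample{$_} ? 1 : 0 } @wordlist;
        print {$out} "$file: @vector\n";
    }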

To rank the collection by similarity to the input sample, then, I build a word-frequency vector for the input, take the dot product between it and each sample vector in my database, and sort the database samples by their dot products. The output is the top N members of this list, where N is the output list size you have selected on the database web page.
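
Put together, the ranking step might be sketched like this; it assumes the wordlist.txt and vectors.txt files produced by the earlier sketches and reads the input sample from standard input:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $N = shift || 10;    # output list size

    # Build the input sample's vector from whatever arrives on STDIN,
    # using the same word list the stored vectors were built against.
    open my $wl, '<', 'wordlist.txt' or die $!;
    chomp(my @wordlist = <$wl>);
    my $text = do { local $/; <STDIN> } // '';
    my %in_input = map { $_ => 1 } grep { length } split /\s+/, lc $text;
    my @input = map { $in_input{$_} ? 1 : 0 } @wordlist;

    # Dot the input vector against each stored sample vector.
    my %score;
    open my $vf, '<', 'vectors.txt' or die $!;
    while (<$vf>) {
        my ($file, $nums) = split /:\s*/, $_, 2;
        next unless defined $nums;
        my @v = split ' ', $nums;
        my $dot = 0;
        $dot += $input[$_] * $v[$_] for 0 .. $#input;
        $score{$file} = $dot;
    }

    # Report the N most similar samples, best match first.
    my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
    $N = @ranked if $N > @ranked;
    print "$_\t$score{$_}\n" for @ranked[0 .. $N - 1];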

Dot Products of Vectors

The dot product is a way of computing the similarity between a pair of vectors. The full mathematical explanation is somewhat daunting, but my use of it is relatively simple. If you think of a vector as simply a list of numbers which plots a line from the origin in an arbitrary number of dimensions, then taking the dot product of a pair of vectors in the same space is simple: take every number in one vector, multiply it by the corresponding number in the other vector, and add the result to a total. The final total varies between 0 (if the vectors are exactly perpendicular to each other) and the sum of the squares of one vector's components (if the two vectors are identical).
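
Restated as a few lines of perl, with a tiny worked example:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Dot product of two equal-length lists of numbers, passed by reference.
    sub dot_product {
        my ($u, $v) = @_;
        my $total = 0;
        $total += $u->[$_] * $v->[$_] for 0 .. $#$u;
        return $total;
    }

    # (1, 0, 2) . (1, 1, 2) = 1*1 + 0*1 + 2*2 = 5
    # (1, 0, 2) . (0, 1, 0) = 0                      (perpendicular vectors)
    # (1, 0, 2) . (1, 0, 2) = 1 + 0 + 4 = 5          (identical: sum of squares)
    print dot_product([1, 0, 2], [1, 1, 2]), "\n";   # prints 5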

Vector pairs with the largest total created by this operation have the smallest angle separating them. If the two vectors have different lengths, though, the comparison is thrown off a little, because the vector built from the longer letter matches more words from the hash and piles up a larger total regardless of real similarity. In my case, this problem comes up when I try to compare samples of different length. To correct for it, I scale the numbers I put into the word vector: rather than simply inserting a "1" for each word in the sample which occurs in the global list, I insert a scaling factor (currently set at 10000) divided by the length of the letter. In that way, a word hit from my hash which occurs in a longer letter counts a little less in the final total than a hit which occurs in a shorter one. I could add additional scaling factors based on the frequency of each word in the total wad of prose, but so far that has not been necessary.
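
Concretely, the only change to the vector-building sketch above is the value inserted for a hit. The 10000 constant is the scaling factor mentioned here; measuring the letter's "length" as its word count is my own reading of it, and could just as well be a character count:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build a length-scaled word vector: each hit contributes
    # 10000 / (word count of the letter) instead of a flat 1.
    sub scaled_vector {
        my ($text, @wordlist) = @_;
        my @words = grep { length } split /\s+/, lc $text;
        my %in_sample = map { $_ => 1 } @words;
        my $hit = 10000 / (@words || 1);      # guard against an empty sample
        return map { $in_sample{$_} ? $hit : 0 } @wordlist;
    }

    # The same words hit in a letter four times as long count for less.
    my @wordlist = qw(demurrage barrister consignment);
    my @short = scaled_vector("the barrister holds the consignment",      @wordlist);
    my @long  = scaled_vector("the barrister holds the consignment " x 4, @wordlist);
    print "@short\n";   # 0 2000 2000
    print "@long\n";    # 0 500 500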

Web-based Database Maintenance

My collection of 419 spam samples could change, so I wrote a perl script and associated web page to re-generate the word list and the sample vectors derived from it. The page simply asks for the lowest and highest frequencies to include in the global word list, then cuts loose a perl script to regenerate the word list and its associated vector list. Naturally, this page is not linked and is password-protected.
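
A maintenance handler along those lines might be sketched as follows; the form field names and the helper scripts it calls are hypothetical, and the real page adds password protection on top:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CGI;

    # Hypothetical maintenance handler: read the frequency cutoffs from
    # the form, rebuild the word list and the vectors, and report back.
    my $q   = CGI->new;
    my $min = $q->param('lowest_frequency')  // 2;
    my $max = $q->param('highest_frequency') // 500;

    print $q->header('text/plain');

    # rebuild_wordlist.pl and rebuild_vectors.pl stand in for the real
    # regeneration script(s); their names and interface are made up here.
    system('./rebuild_wordlist.pl', $min, $max) == 0
        or die "word list rebuild failed: $?";
    system('./rebuild_vectors.pl') == 0
        or die "vector rebuild failed: $?";

    print "Rebuilt word list and vectors with cutoffs $min .. $max\n";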

So far I am not keeping the samples pasted into the public web form, but a further enhancement might be to do so. At that point I can write more scripts and forms to move the interesting ones into the database and re-gen the index files.