The fraudulator is a web application which helps you analyze your advance fee fraud spam. It contains a database of fraud spam letters and an application which will find letters similar to one which you present to it.
Advance fee fraud is the technical name for what is popularly called "419" or "Nigerian" fraud. This fraud's general outline has been popular since time immemorial -- in the old days it was often called the "Spanish Prisoner" fraud. The con artist sends a letter to a large population of potential marks. The letter claims (or implies) that the recipient is the only one who received it, and that in return for a trivial amount of work they can reap a large financial reward. Anyone who replies is told that there's a hitch, and that they need to pay money to ensure that this opportunity does not pass them by. The con artist then tries to keep as many marks as possible on the line, paying money, for as long as possible.
Advance fee con artists have found that email is a potent tool for reaching a large number of people cheaply, and this kind of letter has become endemic in email inboxes all over the world. The source for much of this is apparently Nigeria, in part because its government is corrupt, internet access appears widespread, and there's a large population of poor, educated English speakers there. The letters themselves, of course, claim to be from all over the world. The slang term "419 email" comes from the section number of the Nigerian criminal code which makes this form of fraud illegal. A typical 419 letter looks like this.
I've been intrigued by 419 email for a long time. In June of 2003, Scientific American magazine published an amusing article in which some biologists used DNA cladistic analysis software on a collection of about 30 old-fashioned chain letters. The chain letter is a letter of the form "Send more copies of this letter to your friends or great misfortune will befall you". It's related to but often distinct from actually fraudulent mail. With the software, the three co-authors of the article were able to derive a fairly plausible-looking tree of descent for the letters they had.
This article Got Me to Thinking. At the time, I was (un)lucky enough to have an email address which received a large amount of spam (over 100 spams a day). Over the course of a year, I collected about 1,000 419 letters. A sample set of this size is too big for the kind of cladistic analysis used in the article. So the letters sat on my hard drive for a year or so. Occasionally I'd gnaw on them with various text analysis tools, trying to derive a pattern from them which was interesting and within the power of the machines I had at my disposal.
Eventually, I settled on writing something which would at least let me get a handle on how many writers my collection represented. I found an algorithm which analyzes a letter's vocabulary, which is a strong marker of a writer's style. This let me sort letters into groups by similarity, then calculate which group a new spam letter most resembles.
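To give a rough feel for what "grouping by vocabulary" can mean, here is a minimal Python sketch. It is not the fraudulator's actual algorithm (the "How does this work?" page describes that); it swaps in plain cosine similarity over word counts as a stand-in. The function names (vocabulary, cosine_similarity, closest_group) and the sample group labels are made up for illustration.

```python
import re
from collections import Counter
from math import sqrt


def vocabulary(text):
    """Count lowercase word occurrences in a letter."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters (0.0 to 1.0)."""
    # Counter returns 0 for missing words, so unshared words add nothing.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_group(letter, groups):
    """Return (name, score) of the group whose pooled vocabulary best matches.

    `groups` maps a hypothetical group label to a list of letter texts.
    """
    target = vocabulary(letter)
    best_name, best_score = None, -1.0
    for name, letters in groups.items():
        pooled = Counter()
        for text in letters:
            pooled.update(vocabulary(text))
        score = cosine_similarity(target, pooled)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


if __name__ == "__main__":
    # Tiny invented samples; a real collection would hold whole letters.
    sample_groups = {
        "bank_manager": [
            "I am the manager of a bank and I need your assistance "
            "to transfer the funds abroad",
        ],
        "widow": [
            "I am the widow of a late general and I seek your trust "
            "to invest his estate",
        ],
    }
    new_letter = "As a bank manager I need assistance to transfer funds"
    print(closest_group(new_letter, sample_groups))
```

Pooling each group's letters into one word count keeps the sketch short; a more careful version might weight rare words more heavily or compare against each letter in a group individually.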
This site deals only with 419 letters in English. I've seen scam letters in French, German, and Spanish, but the software I've written analyzes letters by word choice and my database includes only English samples at this time. Nothing in the software itself restricts the language of the prose samples you can analyze. See the "How does this work?" link, below, for details.
The software, of course, is licensed under the GNU General Public License. Feel free to download the source and play with it!