Return-Path: gward@python.net
Delivery-Date: Mon Sep  9 20:25:42 2002
From: gward@python.net (Greg Ward)
Date: Mon, 9 Sep 2002 15:25:42 -0400
Subject: [Spambayes] deleting "duplicate" spam before training?  good idea
	or bad?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIECKBDAB.tim.one@comcast.net>
References: <15740.52432.861148.597750@12-248-11-90.client.attbi.com>
	<LNBBLJKPBEHFEDALKOLCIECKBDAB.tim.one@comcast.net>
Message-ID: <20020909192542.GB2002@cthulhu.gerg.ca>

On 09 September 2002, Tim Peters said:
> > Would people be interested in the script?  I'd be happy to extricate
> > it from my local modules and check it into CVS.
>
> Sure!  I think it's relevant, but maybe for another purpose.  Paul Svensson
> is thinking harder about real people <wink> than the rest of us, and he may
> be able to get use out of approaches that identify closely related spam.
> For example, some amount of spam is going to end up in the ham training data
> in real life use, and any sort of similarity score to a piece of known spam
> may be an aid in finding and purging it.

OTOH, look into DCC (Distributed Checksum Clearinghouse,
http://www.rhyolite.com/anti-spam/dcc/), which uses fuzzy checksums.
It's quite likely that DCC's checksumming scheme is better than
something any of us would throw together for personal use (no offense,
Skip!).  But I have no personal experience of it.
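
To make "fuzzy checksum" concrete, here is a rough sketch of the general
idea: normalize away the parts spammers vary, hash overlapping word
runs, and compare the resulting hash sets.  This is NOT DCC's actual
scheme, which I have not studied; the function names, the 3-word window,
and the 16-hash cutoff are all assumptions picked for illustration.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and keep only letters, so digits and punctuation are ignored."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()


def shingles(words, k=3):
    """Return the set of overlapping k-word runs (hypothetical window size)."""
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fuzzy_signature(text, k=3, keep=16):
    """Hash every shingle and keep the `keep` smallest hashes as a signature."""
    hashes = sorted(
        int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
        for s in shingles(normalize(text), k)
    )
    return frozenset(hashes[:keep])


def similarity(a, b):
    """Jaccard overlap of two signatures: 1.0 = same, 0.0 = nothing shared."""
    sa, sb = fuzzy_signature(a), fuzzy_signature(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Illustrative messages, not taken from any real corpus.
spam_a = "Buy cheap pills now! Order #12345 today"
spam_b = "BUY cheap pills now. Order #99999 today!!"
ham = "Would people be interested in the script?"

print(similarity(spam_a, spam_b))
print(similarity(spam_a, ham))
```

Two spams that differ only in case, punctuation, and a serial number
normalize to the same words and so score as identical, while an
unrelated message shares no word runs with them.  A real system would
need something much sturdier against deliberate obfuscation.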

        Greg
--
Greg Ward <gward@python.net>                         http://www.gerg.ca/
If it can't be expressed in figures, it is not science--it is opinion.