Return-Path: gward@python.net
Delivery-Date: Mon Sep 9 20:25:42 2002
From: gward@python.net (Greg Ward)
Date: Mon, 9 Sep 2002 15:25:42 -0400
Subject: [Spambayes] deleting "duplicate" spam before training? good idea or bad?
In-Reply-To:
References: <15740.52432.861148.597750@12-248-11-90.client.attbi.com>
Message-ID: <20020909192542.GB2002@cthulhu.gerg.ca>

On 09 September 2002, Tim Peters said:
> > Would people be interested in the script? I'd be happy to extricate
> > it from my local modules and check it into CVS.
>
> Sure! I think it's relevant, but maybe for another purpose. Paul Svensson
> is thinking harder about real people than the rest of us, and he may
> be able to get use out of approaches that identify closely related spam.
> For example, some amount of spam is going to end up in the ham training data
> in real life use, and any sort of similarity score to a piece of known spam
> may be an aid in finding and purging it.

OTOH, look into DCC (Distributed Checksum Clearinghouse,
http://www.rhyolite.com/anti-spam/dcc/), which uses fuzzy checksums.
It's quite likely that DCC's checksumming scheme is better than
something any of us would throw together for personal use (no offense,
Skip!). But I have no personal experience of it.

        Greg
-- 
Greg Ward http://www.gerg.ca/
If it can't be expressed in figures, it is not science--it is opinion.
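
For the "similarity score to a piece of known spam" idea in the quoted text, a rough sketch of what such a score might look like follows. It is neither DCC's fuzzy checksum nor Skip's script: it swaps in plain Jaccard similarity over word shingles, and the names shingles, similarity, suspect_ham and the 0.5 threshold are all made up for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets, in [0.0, 1.0]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def suspect_ham(ham, known_spam, threshold=0.5, k=3):
    """Return indices of ham messages that look like some known spam.

    threshold is an arbitrary illustrative cutoff, not a tuned value.
    """
    return [
        i for i, msg in enumerate(ham)
        if any(similarity(msg, spam, k) >= threshold for spam in known_spam)
    ]
```

Something like suspect_ham(ham_bodies, spam_bodies) would return the positions of ham messages worth a second look before training. Comparing every pair is quadratic in corpus size, which is one reason a checksum-style scheme may scale better.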