Return-Path: skip@pobox.com
Delivery-Date: Mon Sep 9 20:19:04 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 9 Sep 2002 14:19:04 -0500
Subject: [Spambayes] deleting "duplicate" spam before training? good idea or bad?
In-Reply-To:
References: <15740.52432.861148.597750@12-248-11-90.client.attbi.com>
Message-ID: <15740.62504.533596.995365@12-248-11-90.client.attbi.com>

    >> I wrote a script some time ago to try and minimize the duplicates I
    >> see by calculating a loose checksum, but I still have some
    >> duplicates.  Should I delete the duplicates before training or not?

    Tim> People just can't stop thinking.  The classifier should work
    Tim> best when trained on a wholly random spattering of real life.  If
    Tim> real life contains duplicates, then that's what the classifier
    Tim> should see.

A bit more detail.  I get mail destined for many addresses:
skip@pobox.com, skip@calendar.com, concerts@musi-cal.com,
webmaster@mojam.com, etc.  I originally wrote (a slightly different
version of) the loosecksum.py script I'm about to check in so that I
wouldn't have to manually scan all those presumed spams, many of which
are really identical.  Once a message was identified as spam, what I
refer to as a loose checksum was computed to try to avoid saving the
same spam multiple times for later review.

    >> Would people be interested in the script?  I'd be happy to extricate
    >> it from my local modules and check it into CVS.

    Tim> Sure!  I think it's relevant, but maybe for another purpose.  Paul
    Tim> Svensson is thinking harder about real people than the rest
    Tim> of us, and he may be able to get use out of approaches that
    Tim> identify closely related spam.  For example, some amount of spam is
    Tim> going to end up in the ham training data in real life use, and any
    Tim> sort of similarity score to a piece of known spam may be an aid in
    Tim> finding and purging it.

I'll check it in.  Let me know if you find it useful.

Skip
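
For readers unfamiliar with the term, the loose checksum idea described
above can be sketched roughly as follows.  This is not the loosecksum.py
script itself, whose contents are not shown in this message.  The function
names, the normalization steps (dropping HTML tags, URLs, addresses, digits
and punctuation) and the choice of SHA-256 are all assumptions made for
illustration.

```python
import hashlib
import re


def loose_checksum(text):
    """Hash a message body after discarding the parts most likely to vary.

    Hypothetical normalization rules, not those used by loosecksum.py.
    """
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs (often carry tracking ids)
    text = re.sub(r"\S+@\S+", " ", text)       # email addresses
    text = re.sub(r"[^a-z]+", " ", text)       # digits, punctuation, whitespace runs
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def is_new_spam(text, seen):
    """Return True and remember the checksum if no loosely identical
    message has been seen before; otherwise return False."""
    digest = loose_checksum(text)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

Calling is_new_spam(body, seen) with a shared set for each message flagged
as spam would keep only the first copy of a mailing sent to several
addresses.  Messages that differ in wording would still get different
checksums, which fits the observation above that some duplicates slip
through.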