Return-Path: skip@pobox.com
Delivery-Date: Mon Sep 9 17:31:12 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 9 Sep 2002 11:31:12 -0500
Subject: [Spambayes] deleting "duplicate" spam before training? good idea or bad?
Message-ID: <15740.52432.861148.597750@12-248-11-90.client.attbi.com>

Because I get mail through several different email addresses, I frequently
get duplicates (or triplicates or more-plicates) of various spam messages.
In saving spam for later analysis I haven't always been careful to avoid
saving such duplicates.

I wrote a script some time ago to try to minimize the duplicates I see by
calculating a loose checksum, but I still have some duplicates. Should I
delete the duplicates before training or not?

Would people be interested in the script? I'd be happy to extricate it from
my local modules and check it into CVS.

Skip
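
For illustration only, here is a minimal sketch of what a "loose checksum" could look like. It is not the script mentioned above. The names loose_checksum and drop_duplicates are hypothetical, and the normalization rules (hash only the body, fold case, drop digits and punctuation, collapse whitespace) are assumptions about what "loose" might mean.

```python
import email
import hashlib
import re


def loose_checksum(raw_message: str) -> str:
    """Return a digest that ignores headers, case, digits, and spacing.

    Hypothetical helper: the normalization rules are assumptions, not
    the behavior of any existing Spambayes tool.
    """
    msg = email.message_from_string(raw_message)
    # Hash only the text parts of the body. Headers such as To: and
    # Received: differ between copies sent to different addresses.
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            parts.append(payload.decode("latin-1"))
    body = "\n".join(parts).lower()
    # Replace every run of non-letters (digits, punctuation, whitespace)
    # with one space so small per-recipient variations do not matter.
    body = re.sub(r"[^a-z]+", " ", body).strip()
    return hashlib.sha1(body.encode("ascii")).hexdigest()


def drop_duplicates(raw_messages):
    """Keep the first message seen for each loose checksum."""
    seen = set()
    unique = []
    for raw in raw_messages:
        digest = loose_checksum(raw)
        if digest not in seen:
            seen.add(digest)
            unique.append(raw)
    return unique
```

A checksum this loose trades precision for recall: it will merge copies that differ only in recipient, case, or embedded tracking numbers, but it will still miss spam whose wording has been altered, which would be consistent with some duplicates slipping through.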