Replied: Thu, 10 Oct 2002 13:14:29 +0100 Replied: yyyy@example.com (Justin Mason) Replied: Daniel Quinlan Replied: SpamAssassin-talk@example.sourceforge.net Replied: SpamAssassin-devel@example.sourceforge.net From quinlan@pathname.com Thu Oct 10 12:29:12 2002 Return-Path: Delivered-To: yyyy@localhost.example.com Received: from localhost (jalapeno [127.0.0.1]) by jmason.org (Postfix) with ESMTP id 4B24416F03 for ; Thu, 10 Oct 2002 12:29:11 +0100 (IST) Received: from jalapeno [127.0.0.1] by localhost with IMAP (fetchmail-5.9.0) for jm@localhost (single-drop); Thu, 10 Oct 2002 12:29:11 +0100 (IST) Received: from proton.pathname.com (adsl-216-103-211-240.dsl.snfc21.pacbell.net [216.103.211.240]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g9A4kRK08872 for ; Thu, 10 Oct 2002 05:46:27 +0100 Received: from quinlan by proton.pathname.com with local (Exim 3.35 #1 (Debian)) id 17zVDy-0006cM-00; Wed, 09 Oct 2002 21:47:02 -0700 To: yyyy@example.com (Justin Mason) Cc: SpamAssassin-talk@example.sourceforge.net, SpamAssassin-devel@lists.sourceforge.net, Steve Atkins , ion@aueb.gr, donatespam@archub.org, spambayes@python.org Subject: Re: [SAdev] fully-public corpus of mail available References: <20021009122116.6EB2416F03@example.com> From: Daniel Quinlan Date: 09 Oct 2002 21:47:02 -0700 In-Reply-To: yyyy@example.com's message of "Wed, 09 Oct 2002 13:21:11 +0100" Message-Id: Lines: 40 X-Mailer: Gnus v5.7/Emacs 20.7 X-Spam-Status: No, hits=-118.9 required=5.0 tests=AWL,IN_REP_TO,QUOTED_EMAIL_TEXT,REFERENCES, REPLY_WITH_QUOTES,T_NONSENSE_FROM_00_10, T_NONSENSE_FROM_10_20,T_NONSENSE_FROM_20_30, T_NONSENSE_FROM_30_40,T_NONSENSE_FROM_40_50, T_NONSENSE_FROM_50_60,T_NONSENSE_FROM_60_70, T_NONSENSE_FROM_70_80,T_NONSENSE_FROM_80_90, T_NONSENSE_FROM_90_91,T_NONSENSE_FROM_91_92, T_NONSENSE_FROM_92_93,T_NONSENSE_FROM_93_94, T_NONSENSE_FROM_94_95,T_NONSENSE_FROM_95_96, T_NONSENSE_FROM_96_97,T_NONSENSE_FROM_97_98, T_NONSENSE_FROM_98_99,T_NONSENSE_FROM_99_100, T_QUOTED_EMAIL_TEXT,USER_AGENT_GNUS_XM version=2.50-cvs X-Spam-Level: > (Please feel free to forward this message to other possibly-interested > parties.) Some caveats (in decending order of concern): 1. These messages could end up being falsely (or incorrectly) reported to Razor, DCC, Pyzor, etc. Certain RBLs too. I don't think the results for these distributed tests can be trusted in any way, shape, or form when running over a public corpus. 2. These messages could also be submitted (more than once) to projects like SpamAssassin that rely on filtering results submission for GA tuning and development. 3. Spammers could adopt elements of the good messages to throw off filters. And, of course, there's always progression in technology (by both spammers and non-spammers). The second problem could be alleviated somewhat by adding a Nilsimsa signature (or similar) to the mass-check file (the results format used by SpamAssassin) and giving the message files unique names (MD5 or SHA-1 of each file). The third problem doesn't really worry me. These problems (and perhaps others I have not identified) are unique to spam filtering. Compression corpuses and other performance-related corpuses have their own set of problems, of course. In other words, I don't think there's any replacement for having multiple independent corpuses. Finding better ways to distribute testing and collate results seems like a more viable long-term solution (and I'm glad we're working on exactly that for SpamAssassin). If you're going to seriously work on filter development, building a corpus of 10000-50000 messages (half spam/half non-spam) is not really that much work. If you don't get enough spam, creating multi-technique spamtraps (web, usenet, replying to spam) is pretty easy. And who doesn't get thousands of non-spam every week? ;-) Dan