Replied: Mon, 16 Sep 2002 00:33:33 +0100
Replied: yyyy@example.com (Justin Mason)
Replied: Daniel Quinlan
Replied: Matt Sergeant
Replied: Spamassassin-devel
From quinlan@pathname.com Mon Sep 16 00:08:16 2002
Return-Path:
Delivered-To: yyyy@localhost.example.com
Received: from localhost (jalapeno [127.0.0.1])
    by jmason.org (Postfix) with ESMTP id 4B65C16F03
    for ; Mon, 16 Sep 2002 00:08:15 +0100 (IST)
Received: from jalapeno [127.0.0.1]
    by localhost with IMAP (fetchmail-5.9.0)
    for jm@localhost (single-drop); Mon, 16 Sep 2002 00:08:15 +0100 (IST)
Received: from proton.pathname.com (adsl-216-103-211-240.dsl.snfc21.pacbell.net [216.103.211.240])
    by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8ELf9C18215
    for ; Sat, 14 Sep 2002 22:41:10 +0100
Received: from quinlan by proton.pathname.com with local (Exim 3.35 #1 (Debian))
    id 17qKfW-0001tr-00; Sat, 14 Sep 2002 14:41:34 -0700
To: yyyy@example.com (Justin Mason)
Cc: Matt Sergeant , Spamassassin-devel
Subject: Re: [SAdev] testing with less rules
References: <20020913211335.564DD16F03@example.com>
From: Daniel Quinlan
Date: 14 Sep 2002 14:41:34 -0700
In-Reply-To: yyyy@example.com's message of "Fri, 13 Sep 2002 22:13:30 +0100"
Message-Id:
Lines: 51
X-Mailer: Gnus v5.7/Emacs 20.7
X-Spam-Status: No, hits=-8.7 required=7.0
    tests=AWL,EMAIL_ATTRIBUTION,IN_REP_TO,LINES_OF_YELLING,
    LINES_OF_YELLING_2,QUOTED_EMAIL_TEXT,REFERENCES,
    SPAM_PHRASE_00_01
    version=2.50-cvs
X-Spam-Level:

jm@jmason.org (Justin Mason) writes:

>>> DATE_IN_PAST_48_96
>>> SPAM_PHRASE_00_01
>>> SPAM_PHRASE_01_02
>>> SPAM_PHRASE_02_03
>>> SPAM_PHRASE_03_05
>>> SPAM_PHRASE_05_08

> I was thinking of just removing those particular rules, but keeping the
> other entries in the range, since they're proving too "noisy" to be
> effective. But I'd be willing to keep those ones in, all the same. What
> do you think? Matt/Craig, thoughts?

I think I could handle commenting out the lowest SPAM_PHRASE_XX_YY scores.
If the GA could handle this sort of thing so they'd automatically be
zeroed, I'd feel better since the ranges could change next time the
phrase list is regenerated or the algorithm tweaked.

I think we need to understand why DATE_IN_PAST_48_96 is so low before we
remove it. The two rules on either side perform quite well.

>> And here are the rules that seem like they should be better or should
>> be recoverable:

>>> FROM_MISSING
>>> GAPPY_TEXT
>>> INVALID_MSGID
>>> MIME_NULL_BLOCK
>>> SUBJ_MISSING

> well, I don't like SUBJ_MISSING, I reckon there's a world of mails from
> cron jobs (e.g.) which hit it.

Okay, drop SUBJ_MISSING.

> But, yes, the others for sure should be recoverable, and I'm sure there's
> more.

Probably a few, those seemed like the best prospects to me.

> BTW do you agree with the proposed methodology (ie. remove the rules and
> bugzilla each one?)

I only want a bugzilla ticket for each one if people are okay with quick
WONTFIX closes on the ones deemed unworthy of recovery. If you could
put the stats for each rule in the ticket somehow (should be automatable
with email at the very least), it would help.

Dan