Return-Path: tim.one@comcast.net
Delivery-Date: Tue Sep 10 04:18:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 09 Sep 2002 23:18:25 -0400
Subject: [Spambayes] Current histograms
In-Reply-To:
Message-ID:

We've not only reduced the f-p and f-n rates in my test runs, we've also
made the score distributions substantially sharper. This is bad news for
Greg, because the non-existent "middle ground" is becoming even less
existent:

Ham distribution for all runs:
* = 1333 items
 0.00 79975 ************************************************************
 2.50     1 *
 5.00     0
 7.50     0
10.00     2 *
12.50     1 *
15.00     0
17.50     0
20.00     0
22.50     1 *
25.00     0
27.50     0
30.00     0
32.50     0
35.00     0
37.50     1 *
40.00     0
42.50     0
45.00     0
47.50     0
50.00     0
52.50     0
55.00     0
57.50     0
60.00     1 *
62.50     0
65.00     1 *
67.50     0
70.00     0
72.50     0
75.00     0
77.50     0
80.00     0
82.50     0
85.00     0
87.50     0
90.00     0
92.50     0
95.00     0
97.50    17 *

Spam distribution for all runs:
* = 914 items
 0.00   118 *
 2.50     7 *
 5.00     0
 7.50     2 *
10.00     1 *
12.50     1 *
15.00     3 *
17.50     1 *
20.00     1 *
22.50     1 *
25.00     0
27.50     0
30.00     4 *
32.50     3 *
35.00     4 *
37.50     2 *
40.00     0
42.50     1 *
45.00     1 *
47.50     0
50.00     2 *
52.50     3 *
55.00     1 *
57.50     2 *
60.00     0
62.50     1 *
65.00     1 *
67.50    10 *
70.00     2 *
72.50     1 *
75.00     2 *
77.50     1 *
80.00     0
82.50     0
85.00     1 *
87.50     4 *
90.00     2 *
92.50     5 *
95.00    14 *
97.50 54806 ************************************************************

As usual for me, this is an aggregate of 20 runs, each both training and
predicting on 4000 c.l.py ham + ~2750 BruceG spam. Only 25 ham scores out
of 80,000 are above 0.025 now (and, yes, the "Nigerian scam"-quoting msg is
still counted as ham -- I haven't taken anything out of the ham corpus since
removing the "If AOL were a car" spam). The f-p rate wouldn't have changed
at all if the spamprob cutoff were dropped from 0.90 to 0.675; dropping the
cutoff to 0.40 would have added only 2 false positives, and dropping it to
0.15 would have added only 2 more! It's spooky.
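
For anyone who wants to check the cutoff arithmetic against the ham histogram above, here is a minimal sketch. It is not part of the test driver: `HAM_BUCKETS` and `false_positives` are hypothetical names made up for this illustration, and the dict holds only the nonzero ham buckets, copied by hand from the histogram. Cutoffs are written in the histogram's own units (percent), and the count is only exact when the cutoff falls on a 2.5-wide bucket boundary, which all four cutoffs mentioned do.

```python
# Nonzero ham buckets copied from the histogram above.
# Key: bucket lower bound in percent; value: number of ham messages.
HAM_BUCKETS = {
    0.0: 79975, 2.5: 1, 10.0: 2, 12.5: 1, 22.5: 1,
    37.5: 1, 60.0: 1, 65.0: 1, 97.5: 17,
}


def false_positives(buckets, cutoff_percent):
    """Count ham in buckets starting at or above the cutoff.

    Assumes the cutoff lies on a bucket boundary; otherwise the
    histogram alone cannot say how a straddled bucket splits.
    """
    return sum(n for lower, n in buckets.items() if lower >= cutoff_percent)


if __name__ == "__main__":
    print("total ham:", sum(HAM_BUCKETS.values()))
    print("ham above 0.025:", false_positives(HAM_BUCKETS, 2.5))
    for cutoff in (90.0, 67.5, 40.0, 15.0):
        print(f"cutoff {cutoff / 100:.3f}: "
              f"{false_positives(HAM_BUCKETS, cutoff)} false positives")
```

With these bucket counts the four cutoffs give 17, 17, 19 and 21 false positives, which matches the "no change, then 2, then 2 more" progression described above.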