195 lines
6.7 KiB
Plaintext
195 lines
6.7 KiB
Plaintext
Return-Path: tim.one@comcast.net
|
|
Delivery-Date: Wed Sep 11 04:46:15 2002
|
|
From: tim.one@comcast.net (Tim Peters)
|
|
Date: Tue, 10 Sep 2002 23:46:15 -0400
|
|
Subject: [Spambayes] XTreme Training
|
|
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEGPBDAB.tim.one@comcast.net>
|
|
Message-ID: <LNBBLJKPBEHFEDALKOLCOEHDBDAB.tim.one@comcast.net>
|
|
|
|
[Tim]
|
|
> ...
|
|
> At the other extreme, training on half my ham&spam, and scoring aginst
|
|
> the other half
|
|
> ...
|
|
> false positive rate: 0.0100%
|
|
> false negative rate: 0.3636%
|
|
> ...
|
|
> Alas, all 4 of the 0.99 clues there are HTML-related.
|
|
|
|
That begged to try it again but with Tokenize/retain_pure_html_tags false.
|
|
The random halves getting trained on and scored against are different here,
|
|
and I repaired the bug that dropped 1 ham and 1 spam on the floor, so this
|
|
isn't exactly a 1-change difference between runs.
|
|
|
|
Ham distribution for all runs:
|
|
* = 167 items
|
|
0.00 9999 ************************************************************
|
|
10.00 0
|
|
20.00 0
|
|
30.00 0
|
|
40.00 0
|
|
50.00 0
|
|
60.00 0
|
|
70.00 0
|
|
80.00 0
|
|
90.00 1 *
|
|
|
|
Spam distribution for all runs:
|
|
* = 115 items
|
|
0.00 21 *
|
|
10.00 0
|
|
20.00 0
|
|
30.00 1 *
|
|
40.00 0
|
|
50.00 0
|
|
60.00 1 *
|
|
70.00 0
|
|
80.00 1 *
|
|
90.00 6852 ************************************************************
|
|
|
|
false positive rate: 0.0100%
|
|
false negative rate: 0.3490%
|
|
|
|
Yay! That may mean that HTML tags aren't really needed in my test data
|
|
provided it's trained on enough stuff. Curiously, the sole false positive
|
|
here is the same as the sole false positive on the half&half run reported in
|
|
the preceding msg (I assume the Nigerian scam "false positive" just happened
|
|
to end up in the training data both times):
|
|
|
|
************************************************************************
|
|
Data/Ham/Set4/107687.txt
|
|
prob = 0.999632042904
|
|
prob('python.') = 0.01
|
|
prob('alteration') = 0.01
|
|
prob('edinburgh') = 0.01
|
|
prob('subject:Python') = 0.01
|
|
prob('header:Errors-To:1') = 0.0216278
|
|
prob('thanks,') = 0.0319955
|
|
prob('help?') = 0.041806
|
|
prob('road,') = 0.0462364
|
|
prob('there,') = 0.0722794
|
|
prob('us.') = 0.906609
|
|
prob('our') = 0.919118
|
|
prob('company,') = 0.921852
|
|
prob('visit') = 0.930785
|
|
prob('sent.') = 0.939882
|
|
prob('e-mail') = 0.949765
|
|
prob('courses') = 0.954726
|
|
prob('received') = 0.955209
|
|
prob('analyst') = 0.960756
|
|
prob('investment') = 0.975139
|
|
prob('regulated') = 0.99
|
|
prob('e-mails') = 0.99
|
|
prob('mills') = 0.99
|
|
|
|
Received: from [195.171.5.71] (helo=node401.dmz.standardlife.com)
|
|
by mail.python.org with esmtp (Exim 3.21 #1)
|
|
id 15rDsu-00085k-00
|
|
for python-list@python.org; Wed, 10 Oct 2001 03:34:32 -0400
|
|
Received: from slukdcn4.internal.standardlife.com (slukdcn4.standardlife.com
|
|
[10.3.2.72])
|
|
by node401.dmz.standardlife.com (Pro-8.9.3/Pro-8.9.3) with SMTP id
|
|
IAA53660for <python-list@python.org>; Wed, 10 Oct 2001 08:34:00
|
|
+0100
|
|
Received: from sl079320 ([172.31.88.231]) by
|
|
slukdcn4.internal.standardlife.com (Lotus SMTP MTA v4.6.6 (890.1
|
|
7-16-1999))
|
|
with SMTP id 80256AE1.00294B60; Wed, 10 Oct 2001 08:31:02 +0100
|
|
Message-ID:
|
|
<007e01c1515d$bb255940$e7581fac@sl079320.internal.standardlife.com>
|
|
From: "Vickie Mills" <vickie_mills@standardlife.com>
|
|
To: <python-list@python.org>
|
|
Subject: Training Courses in Python in UK
|
|
Date: Wed, 10 Oct 2001 08:32:30 +0100
|
|
MIME-Version: 1.0
|
|
Content-Type: text/plain;
|
|
charset="iso-8859-1"
|
|
Content-Transfer-Encoding: 7bit
|
|
X-Priority: 3
|
|
X-MSMail-Priority: Normal
|
|
X-Mailer: Microsoft Outlook Express 4.72.3155.0
|
|
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3155.0
|
|
Sender: python-list-admin@python.org
|
|
Errors-To: python-list-admin@python.org
|
|
X-BeenThere: python-list@python.org
|
|
X-Mailman-Version: 2.0.6 (101270)
|
|
Precedence: bulk
|
|
List-Help: <mailto:python-list-request@python.org?subject=help>
|
|
List-Post: <mailto:python-list@python.org>
|
|
List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
|
|
<mailto:python-list-request@python.org?subject=subscribe>
|
|
List-Id: General discussion list for the Python programming language
|
|
<python-list.python.org>
|
|
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
|
|
<mailto:python-list-request@python.org?subject=unsubscribe>
|
|
List-Archive: <http://mail.python.org/pipermail/python-list/>
|
|
|
|
Hi there,
|
|
|
|
I am looking for you recommendations on training courses available in the UK
|
|
on Python. Can you help?
|
|
|
|
Thanks,
|
|
|
|
Vickie Mills
|
|
IS Training Analyst
|
|
|
|
Tel: 0131 245 1127
|
|
Fax: 0131 245 1550
|
|
E-mail: vickie_mills@standardlife.com
|
|
|
|
For more information on Standard Life, visit our website
|
|
http://www.standardlife.com/ The Standard Life Assurance Company, Standard
|
|
Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland
|
|
(No SZ4) and regulated by the Personal Investment Authority. Tel: 0131 225
|
|
2552 - calls may be recorded or monitored. This confidential e-mail is for
|
|
the addressee only. If received in error, do not retain/copy/disclose it
|
|
without our consent and please return it to us. We virus scan all e-mails
|
|
but are not responsible for any damage caused by a virus or alteration by a
|
|
third party after it is sent.
|
|
************************************************************************
|
|
|
|
The top 30 discriminators are more interesting now:
|
|
|
|
'income' 629 0.99
|
|
'http0:python' 643 0.01
|
|
'header:MiME-Version:1' 672 0.99
|
|
'http1:remove' 693 0.99
|
|
'content-type:text/html' 711 0.982345
|
|
'string' 714 0.01
|
|
'http>1:jpg' 776 0.99
|
|
'object' 813 0.01
|
|
'python,' 852 0.01
|
|
'python.' 882 0.01
|
|
'language' 883 0.01
|
|
'>>>' 907 0.01
|
|
'header:Return-Path:2' 907 0.99
|
|
'unsubscribe' 975 0.99
|
|
'header:Received:7' 1113 0.99
|
|
'def' 1142 0.01
|
|
'http>1:gif' 1168 0.99
|
|
'module' 1169 0.01
|
|
'import' 1332 0.01
|
|
'header:Received:8' 1342 0.99
|
|
'header:Errors-To:1' 1377 0.0216278
|
|
'header:In-Reply-To:1' 1402 0.01
|
|
'wrote' 1753 0.01
|
|
' ' 2067 0.99
|
|
'subject:Python' 2140 0.01
|
|
'header:User-Agent:1' 2322 0.01
|
|
'header:X-Complaints-To:1' 4351 0.01
|
|
'wrote:' 4370 0.01
|
|
'python' 4972 0.01
|
|
'header:Organization:1' 6921 0.01
|
|
|
|
There are still two HTML clues remaining there (" " and
|
|
"content-type:text/html"). Anthony's trick accounts for almost a third of
|
|
these. "Python" appears in 5 of them ('http0:python' means that 'python'
|
|
was found in the 1st field of an embedded http:// URL). Sticking a .gif or
|
|
a .jpg in a URL both score as 0.99 spam clues. Note the damning pattern of
|
|
capitalization in 'header:MiME-Version:1'! This counting is case-sensitive,
|
|
and nobody ever would have guessed that MiME is more damning than SUBJECT or
|
|
DATE. Why would spam be likely to end up with two instances of Return-Path
|
|
in the headers?
|
|
|