Return-Path: tim.one@comcast.net Delivery-Date: Wed Sep 11 04:46:15 2002 From: tim.one@comcast.net (Tim Peters) Date: Tue, 10 Sep 2002 23:46:15 -0400 Subject: [Spambayes] XTreme Training In-Reply-To: Message-ID: [Tim] > ... > At the other extreme, training on half my ham&spam, and scoring aginst > the other half > ... > false positive rate: 0.0100% > false negative rate: 0.3636% > ... > Alas, all 4 of the 0.99 clues there are HTML-related. That begged to try it again but with Tokenize/retain_pure_html_tags false. The random halves getting trained on and scored against are different here, and I repaired the bug that dropped 1 ham and 1 spam on the floor, so this isn't exactly a 1-change difference between runs. Ham distribution for all runs: * = 167 items 0.00 9999 ************************************************************ 10.00 0 20.00 0 30.00 0 40.00 0 50.00 0 60.00 0 70.00 0 80.00 0 90.00 1 * Spam distribution for all runs: * = 115 items 0.00 21 * 10.00 0 20.00 0 30.00 1 * 40.00 0 50.00 0 60.00 1 * 70.00 0 80.00 1 * 90.00 6852 ************************************************************ false positive rate: 0.0100% false negative rate: 0.3490% Yay! That may mean that HTML tags aren't really needed in my test data provided it's trained on enough stuff. Curiously, the sole false positive here is the same as the sole false positive on the half&half run reported in the preceding msg (I assume the Nigerian scam "false positive" just happened to end up in the training data both times): ************************************************************************ Data/Ham/Set4/107687.txt prob = 0.999632042904 prob('python.') = 0.01 prob('alteration') = 0.01 prob('edinburgh') = 0.01 prob('subject:Python') = 0.01 prob('header:Errors-To:1') = 0.0216278 prob('thanks,') = 0.0319955 prob('help?') = 0.041806 prob('road,') = 0.0462364 prob('there,') = 0.0722794 prob('us.') = 0.906609 prob('our') = 0.919118 prob('company,') = 0.921852 prob('visit') = 0.930785 prob('sent.') = 0.939882 prob('e-mail') = 0.949765 prob('courses') = 0.954726 prob('received') = 0.955209 prob('analyst') = 0.960756 prob('investment') = 0.975139 prob('regulated') = 0.99 prob('e-mails') = 0.99 prob('mills') = 0.99 Received: from [195.171.5.71] (helo=node401.dmz.standardlife.com) by mail.python.org with esmtp (Exim 3.21 #1) id 15rDsu-00085k-00 for python-list@python.org; Wed, 10 Oct 2001 03:34:32 -0400 Received: from slukdcn4.internal.standardlife.com (slukdcn4.standardlife.com [10.3.2.72]) by node401.dmz.standardlife.com (Pro-8.9.3/Pro-8.9.3) with SMTP id IAA53660for ; Wed, 10 Oct 2001 08:34:00 +0100 Received: from sl079320 ([172.31.88.231]) by slukdcn4.internal.standardlife.com (Lotus SMTP MTA v4.6.6 (890.1 7-16-1999)) with SMTP id 80256AE1.00294B60; Wed, 10 Oct 2001 08:31:02 +0100 Message-ID: <007e01c1515d$bb255940$e7581fac@sl079320.internal.standardlife.com> From: "Vickie Mills" To: Subject: Training Courses in Python in UK Date: Wed, 10 Oct 2001 08:32:30 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 4.72.3155.0 X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3155.0 Sender: python-list-admin@python.org Errors-To: python-list-admin@python.org X-BeenThere: python-list@python.org X-Mailman-Version: 2.0.6 (101270) Precedence: bulk List-Help: List-Post: List-Subscribe: , List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: Hi there, I am looking for you recommendations on training courses available in the UK on Python. Can you help? Thanks, Vickie Mills IS Training Analyst Tel: 0131 245 1127 Fax: 0131 245 1550 E-mail: vickie_mills@standardlife.com For more information on Standard Life, visit our website http://www.standardlife.com/ The Standard Life Assurance Company, Standard Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland (No SZ4) and regulated by the Personal Investment Authority. Tel: 0131 225 2552 - calls may be recorded or monitored. This confidential e-mail is for the addressee only. If received in error, do not retain/copy/disclose it without our consent and please return it to us. We virus scan all e-mails but are not responsible for any damage caused by a virus or alteration by a third party after it is sent. ************************************************************************ The top 30 discriminators are more interesting now: 'income' 629 0.99 'http0:python' 643 0.01 'header:MiME-Version:1' 672 0.99 'http1:remove' 693 0.99 'content-type:text/html' 711 0.982345 'string' 714 0.01 'http>1:jpg' 776 0.99 'object' 813 0.01 'python,' 852 0.01 'python.' 882 0.01 'language' 883 0.01 '>>>' 907 0.01 'header:Return-Path:2' 907 0.99 'unsubscribe' 975 0.99 'header:Received:7' 1113 0.99 'def' 1142 0.01 'http>1:gif' 1168 0.99 'module' 1169 0.01 'import' 1332 0.01 'header:Received:8' 1342 0.99 'header:Errors-To:1' 1377 0.0216278 'header:In-Reply-To:1' 1402 0.01 'wrote' 1753 0.01 ' ' 2067 0.99 'subject:Python' 2140 0.01 'header:User-Agent:1' 2322 0.01 'header:X-Complaints-To:1' 4351 0.01 'wrote:' 4370 0.01 'python' 4972 0.01 'header:Organization:1' 6921 0.01 There are still two HTML clues remaining there (" " and "content-type:text/html"). Anthony's trick accounts for almost a third of these. "Python" appears in 5 of them ('http0:python' means that 'python' was found in the 1st field of an embedded http:// URL). Sticking a .gif or a .jpg in a URL both score as 0.99 spam clues. Note the damning pattern of capitalization in 'header:MiME-Version:1'! This counting is case-sensitive, and nobody ever would have guessed that MiME is more damning than SUBJECT or DATE. Why would spam be likely to end up with two instances of Return-Path in the headers?