Return-Path: tim.one@comcast.net
Delivery-Date: Fri Sep 6 15:45:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 06 Sep 2002 10:45:27 -0400
Subject: [Spambayes] test sets?
In-Reply-To: <15736.5577.157228.229200@12-248-11-90.client.attbi.com>
Message-ID:

[Tim]
> OTOH, my Data/ subtree currently has more than 35,000 files slobbering
> over 134 million bytes -- even if I had a place to put that much stuff,
> I'm not sure my ISP would let me email it in one msg.

[Skip]
> Do you have a dialup or something more modern?

Much more modern: a cable modem with a small upload rate cap. There's a
reason the less modern uncapped @Home went out of business.

> 134MB of messages zipped would probably compress pretty well - under 50MB
> I'd guess with all the similarity in the headers and such. You could zip
> each of the 10 sets individually and upload them somewhere.

I suppose this could finish over the course of an afternoon. Now where's
"somewhere"? I expect we'll eventually collect several datasets;
SourceForge isn't a good place for them (they expect projects to
distribute relatively small code files, and complain if even those get
big).

> ...
> How about random sampling lots of public mailing lists via gmane or
> something similar, manually cleaning it (distributing that load over a
> number of people) and then relying on your clever code and your
> rebalancing script to help further cleanse it?

What then are we training the classifier to do? Graham's scoring scheme
is based on an assumption that the ham-vs-spam task is *easy*, and half
of that is because the ham has a lot in common. It was an experiment to
apply his scheme to all the comp.lang.python traffic, which is a lot
broader than he had in mind (c.l.py has long had a generous definition of
"on topic"). I don't expect good things to come of making it ever
broader, *unless* your goal is to investigate just how broad it can be
made before it breaks down.

> The "problem" with the ham is it tends to be much more tied to one person
> (not just intimate, but unique) than the spam.

Which is "a feature" from Graham's POV: the more clues, the better this
"smoking guns only" approach should work.

> I save all incoming email for ten days (gzipped mbox format) before it rolls
> over and disappears. At any one time I think I have about 8,000-10,000
> messages. Most of it isn't terribly personal (which I would cull before
> passing along anyway) and much of it is machine-generated, so would be of
> marginal use. Finally, it's all ham-n-spam mixed together. Do we call
> that an omelette or a Denny's Grand Slam?

Unless you're volunteering to clean it, tag it, package it, and
distribute it, I'd call it irrelevant.