crmtrainer
This script was written with the CRM114 spam filter
in mind, but can easily be adapted to any Bayesian filter. The idea is to give remote users a simple
way to train their server-side filter. Recipe:
- Have an IMAP server with storage in mbox format
- Have CRM114 properly installed and filtering (of course, still untrained).
It should tag email by appending the standard X-CRM114-* header, and
automatically send spam-tagged emails to a Spam folder.
- For every user, identify the paths to their Inbox and Spam folders. Have
crmtrainer run regularly on them.
When a spam message slips through to a user's Inbox, the user simply moves it to their Spam
folder; crmtrainer notices this and notifies CRM114 of its error. Similarly,
the user can move a misclassified ham out of their Spam folder, and crmtrainer will
retrain the filter accordingly. The user only has to move emails between their Inbox and Spam
folders, as they naturally would.
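The decision described above can be sketched as follows. This is only an illustration in Python, not crmtrainer's actual code: the function name `retrain_action`, the folder labels, and the assumption that the tag lives in an `X-CRM114-Status` header whose value starts with `Good` or `SPAM` are all hypothetical and should be adapted to your CRM114 setup.

```python
from email import message_from_string


def retrain_action(raw_message, folder):
    """Return 'spam', 'nonspam' or None for one message.

    folder is 'inbox' or 'spam' (where the user left the message).
    ASSUMPTION: the filter's verdict is stored in an X-CRM114-Status
    header whose value starts with 'Good' or 'SPAM'.
    """
    msg = message_from_string(raw_message)
    status = (msg.get("X-CRM114-Status") or "").strip().upper()

    # Tagged as good but found in the Spam folder: the user moved a
    # missed spam there, so the filter should learn it as spam.
    if folder == "spam" and status.startswith("GOOD"):
        return "spam"

    # Tagged as spam but found in the Inbox: the user rescued a
    # misclassified ham, so the filter should learn it as non-spam.
    if folder == "inbox" and status.startswith("SPAM"):
        return "nonspam"

    # Tag and folder agree (or there is no tag): nothing to retrain.
    return None
```

The sketch deliberately leaves out how the filter is actually invoked and how already-retrained messages are remembered between runs; both depend on the local installation.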
Notes
- crmtrainer was first written with Mail::Box (for mbox, mh and maildir support), but that was
awfully slow (about 100-500x slower than the simple mbox parser now in crmtrainer).
- CRM114 works best with train-on-errors (TOE), which is also a very natural way for a human
to train a filter.
- crmtrainer's workload is mostly proportional to the number of emails to retrain,
and a user will only retrain a few misplaced emails a day once the CRM114 corpus is mature.
- crmtrainer should be run from a user cron job (crontab -e). Ideally, it would be triggered
by the IMAP server when a 'move mail' operation occurs.
- crmtrainer does not properly lock the mbox files (it only reads them, though). This means that
in the worst case it could retrain on garbled emails.
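To give an idea of what a "simple mbox parser" can look like, here is a minimal read-only sketch in Python. It is not the parser shipped in crmtrainer; the function name `split_mbox` is hypothetical, and the sketch assumes that any body line starting with "From " has been quoted as ">From " by the mail server, as is usual for mbox storage.

```python
def split_mbox(text):
    """Split the contents of an mbox file into a list of raw messages.

    A line starting with 'From ' marks the start of a new message.
    The separator line itself is not kept in the returned messages.
    """
    messages = []
    current = None  # lines of the message being read, None before the first one

    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            # A new message begins: store the previous one, if any.
            if current is not None:
                messages.append("".join(current))
            current = []
        elif current is not None:
            current.append(line)
        # Lines before the first 'From ' separator are ignored.

    if current is not None:
        messages.append("".join(current))
    return messages
```

Since the file is read without locking, a message that is being written while the script runs may come out truncated, which is the garbled-email case mentioned in the last note.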
Contact: Vincent Caron <vincent _at_ zerodeux.net>
Last modified on Sunday 26 Jun 2005 15:11