What's the best way to train SpamAssassin during continuous operation? I have a script that calls sa-learn every time a user moves an email into or out of the Spam folder on my Dovecot IMAP server. In about 99% of cases, though, the trigger is a spam email that SpamAssassin missed and the user manually moved to the Spam folder. Ham is only trained when a user finds a misclassified email in the Spam folder and moves it back to the inbox.
Over time, this leads to vastly more spam training data than ham data. I am worried that this limits the effectiveness of SpamAssassin.
- Is this the case? Will asymmetric training affect SpamAssassin's effectiveness?
- What's the best way to solve this? Should I run a regular script (a cron job or similar) on the users' inboxes? If so, how can I make sure that a given inbox is good training data?
- Is there an existing tool or project that already tackles this problem, or do I need to build this myself?
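To make the cron-job idea concrete, here is a minimal Python sketch of what I have in mind. It rests on several assumptions: the mailboxes are stored as Maildir (so flags appear in the filename after `:2,`), a message that has been read, not trashed, and left in the inbox for a couple of weeks is probably ham, and file modification time is a good enough proxy for message age. The function names and the 14-day threshold are hypothetical; only `sa-learn --ham` is the real command.

```python
import subprocess
import time
from pathlib import Path


def select_ham_candidates(maildir, min_age_days=14, now=None):
    """Return inbox messages that look like safe ham training data.

    Heuristic (an assumption, not a SpamAssassin feature): the message
    is marked Seen, is not marked Trashed, and is older than min_age_days.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    candidates = []
    for path in sorted(Path(maildir, "cur").iterdir()):
        if not path.is_file():
            continue
        # Maildir filenames end in ":2,<flags>", e.g. "123.host:2,RS".
        _, sep, flags = path.name.rpartition(":2,")
        if not sep or "S" not in flags:
            continue  # no flag section, or the message is still unread
        if "T" in flags:
            continue  # marked for deletion, so the user may consider it junk
        if path.stat().st_mtime > cutoff:
            continue  # too recent; the user may not have triaged it yet
        candidates.append(path)
    return candidates


def learn_ham(paths, dry_run=True):
    """Build (and, unless dry_run, execute) the sa-learn call for the files."""
    cmd = ["sa-learn", "--ham", *[str(p) for p in paths]]
    if paths and not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

A cron job would call `learn_ham(select_ham_candidates("/path/to/Maildir"), dry_run=False)` per user. What I can't judge is whether "read and left alone" is a reliable enough signal, or whether this risks training missed spam as ham.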
Please note that this question concerns automatic retraining during operation after the initial training, so I am not asking about initializing the classifier with good training data.
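The size of the imbalance can be read from the Bayes database with `sa-learn --dump magic`. Below is a small Python sketch for pulling out the two counts. It assumes the usual output layout, in which each line ends in `non-token data: <name>` and the value sits in the third column; the function names are hypothetical.

```python
import subprocess


def parse_dump_magic(text):
    """Extract the nspam and nham counts from `sa-learn --dump magic` output."""
    counts = {}
    for line in text.splitlines():
        fields, sep, label = line.partition("non-token data:")
        if not sep:
            continue
        label = label.strip()
        if label in ("nspam", "nham"):
            # Assumed layout: the value is the third whitespace-separated column.
            counts[label] = int(fields.split()[2])
    return counts


def read_bayes_counts():
    """Run sa-learn as the current user and return {'nspam': ..., 'nham': ...}."""
    result = subprocess.run(
        ["sa-learn", "--dump", "magic"],
        capture_output=True, text=True, check=True,
    )
    return parse_dump_magic(result.stdout)
```

Comparing `nspam` with `nham` from `read_bayes_counts()` over time shows how quickly the two drift apart.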