2014 Linux Symposium, July 14-16

Performance Improvement of SpamAssassin

Geet Kaur Sukhmani (gaurukhmani@gmail.com)

Email spam, also known as junk email or unsolicited bulk email, is a subset of electronic spam involving nearly identical messages sent to numerous recipients by email. Email spam has grown steadily since the early 1990s. We use SpamAssassin 3.3.2, a widely used open-source, heuristic-based spam filter that applies a large number of weighted tests to a message, sums the results of the tests, and labels the message as spam if the sum exceeds a user-defined threshold. SpamAssassin combines several techniques to determine whether an email is spam: header analysis, text analysis, blacklists and RBLs, Bayesian filters, and hash databases. These filters are applied to each email sequentially. As ruleset files grow, testing takes longer, and some rules, such as body rules or Optical Character Recognition (OCR) rules, are slower than others. For a legitimate email, every rule is tested. In this project we try to parallelize this task. Parallel processing can be applied in two areas: distributing emails across parallel workers, and testing each email against different rule types in parallel and then aggregating the scores. By testing email against different rule types in parallel, the performance of SpamAssassin is improved significantly.
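
The second idea, testing one email against different rule types in parallel and then aggregating the score, can be illustrated with a minimal Python sketch. It is not SpamAssassin's implementation (SpamAssassin is written in Perl), and the rule names, patterns, weights, and the functions `score_group` and `classify` are all hypothetical, chosen only for illustration. The threshold of 5.0 is an assumption and would be user-defined in practice.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule groups, one per rule type: (rule name, pattern, weight).
# None of these rules are taken from a real SpamAssassin ruleset.
RULE_GROUPS = {
    "header": [
        ("SUBJ_ALL_CAPS", re.compile(r"^Subject: [^a-z]+$", re.M), 1.5),
        ("FROM_NUMERIC", re.compile(r"^From: \d+@", re.M), 1.0),
    ],
    "body": [
        ("FREE_MONEY", re.compile(r"free money", re.I), 2.5),
        ("CLICK_HERE", re.compile(r"click here", re.I), 1.5),
    ],
}


def score_group(message, rules):
    """Sum the weights of every rule in one group that matches the message."""
    return sum(weight for _name, pattern, weight in rules if pattern.search(message))


def classify(message, rule_groups=RULE_GROUPS, threshold=5.0):
    """Score each rule group in its own worker, then aggregate the partial scores."""
    with ThreadPoolExecutor(max_workers=len(rule_groups)) as pool:
        partial_scores = pool.map(
            lambda rules: score_group(message, rules), rule_groups.values()
        )
        total = sum(partial_scores)
    # Spam if the summed score exceeds the (assumed) threshold.
    return total, total > threshold


spam = "Subject: WIN NOW\n\nClick here for FREE MONEY today"
ham = "Subject: Lunch tomorrow\n\nSee you at noon."
print(classify(spam))
print(classify(ham))
```

The sketch shows only the structure: independent rule groups, one worker per group, and a final sum. In CPython, threads do not speed up CPU-bound pattern matching, so a real speedup would need separate processes or a runtime without that limitation.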
