PrismEmail offers Bayesian Spam Filtering.
Bayesian filtering is a method of filtering
spam using statistics. This approach is up to 99.5% effective.
This means that if you currently receive 20 spams per day,
Bayesian filtering can reduce that to one spam every 10 days or so.
Bayesian filtering "learns" from the good email and spam you
receive. That means that, over time, your Bayesian filter
will actually become more effective and you'll receive
less and less spam.
To help your Bayesian filter learn correctly, you have
two responsibilities:
- Report spam. If any spam gets through, you
should report it. This lets the Bayesian filter learn from
its mistake and makes it better able to detect similar spam
in the future.
- Report false positives. If any valid email is incorrectly
flagged as spam, you need to report it to the system
so that the Bayesian filter learns how to identify good
email.
PrismEmail makes it very easy to report either of these two
types of incidents. As time goes on, the Bayesian filter
will become more and more accurate, and you will need to
make these reports less and less often.
As a user, that's all you need to know. However, if you'd
like to know more about how Bayesian filtering works, feel
free to read the rest of this page. We will explain this approach
to spam filtering here in an easy-to-read and understandable
format. Those who are interested in a very thorough and complete
technical explanation of how this works are invited to visit
Paul Graham's site. Mr. Graham has an excellent site with a
large amount of space dedicated to explaining the hows and
whys of this approach to spam filtering.
THE BAYESIAN METHOD
The Bayesian approach is actually quite simple: It calculates
the probability that a given message is spam, based
on the contents of that message and on the contents
of the good email and spam that you have received in the past.
Unlike traditional approaches, it is not a reactive filter.
It uses past good email and past spam that you
have received as a predictor to determine whether a new message
is probably spam or not.
There are a few numbers and percentages in the next few paragraphs.
Don't let them scare you. It doesn't get mathematical; it just
uses some numbers to illustrate a point.
Let's say you have received 1000 good emails and 1000 spams.
The word "click" appeared in 35 of the good emails and appeared
in 750 of the spam emails. While we won't go into the mathematics
behind it, Bayesian statistics tells us that if the word "click"
appears in 35 of 1000 good emails and in 750 of 1000 spam
emails, then the presence of the word "click" means that the
given message has a 95.54% chance of being spam.
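For readers who do want to see the arithmetic, the 95.54% figure can be reproduced in a few lines of Python. This is a minimal sketch, not PrismEmail's actual code: the function name word_spam_probability is made up for illustration, and it assumes spam and good email are weighted equally, which is what the example numbers imply.

```python
def word_spam_probability(spam_hits, spam_total, good_hits, good_total):
    """Chance that a message containing the word is spam.

    Assumes spam and good email are equally likely beforehand
    (an assumption of this sketch, consistent with the example figures).
    """
    spam_rate = spam_hits / spam_total   # share of spams containing the word
    good_rate = good_hits / good_total   # share of good emails containing it
    return spam_rate / (spam_rate + good_rate)


# "click": 750 of 1000 spams, 35 of 1000 good emails
click = word_spam_probability(750, 1000, 35, 1000)
print(f"{click:.2%}")  # prints 95.54%
```

The same function applies to any word for which both counts are known.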
Further, let's say that a message containing the word "sex"
has a 98.62% chance of being spam. Given no other information
about the email, if a future message contains both the word
"click" and the word "sex", Bayesian statistics tells us that
there is a 99.93% chance that the message is spam.
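The 99.93% figure is consistent with the usual way of combining per-word probabilities, sketched below. The helper name combine is hypothetical, and treating the words as independent of one another is a simplifying assumption of this kind of filter.

```python
from math import prod


def combine(probabilities):
    """Combine per-word spam probabilities, treating words as independent."""
    spam_side = prod(probabilities)                  # all words point to spam
    good_side = prod(1 - p for p in probabilities)   # all words point to good
    return spam_side / (spam_side + good_side)


# "click" (95.54%) together with "sex" (98.62%)
print(f"{combine([0.9554, 0.9862]):.2%}")  # prints 99.93%
```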
But what if these words were in a message between you and
a friend named Tom and were part of the phrase: "The cat's
sex is male. Getting medicine for him is one click away on
some pet website. Do you want to buy it, Tom?" This is an
innocent message, not spam. Yet by what we just said, there
would be a 99.93% chance that it's spam. Not quite.
Bayesian spam filtering doesn't just consider the "bad"
words to determine spam; it also considers the good words.
For example, maybe 40 of 1000 of your good emails contain
the word "Tom" (since you often receive email from him) but
only 3 of 1000 spams contained Tom's name. Given that information
on the word "Tom", Bayesian statistics tells us that if a
message contains the word "Tom", there's only a 6.98%
chance that the message is spam. Further, let's say that
no spam ever contained the word "cat" and a few of your good
emails did. So we treat a message containing "cat" as having
only a 1% probability of being spam.
Bayesian statistics tells us that if the message contains the words
"click", "sex", "Tom", and "cat", the spam probability
is only 53.8%. Of course, in reality we don't just consider
the two best and two worst words. We consider all the
unusually good and bad words in the message. Using the Bayesian
approach, a spam message usually has a very high probability
of being spam. That is to say, most spams rank between
90% and 99%. Very few rank lower. So with Bayesian filtering,
we can say that "Any message with a probability of over 90%
is spam, anything else is good." With such an approach a message
like the one above would get through. Almost all spam, however,
would not make it through, since it would probably not
contain words like "Tom" and "cat", which reduced the spam
probability for that message.
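The whole decision can be sketched as follows. The word probabilities, the 90% threshold, and the sample message come from the text; the function names, the simple tokenizer, and the rule for what counts as an "unusually" good or bad word (here, any known word at least 40 points away from 50%) are assumptions made for illustration. With these rounded inputs the sample message scores roughly 54%, in line with the figure quoted in the text.

```python
import re
from math import prod

# Per-word spam probabilities taken from the examples in the text.
WORD_PROBS = {"click": 0.9554, "sex": 0.9862, "tom": 0.0698, "cat": 0.01}

SPAM_THRESHOLD = 0.90   # "over 90% is spam", as stated in the text
MIN_DISTANCE = 0.40     # assumed cutoff for "unusually" good or bad words


def spam_probability(text, word_probs=WORD_PROBS):
    """Combine the probabilities of the interesting words found in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    interesting = [
        p for w, p in word_probs.items()
        if w in words and abs(p - 0.5) >= MIN_DISTANCE
    ]
    if not interesting:
        return 0.5  # nothing to go on: treat as undecided (assumption)
    spam_side = prod(interesting)
    good_side = prod(1 - p for p in interesting)
    return spam_side / (spam_side + good_side)


def is_spam(text):
    return spam_probability(text) > SPAM_THRESHOLD


message = ("The cat's sex is male. Getting medicine for him is one click "
           "away on some pet website. Do you want to buy it, Tom?")
print(round(spam_probability(message), 2), is_spam(message))
```

A message that contains only "click" and "sex" scores above the threshold and is held back, while the message to Tom is delivered.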
BAYESIAN "LEARNS" AUTOMATICALLY FOR EACH USER
The beauty of the Bayesian approach is that it "learns" automatically,
and does so for each user independently. The spam probability
for the word "Tom" might be 6.98% if you know someone named
Tom, but if you don't know anyone named Tom the probability
might be 98%. It learns based on your email.
This also means you don't have to tell the system the name
of everyone you know. The system, over time, will automatically
detect those words that are normally part of good email and
will also detect those words and features that are normally
an indication of spam.
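To illustrate per-user learning, here is a small sketch in which each user has a separate set of word counts. The UserFilter class, its method names, and the two sample users are hypothetical. The 1% floor mirrors the "cat" example; the matching 99% ceiling and the neutral 50% for words with no history are assumptions.

```python
import re
from collections import Counter


class UserFilter:
    """Word statistics for one user's mailbox (hypothetical structure)."""

    def __init__(self):
        self.totals = Counter()  # messages seen per label
        self.words = {"good": Counter(), "spam": Counter()}

    def train(self, text, label):
        """Record one message; label is 'good' or 'spam'."""
        self.totals[label] += 1
        # Count each word once per message, as in the "35 of 1000" examples.
        self.words[label].update(set(re.findall(r"[a-z]+", text.lower())))

    def word_probability(self, word):
        """Spam probability for a word, based only on this user's mail."""
        if not self.totals["good"] or not self.totals["spam"]:
            return 0.5  # no history yet (assumption)
        spam_rate = self.words["spam"][word] / self.totals["spam"]
        good_rate = self.words["good"][word] / self.totals["good"]
        if spam_rate + good_rate == 0:
            return 0.5  # word never seen (assumption)
        p = spam_rate / (spam_rate + good_rate)
        return min(0.99, max(0.01, p))  # clamp, as with "cat" at 1%


# Two users, same word, different histories.
alice, bob = UserFilter(), UserFilter()
alice.train("Lunch on Friday? Tom", "good")
alice.train("Cheap pills, click now", "spam")
bob.train("Meeting moved to 3pm", "good")
bob.train("Tom says click here to win", "spam")

print(alice.word_probability("tom"))  # 0.01: "tom" is only in Alice's good mail
print(bob.word_probability("tom"))    # 0.99: "tom" is only in Bob's spam
```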
HOW IT WORKS AT PRISMEMAIL
The above system is handled automatically by PrismEmail,
if you wish. The Bayesian filter is optional, although for
best results we highly recommend you use it.
Using the Bayesian filter does require a small amount
of responsibility on your part as the user. Since the
Bayesian filter learns, it must be told when it makes a mistake
so that the same mistake isn't made in the future. That means
that if a spam gets through to your inbox, you must report
that message as spam so that the Bayesian filter can update
its statistics. Likewise, if a good message is
caught as spam, you must tell PrismEmail that it wasn't spam
so that the Bayesian filter can learn from that. It's very
important that you report missed spam and also report
email that was incorrectly deemed to be spam. Failing to do
this will cause the Bayesian filter to "learn" incorrectly,
leading to poor performance.
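One way to picture what a report does: the message's words are taken out of the wrong pile and added to the right one, so that later probabilities are computed from corrected counts. This is an illustrative sketch only. The Corpus class and the report_correction function are hypothetical, and the text does not say how PrismEmail actually stores or updates its statistics.

```python
from collections import Counter


class Corpus:
    """Per-user message and word counts (hypothetical structure)."""

    def __init__(self):
        self.totals = {"good": 0, "spam": 0}
        self.words = {"good": Counter(), "spam": Counter()}

    def add(self, label, words):
        self.totals[label] += 1
        self.words[label].update(set(words))

    def remove(self, label, words):
        self.totals[label] -= 1
        self.words[label].subtract(set(words))


def report_correction(corpus, words, wrong_label):
    """Move a misclassified message from wrong_label to the other label."""
    right_label = "spam" if wrong_label == "good" else "good"
    corpus.remove(wrong_label, words)
    corpus.add(right_label, words)


corpus = Corpus()
words = ["cheap", "pills"]
corpus.add("good", words)  # the filter wrongly let this spam through
report_correction(corpus, words, wrong_label="good")
print(corpus.totals, corpus.words["spam"]["cheap"])
```

Reporting a false positive is the same call with wrong_label="spam".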
Every message you download from PrismEmail has a link in
the message headers to report that message as spam. If you
receive a message that is spam and has made it through to
you, just click that link in the message header. That's all
it takes to report the message as spam and have the Bayesian
filter learn accordingly.
Likewise, if you notice that a good message was caught as spam,
you can either log in to your PrismEmail account at this website
and mark the mail for downloading, or wait until you
receive the daily spam summary message that lists
all the spam captured in the last 24 hours. If you click the
link to download one of those captured messages, the Bayesian
filter will assume it got it wrong and adjust accordingly.
The good news is that you will have to make these corrections
less and less often the more you use the filter. Once your Bayesian
filter starts to get tuned, it will automatically be able
to detect almost all spam--and the new tricks and new words
that spammers start using will also be noticed by the Bayesian
filter and included in the statistics. So even if spammers start
using new techniques to try to avoid spam filters, your Bayesian
filter will probably adapt to that quickly.
In effect, you just need to give the Bayesian filter a little
help in the beginning. Once you give it a few pushes in the
right direction, you will find that the Bayesian filter starts
teaching itself about the characteristics of your good
email and of spam without your having to report anything manually.
PERFORMANCE IMPROVES OVER TIME
Your Bayesian filter will improve over time. In fact,
when you first start using the Bayesian filter, PrismEmail
won't yet use the Bayesian approach to filter your email. That's
because some amount of good email and spam history must be
accumulated on which to base the decisions. When you first
start there won't be any history, so no decisions can be made.
During this time we suggest you use the traditional filters
offered by PrismEmail, which will probably catch 90% of your
spam. You should report the spams that get through, which
will cause your Bayesian filter to be tuned appropriately.
When sufficient statistical information has been collected,
PrismEmail will start using the Bayesian filter to filter
out spam.
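The start-up behavior described above amounts to a simple switch, sketched here. The function name, the requirement that both good email and spam reach a minimum, and the figure of 100 messages are all hypothetical; the text does not say how much history PrismEmail considers sufficient.

```python
MIN_MESSAGES = 100  # hypothetical; the text gives no actual figure


def choose_filter(good_seen, spam_seen, min_messages=MIN_MESSAGES):
    """Pick which filter should judge incoming mail for this user."""
    if good_seen >= min_messages and spam_seen >= min_messages:
        return "bayesian"
    return "traditional"


print(choose_filter(0, 0))      # traditional
print(choose_filter(250, 180))  # bayesian
```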
Over time, a history of both good email and spam will
be built based on the email you receive. As you report errors
to PrismEmail and as more history is generated, the performance
of the Bayesian filter will improve. According to Paul
Graham, a finely tuned Bayesian filter can filter up to
99.5% of all spam with no false positives. That means if you
are receiving 20 spams per day right now, with a finely tuned
Bayesian filter your spam should drop to about one spam every
10 days or so.
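The "one spam every 10 days" figure follows directly from the 99.5% rate, as this short calculation shows. The function name is made up for illustration.

```python
def days_between_spam(spams_per_day, catch_rate):
    """Average number of days between spams that slip past the filter."""
    missed_per_day = spams_per_day * (1 - catch_rate)
    return 1 / missed_per_day


# 20 spams per day, 99.5% caught: 0.1 spam per day gets through
print(round(days_between_spam(20, 0.995), 1))  # prints 10.0
```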
BENEFITS OF BAYESIAN FILTERING
To you, the email user, the biggest benefit you'll see is
a drastic reduction in the amount of spam you receive. As
just mentioned, instead of receiving 20 spams per day you
might receive one spam every 10 days or so. What a relief!
Another benefit is that every user's Bayesian filter is
"tuned" to that user's email. With one set of traditional
filters for everyone, a spammer who finds a new way
to word his email can sneak it through to every user. Since
everyone has a differently tuned Bayesian filter, it's almost
impossible for a spammer to prepare a spam that will get through
a significant number of differently tuned Bayesian filters.
Paul Graham also talks about the possible
results of the widespread use of Bayesian filters. Less
spam will reach users, which means a lower response rate for
spammers. This means lower profits for spammers and, in turn,
a lower motivation to spam in the first place. Even if there
isn't widespread use of Bayesian filters, the immediate benefit
to those users who use them is clear: Less spam.