Spam Deobfuscation using a Hidden Markov Model

Authors: Honglak Lee, Andrew Y. Ng (2005)

Abstract

To circumvent spam filters, many spammers attempt to obfuscate their emails by deliberately misspelling words or introducing other errors into the text. For example, "viagra" may be written "vigra", or "mortgage" written "m0rt gage". Even though humans have little difficulty reading obfuscated emails, most content-based filters are unable to recognize these obfuscated spam words. In this paper, we present a hidden Markov model for deobfuscating spam emails. We empirically demonstrate that our model is robust to many types of obfuscation, including misspellings, incorrect segmentations (adding/removing spaces), and substitutions/insertions of non-alphabetic characters.
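
To give a rough sense of how a hidden Markov model can undo character substitutions, here is a minimal sketch. It is not the model from the paper, whose structure the abstract does not describe. It treats the intended letters as hidden states and the observed characters as emissions, then runs Viterbi decoding. The two-word lexicon, the look-alike table, and the emission values (0.9, 0.09, 1e-4, which are rough scores and not carefully normalized) are all assumptions chosen for illustration, and the sketch handles substitutions only, not insertions, deletions, or changed spacing.

```python
import math
import string
from collections import defaultdict

STATES = string.ascii_lowercase
LEXICON = ["mortgage", "viagra"]  # hypothetical toy lexicon
# Assumed look-alike characters a spammer might substitute for each letter.
LOOKALIKES = {"a": "@4", "e": "3", "i": "1!", "l": "1", "o": "0", "s": "$5"}


def train_bigrams(words, alpha=0.01):
    """Estimate smoothed log start and letter-to-letter transition probabilities."""
    start = defaultdict(float)
    trans = defaultdict(lambda: defaultdict(float))
    for w in words:
        start[w[0]] += 1
        for a, b in zip(w, w[1:]):
            trans[a][b] += 1
    n = len(STATES)
    start_total = sum(start.values()) + alpha * n
    log_start = {s: math.log((start[s] + alpha) / start_total) for s in STATES}
    log_trans = {}
    for a in STATES:
        total = sum(trans[a].values()) + alpha * n
        log_trans[a] = {b: math.log((trans[a][b] + alpha) / total) for b in STATES}
    return log_start, log_trans


def log_emit(state, obs):
    """Assumed emission scores: exact match, known look-alike, anything else."""
    if obs == state:
        return math.log(0.9)
    if obs in LOOKALIKES.get(state, ""):
        return math.log(0.09)
    return math.log(1e-4)


def decode(observed, log_start, log_trans):
    """Viterbi decoding: most likely intended letter for each observed character."""
    obs = observed.lower()
    if not obs:
        return ""
    score = {s: log_start[s] + log_emit(s, obs[0]) for s in STATES}
    backptrs = []
    for ch in obs[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            prev = max(STATES, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit(s, ch)
            ptr[s] = prev
        score = new_score
        backptrs.append(ptr)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


log_start, log_trans = train_bigrams(LEXICON)
print(decode("m0rtgage", log_start, log_trans))  # mortgage
print(decode("v1agra", log_start, log_trans))    # viagra
```

Because each observed character maps to exactly one hidden letter, this sketch cannot recover "vigra" (a deleted letter) or "m0rt gage" (an inserted space); handling those cases would need a richer state space than the one assumed here.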
