What is spam?
Spam is irrelevant or unsolicited information that reaches us through email or through search engine results.
Email spam: junk email, i.e. unsolicited commercial email carrying advertisements and offers.
Web spam: a page created for the sole purpose of attracting search engine referrals (to this page or to some other “target” page).
4/13/2013
Problem of Spam
Users don’t want spam
Lost productivity
Offensive, embarrassing content
Legitimate messages get lost in the sea of spam
Spam isn’t going away
People buy from spammers
Legislation has not been effective
The SMTP protocol is inadequate
○ It allows spammers to forge message information
Spam is difficult to detect
Spammers learn how to get past filters
Legitimate messages WILL be lost
Spam Categories
Spam Detection: Email Spam
Automated spam filtering: an instance of the document classification problem
First document set: class labels are predefined (spam or legitimate); used as the training set
Second document set: no class labels; used for testing
Problem of Document Classification
Naïve Bayesian Approach
Based on Bayes' theorem and the theorem of total probability.
The probability that an email is spam, given that it contains certain words, equals the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email.
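As a worked illustration of the rule above, the following sketch applies Bayes' theorem to a single word. The three input probabilities are made-up numbers chosen for illustration, not measurements, and the function name `spam_posterior` is hypothetical.

```python
def spam_posterior(p_word_given_spam, p_word_given_legit, p_spam):
    """Return P(spam | word) using Bayes' theorem and total probability."""
    p_legit = 1.0 - p_spam
    # Total probability: P(word) = P(word|spam)P(spam) + P(word|legit)P(legit)
    p_word = p_word_given_spam * p_spam + p_word_given_legit * p_legit
    return p_word_given_spam * p_spam / p_word


# Assumed values for illustration only: the word appears in 40% of spam,
# in 5% of legitimate mail, and half of all mail is spam.
posterior = spam_posterior(0.40, 0.05, 0.50)
print(round(posterior, 3))
```

With these assumed inputs the numerator is 0.40 × 0.50 = 0.20 and the denominator is 0.20 + 0.05 × 0.50 = 0.225, so the posterior is 0.20 / 0.225.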
The naïve Bayesian classifier is based on Bayes' theorem and the theorem of total probability. For an email instance described by a vector of words X = (x1, x2, x3, …, xN), the probability that it belongs to class Ci is

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{\sum_{j} P(X \mid C_j)\,P(C_j)}$$

where j ∈ {Spam, Legitimate}. In practice, the probabilities P(X|Ci) are impossible to estimate without simplifying assumptions, because X can take too many possible values.
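The usual simplifying assumption, which gives the classifier its "naïve" name, is that the words are conditionally independent given the class, so P(X|Ci) becomes a product of per-word probabilities. The sketch below is a minimal illustration under that assumption. The function names, the toy training set, and the use of add-one (Laplace) smoothing are choices made for this example and do not come from the slides. Because the denominator is the same for every class, the code compares numerators only, working in log space.

```python
import math
from collections import Counter

CLASSES = ("spam", "legitimate")


def train(labeled_emails):
    """Count words per class from (text, label) pairs (the training set)."""
    word_counts = {label: Counter() for label in CLASSES}
    class_counts = Counter()
    for text, label in labeled_emails:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab


def classify(text, model):
    """Return the class with the highest log P(X|C) + log P(C).

    Assumes every class appears at least once in the training set.
    """
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in CLASSES:
        score = math.log(class_counts[label] / total_docs)  # log P(C)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing: an assumption of this sketch
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)  # independence: probabilities multiply
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy labeled training set, invented for illustration
training_set = [
    ("win cheap money now", "spam"),
    ("cheap offer win prize", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("project report for review", "legitimate"),
]
model = train(training_set)
print(classify("win money", model))
print(classify("monday meeting", model))
```

A real filter would train on thousands of messages and evaluate on the unlabeled test set; the structure stays the same.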
Spam Detection: Web Spam
Types of Spamming Techniques
Term spamming
Manipulating the text of web pages in order to appear relevant to queries
Link spamming
Creating link structures that boost PageRank or hubs-and-authorities scores
Link Spam
Three kinds of web pages from a spammer’s point of view
Inaccessible pages
Accessible pages
○ e.g., web log comment pages
○ Spammer can post links to his pages
Own pages
○ Completely controlled by the spammer
○ May span multiple domain names
Detecting Spam
Term spamming
Analyze text using statistical methods, e.g., naïve Bayes classifiers
Similar to email spam filtering
Also useful: detecting approximate duplicate pages
Link spamming
Open research area
One approach: TrustRank
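The slides do not say how approximate duplicate pages are detected. One common approach, shown here only as an assumed illustration, is to break each page into overlapping k-word "shingles" and compare the shingle sets with Jaccard similarity. The function names, the shingle size `k=3`, and the 0.5 threshold are hypothetical choices for this sketch.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle, or none if the text is empty
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(page_a, page_b, k=3, threshold=0.5):
    """True if the two pages share at least `threshold` of their shingles."""
    return jaccard(shingles(page_a, k), shingles(page_b, k)) >= threshold


page_1 = "buy cheap watches online today at the best price"
page_2 = "buy cheap watches online today at the lowest price"
page_3 = "the weather is sunny and warm this week"
print(near_duplicates(page_1, page_2))
print(near_duplicates(page_1, page_3))
```

Here the first two pages share 5 of their 9 distinct shingles, so they count as near duplicates at the assumed threshold, while the third page shares none.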
TrustRank
Basic principle: approximate isolation
It is rare for a “good” page to point to a “bad” (spam) page
Sample a set of “seed pages” from the web
Set trust of each trusted page to 1
Propagate trust through links
Each page gets a trust value between 0 and 1
Use a threshold value and mark all pages below the trust threshold as spam
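The steps above can be sketched as a simple iterative propagation. This is a simplified illustration, not the published TrustRank formulation: the damping factor `beta`, the equal splitting of a page's trust among its outgoing links, the cap at 1, the fixed iteration count, and all function names are assumptions made for this example.

```python
def propagate_trust(out_links, seeds, beta=0.85, iterations=20):
    """Propagate trust from trusted seed pages along outgoing links.

    out_links maps each page to the list of pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    # Trusted seed pages start at 1, every other page at 0
    trust = {p: (1.0 if p in seeds else 0.0) for p in pages}
    for _ in range(iterations):
        incoming = {p: 0.0 for p in pages}
        for page, targets in out_links.items():
            if not targets:
                continue
            # Dampen the trust and split it among the outgoing links
            share = beta * trust[page] / len(targets)
            for target in targets:
                incoming[target] += share
        # Seeds stay pinned at 1; other values are capped to stay in [0, 1]
        trust = {p: (1.0 if p in seeds else min(1.0, incoming[p]))
                 for p in pages}
    return trust


def mark_spam(trust, threshold):
    """Return the pages whose trust falls below the threshold."""
    return {p for p, t in trust.items() if t < threshold}


# Toy web graph, invented for illustration: "x" and "y" only link to each other
web = {"seed": ["a"], "a": ["b"], "b": [], "x": ["y"], "y": ["x"]}
trust = propagate_trust(web, seeds={"seed"})
print(sorted(mark_spam(trust, threshold=0.1)))
```

In this toy graph no trusted page links to "x" or "y", so their trust stays at 0 and they fall below the threshold.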
Anti-Trust Approach
Broadly based on the same “approximate isolation” principle
This principle also implies that pages pointing to spam pages are very likely to be spam pages themselves
Anti-trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages
A page can be classified as spam if its Anti-Trust Rank value exceeds a chosen threshold
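The reverse propagation can be sketched in the same style. As before, this is a simplified illustration rather than the published algorithm: the damping factor `beta`, splitting a page's anti-trust equally among the pages that link to it, the cap at 1, and the function names are assumptions of this example.

```python
def propagate_anti_trust(out_links, spam_seeds, beta=0.85, iterations=20):
    """Propagate anti-trust backwards from spam seeds along incoming links.

    out_links maps each page to the list of pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    # In-degree: a page's anti-trust is split among the pages linking to it
    in_degree = {p: 0 for p in pages}
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1
    anti = {p: (1.0 if p in spam_seeds else 0.0) for p in pages}
    for _ in range(iterations):
        received = {p: 0.0 for p in pages}
        for page, targets in out_links.items():
            for target in targets:
                # Reverse flow: from the linked-to page to the linking page
                received[page] += beta * anti[target] / in_degree[target]
        # Spam seeds stay pinned at 1; other values are capped at 1
        anti = {p: (1.0 if p in spam_seeds else min(1.0, received[p]))
                for p in pages}
    return anti


def mark_spam_by_anti_trust(anti, threshold):
    """Return the pages whose anti-trust exceeds the threshold."""
    return {p for p, value in anti.items() if value > threshold}


# Toy web graph, invented for illustration: "b" links to "a", "a" links to "s"
web = {"b": ["a"], "a": ["s"], "s": [], "g": ["h"], "h": []}
anti = propagate_anti_trust(web, spam_seeds={"s"})
print(sorted(mark_spam_by_anti_trust(anti, threshold=0.5)))
```

Pages "a" and "b" lead to the spam seed "s" and so accumulate anti-trust, while "g" and "h" have no path to it and stay at 0.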