Spam Detection Technique

Published on June 2016

A presentation on spam detection techniques


Koushik Mandal Jadavpur University

4/13/2013

What is spam?
Spam is irrelevant or unsolicited information that reaches us through email or search engine results.
• Email spam: junk email, unsolicited commercial email carrying advertisements and offers
• Web spam: a page created for the sole purpose of attracting search engine referrals (to this page or some other "target" page)

Problem of Spam

• Users don't want spam
  ○ Lost productivity
  ○ Offensive, embarrassing
  ○ Legitimate messages get lost in the sea of spam
• Spam isn't going away
  ○ People buy from spammers
  ○ Legislation has not been effective
  ○ The SMTP protocol is inadequate: it allows spammers to forge message information
• Spam is difficult to detect
  ○ Spammers learn how to get past filters
  ○ Legitimate messages WILL be lost

Spam Categories



Spam Detection: Email Spam
Automated spam filtering is an instance of the document classification problem.
• First document set: predefined classes (spam or legitimate), used as the training set
• Second document set: no class labels, used for testing

Problem of Document Classification


Naïve Bayesian Approach

• Based on Bayes' theorem and the theorem of total probability.
• The probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email.
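Written as a formula, the statement above reads as follows. Here S stands for "the email is spam", L for "the email is legitimate", and W for "the email contains the given words". The numbers in the worked line are invented purely for illustration and do not come from the slides.

```latex
P(S \mid W) = \frac{P(W \mid S)\, P(S)}{P(W)},
\qquad
P(W) = P(W \mid S)\, P(S) + P(W \mid L)\, P(L)

% Illustrative values (assumed): P(W|S) = 0.4, P(W|L) = 0.1, P(S) = P(L) = 0.5
P(W) = 0.4 \cdot 0.5 + 0.1 \cdot 0.5 = 0.25,
\qquad
P(S \mid W) = \frac{0.4 \cdot 0.5}{0.25} = 0.8
```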




The Naïve Bayesian classifier is based on Bayes' theorem and the theorem of total probability. For an email instance represented by a vector of words X = (x1, x2, x3, …, xN), the probability that it belongs to class Ci is

$$P(C_i \mid X) = \frac{P(C_i)\, P(X \mid C_i)}{\sum_{j} P(C_j)\, P(X \mid C_j)}$$

where j ∈ {Spam, Legitimate}. In practice, the probabilities P(X | Ci) are impossible to estimate without simplifying assumptions, because X can take too many possible values.
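The usual simplifying assumption, which gives the classifier its "naïve" name, is that words are conditionally independent given the class, so P(X | Ci) becomes a product of per-word probabilities. A minimal sketch of that idea follows. The function names, the toy training set, the whitespace tokenizer, and the add-one (Laplace) smoothing are all assumptions made for illustration; none of them come from the slides.

```python
import math
from collections import Counter

def train(docs):
    """docs is a list of (text, label) pairs; returns priors, per-class word counts, vocabulary."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    total = sum(class_docs.values())
    priors = {c: n / total for c, n in class_docs.items()}
    return priors, word_counts, vocab

def posterior(text, priors, word_counts, vocab):
    """Return P(class | words) for every class, under the independence assumption."""
    log_scores = {}
    for c, prior in priors.items():
        total_words = sum(word_counts[c].values())
        score = math.log(prior)
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored (assumption)
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
            score += math.log((word_counts[c][word] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    # Normalizing over classes plays the role of the total-probability denominator.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

# Hypothetical toy training set, for illustration only.
TRAINING = [
    ("cheap pills offer now", "spam"),
    ("limited offer buy cheap", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("project report attached", "legitimate"),
]

priors, word_counts, vocab = train(TRAINING)
print(posterior("cheap offer", priors, word_counts, vocab))
```

Working in log space avoids numeric underflow when a message contains many words; the final normalization turns the log scores back into probabilities that sum to 1.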


Spam Detection: Web Spam

Types of Spamming Techniques

• Term spamming
  ○ Manipulating the text of web pages in order to appear relevant to queries
• Link spamming
  ○ Creating link structures that boost PageRank or hubs-and-authorities scores

Link Spam

Three kinds of web pages from a spammer's point of view:
• Inaccessible pages
• Accessible pages
  ○ e.g., web log comments pages
  ○ Spammer can post links to his pages
• Own pages
  ○ Completely controlled by spammer
  ○ May span multiple domain names

Detecting Spam

• Term spamming
  ○ Analyze text using statistical methods, e.g., Naïve Bayes classifiers
  ○ Similar to email spam filtering
  ○ Also useful: detecting approximate duplicate pages
• Link spamming
  ○ Open research area
  ○ One approach: TrustRank
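For the approximate-duplicate detection mentioned under term spamming, one common approach is to compare pages by their overlapping word sequences ("shingles") using Jaccard similarity. The slides do not name a method, so this choice, the shingle size `k=3`, and the sample page texts are all assumptions for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical page texts, for illustration only.
page_a = "buy cheap watches online best cheap watches store"
page_b = "buy cheap watches online best cheap watches shop"
page_c = "jadavpur university department of computer science and engineering"

print(jaccard(shingles(page_a), shingles(page_b)))  # near-duplicates score high
print(jaccard(shingles(page_a), shingles(page_c)))  # unrelated pages score low
```

A pair of pages whose similarity exceeds some chosen cutoff would then be treated as near-duplicates; the cutoff itself has to be tuned for the collection at hand.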

TrustRank

• Basic principle: approximate isolation
  ○ It is rare for a "good" page to point to a "bad" (spam) page
• Sample a set of "seed pages" from the web
• Set the trust of each trusted seed page to 1
• Propagate trust through links
  ○ Each page gets a trust value between 0 and 1
• Use a threshold value and mark all pages below the trust threshold as spam
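The steps above can be sketched as a damped propagation over the link graph. This is a simplified illustration, not the published algorithm in full: the damping factor, the iteration count, the equal split of trust across outgoing links, the rescaling so the highest score equals 1, and the tiny example graph are all assumptions.

```python
def trust_rank(out_links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links; scores end up in [0, 1]."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    # Trusted seed pages start with trust 1, every other page with 0.
    seed = {p: 1.0 if p in trusted_seeds else 0.0 for p in pages}
    trust = dict(seed)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * seed[p] for p in pages}
        for page, targets in out_links.items():
            if targets:
                # A page splits its (damped) trust equally among the pages it links to.
                share = damping * trust[page] / len(targets)
                for target in targets:
                    nxt[target] += share
        trust = nxt
    top = max(trust.values(), default=0.0)
    if top > 0:
        trust = {p: t / top for p, t in trust.items()}  # rescale so the maximum is 1
    return trust

def pages_below(trust, threshold):
    """Pages whose trust falls below the threshold, i.e. the ones marked as spam."""
    return sorted(p for p, t in trust.items() if t < threshold)

# Hypothetical link graph: spam pages link to good pages, but not the reverse.
WEB = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "good3": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}

scores = trust_rank(WEB, trusted_seeds={"good1"})
print(pages_below(scores, threshold=0.1))
```

Because no good page links to a spam page in this graph, no trust ever reaches `spam1` or `spam2`, which is the approximate isolation principle at work.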


Anti-Trust Approach

• Broadly based on the same "approximate isolation" principle
• This principle also implies that pages pointing to spam pages are very likely to be spam pages themselves
• Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages
• A page is classified as spam if its Anti-Trust Rank value exceeds a chosen threshold
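The reverse propagation described above can be sketched by building the incoming-link graph and pushing scores from the spam seeds back to the pages that link to them. As before, the damping factor, iteration count, equal split across incoming links, rescaling to a maximum of 1, and the example graph are assumptions for illustration.

```python
def anti_trust_rank(out_links, spam_seeds, damping=0.85, iterations=50):
    """Propagate anti-trust from spam seeds backwards along incoming links."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    # Reverse the graph: in_links[q] lists the pages that link to q.
    in_links = {p: [] for p in pages}
    for page, targets in out_links.items():
        for target in targets:
            in_links[target].append(page)
    seed = {p: 1.0 if p in spam_seeds else 0.0 for p in pages}
    score = dict(seed)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * seed[p] for p in pages}
        for page, sources in in_links.items():
            if sources:
                # A page passes its (damped) anti-trust to the pages pointing at it.
                share = damping * score[page] / len(sources)
                for source in sources:
                    nxt[source] += share
        score = nxt
    top = max(score.values(), default=0.0)
    if top > 0:
        score = {p: s / top for p, s in score.items()}  # rescale so the maximum is 1
    return score

def pages_above(score, threshold):
    """Pages whose anti-trust exceeds the threshold, i.e. the ones classified as spam."""
    return sorted(p for p, s in score.items() if s > threshold)

# Hypothetical link graph: "farm" is a link-farm page pointing at a known spam page.
WEB = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
    "farm": ["spam1"],
}

scores = anti_trust_rank(WEB, spam_seeds={"spam1"})
print(pages_above(scores, threshold=0.2))
```

Only `spam1` is given as a seed, yet `spam2` and `farm` also pick up anti-trust because they link to it, while the good pages stay at zero.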


Q&A


Thank you
