Programming Python

Published January 2017

DETECTING PHRASE-LEVEL DUPLICATION ON THE WORLD WIDE WEB
Dennis Fetterly, Mark Manasse, Marc Najork
Microsoft Research
SIGIR’05

CSE 450 Web Mining Seminar
Presented by Liangjie Hong
March 24th, 2008

BACKGROUND
- Types of Spam
  - Content spam
  - Link spam
  - Redirection spam

- Content Spam
  - Keyword stuffing
  - Hidden text
  - Meta stuffing


MOTIVATION
- Keyword Stuffing
  - Page duplication
  - Word duplication
  - Phrase-level duplication

- Characteristics
  - Grammatically well-formed
  - Generated randomly
  - Assembled from various pages

FINDING PHRASE REPLICATION
- Representation of Documents
  - Document: n words (word, word, word, …)
  - k-phrases: n phrases of k consecutive words (k-phrase, k-phrase, k-phrase, …)
  - Shingles: m fingerprints (fingerprint, fingerprint, fingerprint, …)
  - In the authors' experiments, m = 84 and k = 5
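
The document representation above can be sketched in Python. This is only an illustration, not the authors' implementation: the slide does not say how the m fingerprints are chosen, so the sketch assumes min-wise hashing (for each of m salted hash functions, keep the smallest fingerprint over all k-phrases). The function names and the use of `hashlib.blake2b` as the fingerprint function are assumptions as well. The sketch also does not wrap around at the end of the document, so it yields n - k + 1 phrases rather than n.

```python
import hashlib


def k_phrases(words, k=5):
    """Return every run of k consecutive words (no wrap-around)."""
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def fingerprint(phrase, seed):
    """64-bit fingerprint of a phrase under hash function number `seed`.

    Assumption: a salted BLAKE2b digest stands in for whatever
    fingerprinting scheme the original work used.
    """
    digest = hashlib.blake2b(
        phrase.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def shingles(text, k=5, m=84):
    """Reduce a document to m shingles.

    Assumption: shingle i is the minimum fingerprint over all k-phrases
    under hash function i (min-wise hashing).
    """
    words = text.lower().split()
    phrases = k_phrases(words, k)
    if not phrases:
        return []  # document shorter than k words
    return [min(fingerprint(p, seed) for p in phrases) for seed in range(m)]
```

With this representation, two documents that share many k-phrases will tend to agree on many of their m shingles, which is what makes shingle-by-shingle comparison across a large collection practical.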


FINDING PHRASE REPLICATION
- Popular Shingles
  - Numbers and letters
  - Navigational text
  - Copyright notices
  - Machine-generated content
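
One way to find popular shingles is to count how many documents contain each one. The sketch below is an assumption about how this could be done on a small in-memory collection; the names `shingle_popularity`, `popular_shingles`, and the `min_docs` threshold are hypothetical and do not come from the slides.

```python
from collections import Counter


def shingle_popularity(doc_shingles):
    """Count, for each shingle, how many documents contain it.

    doc_shingles: dict mapping a document id to an iterable of shingles.
    """
    counts = Counter()
    for shingle_list in doc_shingles.values():
        # A shingle counts once per document, even if it repeats inside it.
        counts.update(set(shingle_list))
    return counts


def popular_shingles(doc_shingles, min_docs=2):
    """Return the shingles that occur in at least `min_docs` documents."""
    counts = shingle_popularity(doc_shingles)
    return {s for s, c in counts.items() if c >= min_docs}
```

For example, with `docs = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 5]}`, shingle 3 appears in all three documents and `popular_shingles(docs, min_docs=3)` contains only that shingle.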


FINDING PHRASE REPLICATION
- Some Results with Popular Shingles


COVERING SETS
[Diagram: a document and several groups of shingles (Shingle, Shingle, Shingle, …)]

- Finding a minimum-size covering set is NP-complete
- A greedy heuristic is used to approximate it
- The heuristic is more likely to add documents from other hosts
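
The greedy approximation mentioned above can be sketched as the standard greedy set-cover heuristic: repeatedly pick the document that covers the most still-uncovered shingles. This is a minimal sketch under stated assumptions, not the authors' code. The function name and data layout are hypothetical, ties are broken by document id only to keep the result deterministic, and the preference for documents from other hosts is not modeled.

```python
def greedy_covering_set(target, candidates):
    """Greedily choose documents whose shingles cover `target`.

    target: set of shingles of the document under study.
    candidates: dict mapping a document id to the set of shingles
        of some other document.
    Returns (chosen ids in pick order, shingles left uncovered).
    """
    uncovered = set(target)
    chosen = []
    while uncovered:
        best_id, best_gain = None, 0
        # Sorted ids make tie-breaking deterministic (first id wins).
        for doc_id in sorted(candidates):
            gain = len(uncovered & candidates[doc_id])
            if gain > best_gain:
                best_id, best_gain = doc_id, gain
        if best_id is None:
            break  # no candidate covers any remaining shingle
        chosen.append(best_id)
        uncovered -= candidates[best_id]
    return chosen, uncovered
```

Shingles that remain uncovered at the end appear in no other candidate document, so the fraction of covered shingles gives a rough measure of how much of the document is replicated elsewhere.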


COVERING SETS
- Two Examples of Covering Sets


COVERING SETS
- Some Results about Covering Sets


CONCLUSIONS & FUTURE WORK
- A third of the pages on the web consist of more replicated than original content
- Pages with a high fraction of non-original phrases typically feature machine-generated content
- Most popular phrases are not very interesting
- Provides a way to estimate how original the content is
- Cannot distinguish legitimate from spam content!
