DETECTING PHRASE-LEVEL DUPLICATION ON THE WORLD WIDE WEB
Dennis Fetterly, Mark Manasse, Marc Najork (Microsoft Research), SIGIR’05
CSE 450 Web Mining Seminar, presented by Liangjie Hong, March 24th, 2008
BACKGROUND
Types of Spam
- Content Spam
- Link Spam
- Redirection Spam
Content Spam
- Keyword stuffing
- Hidden text
- Meta stuffing
MOTIVATION
Keyword Stuffing
- Page duplication
- Word duplication
- Phrase-level duplication
Characteristics
- Grammatically well-formed
- Generated randomly
- Assembled from various pages
FINDING PHRASE REPLICATION
Representation of Documents
- Document: word, word, word, … (n words)
- Phrases: k-phrase, k-phrase, k-phrase, … (n k-phrases)
- Shingles: fingerprint, fingerprint, fingerprint, … (m fingerprints)
In the authors' implementation, m = 84 and k = 5.
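The pipeline on this slide (words, then k-phrases, then m fingerprints) can be sketched roughly as follows. This is only an illustration under stated assumptions: the slide does not say how the m fingerprints are selected, so the sketch assumes min-wise sampling with m seeded hash functions; it also assumes the document wraps around so that n words yield n phrases, and uses `hashlib.blake2b` as a stand-in for whatever fingerprint function the authors used. The function names are hypothetical.

```python
import hashlib


def words(text):
    """Split a document into lowercase words (a simplistic tokenizer)."""
    return text.lower().split()


def k_phrases(tokens, k=5):
    """Return the n k-word phrases of an n-word document.

    Assumption: the document is treated as circular, so the last
    phrases wrap around to the beginning and n words give n phrases.
    """
    n = len(tokens)
    if n == 0:
        return []
    return [tuple(tokens[(i + j) % n] for j in range(k)) for i in range(n)]


def fingerprint(phrase, seed):
    """64-bit fingerprint of a phrase under a seeded hash.

    Assumption: blake2b stands in for the fingerprint function
    used in the paper, which the slide does not name.
    """
    data = (str(seed) + "\x00" + " ".join(phrase)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def shingles(text, k=5, m=84):
    """Reduce a document to m shingles.

    Assumption: shingle i is the smallest fingerprint of any phrase
    under the i-th seeded hash function (min-wise sampling).
    """
    phrases = k_phrases(words(text), k)
    if not phrases:
        return []
    return [min(fingerprint(p, seed) for p in phrases) for seed in range(m)]
```

With these defaults, `shingles(page_text)` returns a list of 84 integers for any non-empty page, so two pages can be compared position by position regardless of their length.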
FINDING PHRASE REPLICATION
Popular Shingles
- Numbers & letters
- Navigational text
- Copyright notices
- Machine-generated content
FINDING PHRASE REPLICATION
Some Results with Popular Shingles
COVERING SETS
[Diagram: a Document shown alongside several groups of Shingles]
- Finding a minimum-size covering set is NP-hard (its decision version is NP-complete)
- A greedy heuristic is used to approximate it
- The heuristic is more likely to add documents from other hosts
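The greedy approximation mentioned above can be sketched as the standard greedy set-cover heuristic: repeatedly pick the document that covers the most still-uncovered shingles. This is a minimal sketch, not the authors' implementation; `greedy_cover` and its arguments are hypothetical names, and the preference for documents from other hosts is left out because the slide does not say how it is applied.

```python
def greedy_cover(target, others):
    """Greedily approximate a covering set for one document.

    target: set of shingles of the document to be covered.
    others: dict mapping a document id to that document's set of shingles.
    Returns (chosen, uncovered): the ids picked, in order, and any
    shingles of the target that no other document contains.
    """
    uncovered = set(target)
    chosen = []
    while uncovered:
        # Pick the document sharing the most still-uncovered shingles.
        best = max(others, key=lambda d: len(others[d] & uncovered), default=None)
        if best is None or not (others[best] & uncovered):
            break  # nothing left can cover the remaining shingles
        chosen.append(best)
        uncovered -= others[best]
    return chosen, uncovered


target = {1, 2, 3, 4, 5}
others = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {7}}
chosen, uncovered = greedy_cover(target, others)
print(chosen, uncovered)  # ['a', 'c'] set()
```

Shingles left in `uncovered` appear in no other document, so their share of the target gives one rough indication of how much of the page is original.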
COVERING SETS
Two Examples of Covering Sets
COVERING SETS
Some Results about Covering Sets
CONCLUSIONS & FUTURE WORK
- About a third of the pages on the web consist of more replicated than original content
- Pages with a high fraction of non-original phrases typically feature machine-generated content
- Most popular phrases are not very interesting
- The approach provides a way to estimate how original a page's content is