Structured data
• So far we have been dealing with structured data:
Attribute Value
Outlook Sunny
Temperature Hot
Windy Yes
Humidity High
Play Yes
• Data mining generally uses data of this kind
5/12/2014
Complex Data Types
• Complex data is becoming increasingly common
• Spatial data: geographic data, medical & satellite images
• Multimedia data: images, audio, & video
• Time-series data: banking data & stock exchange data
• Text data: word descriptions for objects
• World-Wide-Web: highly unstructured text & multimedia data
Text Databases
• In practice there are many text databases:
• news articles
• research papers
• books
• digital libraries
• e-mail
• web pages
• Growing rapidly in both volume and importance (80%)
Text Mining
• Text mining refers to data mining that uses text documents as its data
• Almost all text mining tasks use Information Retrieval (IR) methods to preprocess text documents.
• These methods differ somewhat from the preprocessing methods used for relational tables
• Web search also has its roots in IR
CS583, Bing Liu,
UIC
Definition of Text Mining
• Discover useful and previously unknown “gems” of information in large text collections
Definition of Text Mining
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
Text Mining = Data Mining (applied to text data) + basic linguistics
Definition
• “Previously unknown”?
• Strict definition
• Information that even the author does not know
• Example: discovering a new method for hair growth that is a side effect of some procedure
• Loose definition
• Rediscovering information that the author has written in the text
• Example: automatically extracting product names from a web page
Text Mining Tasks
• Given:
• A source of textual documents
• A well-defined, limited (text-based) query
• Find:
• Sentences with relevant information
• Extract the relevant information & ignore the irrelevant information
• Link related information & output it in a predetermined format
Tasks addressed by TM
• Search and retrieval
• Semantic analysis
• Clustering
• Categorization
• Feature extraction
• Ontology building
• Dynamic focusing
DM vs TM
• Object of investigation. Data Mining: numerical and categorical data. Text Mining: texts.
• Object structure. Data Mining: relational databases. Text Mining: free form texts.
• Goal. Data Mining: predict outcomes of future situations. Text Mining: retrieve relevant information, distill the meaning, categorize and target-deliver.
• Methods. Data Mining: machine learning (SKAT, DT, NN, GA, MBR, MBA). Text Mining: indexing, special neural network processing, linguistics, ontologies.
• Current market size. Data Mining: 100,000 analysts at large and midsize companies. Text Mining: 100,000,000 corporate workers and individual users.
• Maturity. Data Mining: broad implementation since 1994. Text Mining: broad implementation starting 2000.
“Search” vs “Discover”
• Search (goal-oriented), structured data: Data Retrieval
• Search (goal-oriented), unstructured data (text): Information Retrieval
• Discover (opportunistic), structured data: Data Mining
• Discover (opportunistic), unstructured data (text): Text Mining
Applications of Text Mining
• Marketing: discovering groups of potential buyers based on users' text profiles
• e.g., amazon
• Industry: identifying the websites of competitor groups
• Competitors' products and their prices
• Job search: identifying parameters in job searches
• www.flipdog.com
Applications of Text Mining
• Search engines
• Enterprise portals
• Knowledge management systems
• e-Business systems
• Vertical applications:
• e-mail categorization and routing
• Call center notes categorization
• CRM systems
[Figure: document preprocessing for an inverted file system; * indicates an optional operation]
• assign document IDs → document numbers and *field numbers
• break into tokens → tokens
• stop list* → non-stoplist tokens
• stemming* → stemmed terms
• term weighting* → terms with weights
• Inverted file system
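The preprocessing steps in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the `crude_stem` suffix stripper (a stand-in for a real stemmer), the tf-idf weighting, and the three toy documents are all hypothetical choices, not taken from the slides.

```python
import math
import re
from collections import Counter, defaultdict

# Toy stop list (assumption); real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "to", "and"}


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # break into tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop list (optional)
    return [crude_stem(t) for t in tokens]                # stemming (optional)


def build_inverted_index(docs):
    """Map each term to {doc_id: weight}, using tf-idf as the term weight."""
    term_counts = {doc_id: Counter(preprocess(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            index[term][doc_id] = tf * math.log(n_docs / doc_freq[term])
    return dict(index)


# Document IDs are assigned by hand here (the dictionary keys).
docs = {
    1: "Mining the web for links",
    2: "Text mining of web pages",
    3: "The king and the rabbit",
}
index = build_inverted_index(docs)
print(sorted(index["web"]))  # document IDs that contain "web"
```

Looking up a term in `index` returns the documents that contain it together with a weight, which is the basic operation behind keyword search.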
[Figure: Text Mining workflow. Labels: Sample Documents, Text document, Transformed Representation, models, Learning, Working, Domain specific templates/models, Visualizations]
Text characteristics: Outline
• Large textual data base
• High dimensionality
• Several input modes
• Dependency
• Ambiguity
• Noisy data
• Not well structured text
Text characteristics
• Large textual data base
• Efficiency consideration
• over 2,000,000,000 web pages
• almost all publications are also in electronic form
• High dimensionality (Sparse input)
• Consider each word/phrase as a dimension
• Several input modes
• e.g., Web mining: information about user is generated
by semantics, browse pattern and outside
knowledgebase.
Text characteristics
• Dependency
• relevant information is a complex conjunction of
words/phrases
• e.g., Document categorization.
Pronoun disambiguation.
• Ambiguity
• Word ambiguity
• Pronouns (he, she …)
• “buy”, “purchase”
• Semantic ambiguity
• The king saw the rabbit with his glasses. (8 meanings)
Text characteristics
• Noisy data
• Example: Spelling mistakes
• Not well structured text
• Chat rooms
• “r u available ?”
• “Hey whazzzzzz up”
• Speech
Text mining process
• Text preprocessing
• Syntactic/Semantic text
analysis
• Features Generation
• Bag of words
• Features Selection
• Simple counting
• Statistics
• Text/Data Mining
• Classification: supervised learning
• Clustering: unsupervised learning
• Analyzing results
Syntactic / Semantic text analysis
• Part of Speech (pos) tagging
• Find the corresponding pos for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
• ~98% accurate.
• Word sense disambiguation
• Context based or proximity based
• Very accurate
• Parsing
• Generates a parse tree (graph) for each sentence
• Each sentence is a stand alone graph
Feature Generation: Bag of words
• Text document is represented by the words it contains
(and their occurrences)
• e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
• Highly efficient
• Makes learning far simpler and easier
• Order of words is not that important for certain applications
• Stemming: identifies a word by its root
• e.g., flying, flew → fly
• Reduce dimensionality
• Stop words: The most common words are unlikely to help
text mining
• e.g., “the”, “a”, “an”, “you” …
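A minimal sketch of the bag-of-words representation, assuming a simple regular-expression tokenizer and a small stop list. The function name `bag_of_words` and the stop list contents are illustrative assumptions; "of" is added to the stop list here only for the example.

```python
import re
from collections import Counter

# Illustrative stop list (assumption); the slide lists "the", "a", "an", "you", ...
STOP_WORDS = frozenset({"the", "a", "an", "you", "of"})


def bag_of_words(text, stop_words=frozenset()):
    """Count word occurrences, ignoring word order and, optionally, stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


bag = bag_of_words("Lord of the rings")
filtered = bag_of_words("Lord of the rings", STOP_WORDS)
same_bag = bag_of_words("rings of the Lord") == bag  # word order is discarded
print(bag, filtered, same_bag)
```

Because only the counts are kept, two texts with the same words in a different order get the same representation, which is the trade-off the slide refers to.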
Feature Generation: D2K Example
Original text:
Hi,
Here is your weekly update (that unfortunately hasn't gone out in about a month). Not much action here right now.
1) Due to the unwavering insistence of a member of the group, the ncsa.d2k.modules.core.datatype package is now completely independent of the d2k application.
2) Transformations are now handled differently in Tables. Previously, transformations were done using a TransformationModule. That module could then be added to a list that an ExampleTable kept. Now, there is an interface called Transformation and a sub-interface called ReversibleTransformation.
After lowercasing and stop-word removal:
hi, weekly update (that unfortunately gone out month). much action here right now. 1) due unwavering insistence member group, ncsa.d2k.modules.core.datatype package now completely independent d2k application. 2) transformations now handled differently tables. previously, transformations done using transformationmodule. module added list exampletable kept. now, interface called transformation sub-interface called reversibletransformation.
After stemming:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules core datatype package now complete independence d2k application 2 transformation now handle different table previous transformation do use transformationmodule module add list exampletable keep now interface call transformation sub-interface call reversibletransformation
Feature Generation: XML
• Current keyword-oriented search engines cannot handle rich queries like
• Find all books authored by “Scooby-Doo”.
• XML: Extensible Markup Language
• XML documents have a nested structure in which each element is associated with a tag.
• Tags describe the semantics of elements.
<book> <title> The making of a bad movie </title>
<author> <name> Scooby-Doo </name>
<affiliation> Cartoons </affiliation> </author>
</book>
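Because the tags carry the semantics, the query "find all books authored by Scooby-Doo" becomes a walk over the element tree. The sketch below uses Python's standard `xml.etree.ElementTree`; the `<library>` wrapper element, the second book, and the helper name `titles_by_author` are assumptions added so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# The first <book> is the slide's example; the wrapper and second book are made up.
XML = """<library>
<book> <title> The making of a bad movie </title>
<author> <name> Scooby-Doo </name>
<affiliation> Cartoons </affiliation> </author>
</book>
<book> <title> Another title </title>
<author> <name> Someone Else </name> </author>
</book>
</library>"""


def titles_by_author(xml_text, author):
    """Return the titles of all <book> elements with a matching <author>/<name>."""
    root = ET.fromstring(xml_text)
    titles = []
    for book in root.iter("book"):
        names = [n.text.strip() for n in book.findall("./author/name") if n.text]
        if author in names:
            titles.append(book.findtext("title", default="").strip())
    return titles


result = titles_by_author(XML, "Scooby-Doo")
print(result)
```

A plain keyword search for "Scooby-Doo" would also match books that merely mention the name, whereas the tag structure restricts the match to the author field.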
Feature selection
• Reduce dimensionality
• Learners have difficulty addressing tasks with high
dimensionality
• Irrelevant features
• Not all features help!
• e.g., the existence of a noun in a news article is unlikely to help
classify it as “politics” or “sport”
Feature selection: D2K Example I
Terms before selection:
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, 1, due, unwaver, insistence, member, group, ncsa, d2k, modules, do, core, datatype, package, complete, independence, application, 2, transformation, handle, different, table, previous, use, transformationmodule, add, list, exampletable, keep, interface, call, sub-interface, reversibletransformation
Terms after selection:
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, due, insistence, member, group, ncsa, d2k, modules, do, core, datatype, package, complete, independence, application, transformation, handle, different, table, previous, use, add, list, keep, interface, call, sub-interface
Feature selection: D2K Example II
Terms after the first selection step:
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, due, insistence, member, group, ncsa, d2k, modules, do, core, datatype, package, complete, independence, application, transformation, handle, different, table, previous, use, add, list, keep, interface, call, sub-interface
Terms after the second selection step:
hi, week, update, unfortunate, month, action, right, due, insistence, member, group, ncsa, d2k, modules, core, datatype, package, complete, independence, application, transformation, handle, different, table, previous, add, list, interface, call, sub-interface
Text Mining: Classification definition
• Given: a collection of labeled records (training set)
• Each record contains a set of features (attributes), and the
true class (label)
• Find: a model for the class as a function of the
values of the features
• Goal: previously unseen records should be
assigned a class as accurately as possible
• A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it
Text Mining: Clustering definition
• Given: a set of documents and a similarity measure
among documents
• Find: clusters such that:
• Documents in one cluster are more similar to one another
• Documents in separate clusters are less similar to one another
• Goal:
• Finding a correct set of documents
Similarity Measures:
• Euclidean Distance if attributes are continuous
• Other Problem-specific Measures
• e.g., how many words are common in these documents
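Both kinds of similarity measure mentioned above fit in a few lines. This is a minimal sketch: the function names and sample inputs are illustrative, and whitespace splitting is assumed to be an adequate tokenizer for the example.

```python
import math


def euclidean_distance(x, y):
    """Distance between two documents given as equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def common_words(doc_a, doc_b):
    """Problem-specific measure: number of distinct words the documents share."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


d = euclidean_distance([0.0, 3.0], [4.0, 0.0])
shared = common_words("the game in London", "a game in Italy")
print(d, shared)
```

Note the direction of each measure: a smaller Euclidean distance means more similar documents, while a larger common-word count means more similar documents.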
Example
GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from
Atlanta, Ga.
I did a lot of research last year
before I bought this camera...
It kinda hurt to leave behind
my beloved nikon 35mm
SLR, but I was going to Italy,
and I needed something
smaller, and digital.
The pictures coming out of this
camera are amazing. The
'auto' feature takes great
pictures most of the time. And
with digital, you're not
wasting film if the picture
doesn't come out. …
Summary:
Feature1: picture
Positive: 12
• The pictures coming out of this camera
are amazing.
• Overall this is a good camera with a
really good picture clarity.
…
Negative: 2
• The pictures come out hazy if your
hands shake even for a moment during
the entire process of taking a picture.
• Focusing on a display rack about 20 feet
away in a brightly lit room during day
time, pictures produced by this camera
were blurry and in a shade of orange.
Feature2: battery life
…
Visual Comparison
[Figure: bar charts of positive (+) and negative (_) opinions per feature (Picture, Battery, Zoom, Size, Weight): a summary of reviews of Digital camera 1, and a comparison of reviews of Digital camera 1 and Digital camera 2]
Information Extraction
Posting from Newsgroup
Telecommunications. Solaris Systems
Administrator. 55-60K. Immediate need.
3P is a leading telecommunications firm
in need of a energetic individual to
fill the following position in the
Atlanta office:
SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia no relocation
assistance provided
FILLED TEMPLATE
job title: SOLARIS SYSTEM ADMINISTRATOR
salary: 55-60K
city: Atlanta
state: Georgia
platform: SOLARIS
area: Telecommunications
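As a rough illustration of template filling, the sketch below pulls three of the slots out of the posting with hand-written regular expressions. The patterns and the function name `fill_template` are assumptions made for this example only; they are tuned to this one posting and are not how a general information extraction system would be built.

```python
import re

POSTING = """Telecommunications. Solaris Systems
Administrator. 55-60K. Immediate need.
3P is a leading telecommunications firm
in need of a energetic individual to
fill the following position in the
Atlanta office:
SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia no relocation
assistance provided"""


def fill_template(text):
    """Fill a few template slots using hand-written patterns (illustrative only)."""
    template = {}
    # First salary-like range in the text, e.g. "55-60K".
    salary = re.search(r"\b\d+-\d+K\b", text)
    if salary:
        template["salary"] = salary.group(0)
    # "Location: City, State" line.
    location = re.search(r"Location:\s*([A-Z][a-z]+),\s*([A-Z][a-z]+)", text)
    if location:
        template["city"], template["state"] = location.groups()
    # First line written entirely in capitals is taken as the job title.
    title = re.search(r"^[A-Z][A-Z ]+$", text, flags=re.MULTILINE)
    if title:
        template["job title"] = title.group(0)
    return template


template = fill_template(POSTING)
print(template)
```

The posting mentions two salary ranges (55-60K and 50-60K); this sketch simply takes the first one, which matches the filled template on the slide.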
Classification: An Example
[Figure: a training set of 10 examples with attributes Country, Marital Status and Income and the class label Hooligan (Yes/No) is used to learn a classifier (model); the model is then applied to a test set with the same attributes whose Hooligan labels are unknown (?)]
Text Classification: An Example
[Figure: a training set of text documents, each labeled Hooligan Yes/No, is used to learn a classifier (model), which is then applied to a test set of unlabeled documents]
Training set documents:
• An English football fan …
• During a game in Italy …
• England has been beating France …
• Italian football fans were cheering …
• An average USA salesman earns 75K
• The game in London was horrific
• Manchester city is likely to win the championship
• Rome is taking the lead in the football league
Test set documents (Hooligan = ?):
• A Danish football fan
• Turkey is playing vs. France. The Turkish fans …
Web Mining
Data mining – Computer Science, IPB
[Figure: WWW → Web Mining → Knowledge]
Example: Web data extraction
[Figure: a web page containing two data regions (Data region1, Data region2), each made up of data records]
Align and extract data items (e.g., region1)
image1 | EN7410 17-inch LCD Monitor Black/Dark charcoal | $299.99 | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image2 | 17-inch LCD Monitor | $249.99 | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image3 | AL1714 17-inch LCD Monitor, Black | $269.99 | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image4 | SyncMaster 712n 17-inch LCD Monitor, Black | Was: $369.99 | $299.99 | Save $70 After: $70 mail-in-rebate(s) | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
Ads vs. search results
Reproduced from Ullman & Rajaraman with permission
Ads vs. search results
Search advertising is the revenue model
• Multi-billion-dollar industry
• Advertisers pay for clicks on their ads
Interesting problems
• How to pick the top 10 results for a search from
2,230,000 matching pages?
• What ads to show for a search?
• If I’m an advertiser, which search terms should I bid on
and how much to bid?
What’s Web Mining?
Discovering interesting and useful information from Web content and usage
• Web search: Google, Yahoo, MSN, Ask, …
• Specialized search: e.g. Froogle (comparison shopping), job ads (Flipdog)
• eCommerce:
• Recommendations: e.g. Netflix, Amazon
• Improving conversion rate: next best product to offer
• Advertising, e.g. Google Adsense
• Fraud detection: click fraud detection, …
• Improving Web site design and performance
Web Mining
• Web mining - data mining techniques to
automatically discover and extract information
from Web documents/services (Etzioni, 1996).
• Web mining research – integrate research from
several research communities (Kosala and
Blockeel, July 2000) such as:
• Database (DB)
• Information retrieval (IR)
• The sub-areas of machine learning (ML)
• Natural language processing (NLP)
Web Mining
• The World Wide Web may have more opportunities
for data mining than any other area
• However, there are serious challenges:
• It is too huge
• Complexity of Web pages is greater than any traditional
text document collection
• It is highly dynamic
• It has a broad diversity of users
• Only a tiny portion of the information is truly useful
How big is the Web?
• Number of pages
• Technically, infinite
• Because of dynamically generated content
• Lots of duplication (30-40%)
• Best estimate of “unique” static HTML pages comes from search engine claims
• Google = 8 billion, Yahoo = 20 billion
• Lots of marketing hype
Why Mine the Web?
• Enormous wealth of textual information on the Web.
• Book/CD/Video stores (e.g., Amazon)
• Restaurant information (e.g., Zagats)
• Car prices (e.g., Carpoint)
• Lots of data on user access patterns
• Web logs contain sequence of URLs accessed by users
• Possible to retrieve “previously unknown” information
• People who ski also frequently break their leg.
• Restaurants that serve seafood in California are likely to be outside San Francisco
In May 2014 there were 975,262,468 sites, 16 million more than the previous month
Unique Features of the Web
• The Web is a huge collection of documents
where many contain:
• Hyper-link information
• Access and usage information
• The Web is very dynamic
• Web pages are constantly being generated (and removed)
Challenge: develop new Web mining algorithms that
• Exploit hyperlinks and access patterns
• Adapt to their document sources
Web Mining vs Data Mining
Structure
• The Web is not a relation
• Textual information and linkage structure
Scale
• Usage data is huge and growing rapidly
• Data generated per day is comparable to
largest conventional data warehouses
Speed
• Often need to react to evolving usage
patterns in real-time (e.g., merchandising)
• No human in the loop
Web Mining Taxonomy
• Web Content Mining
  • Web Page Content Mining: identifies information within given web pages; distinguishes personal home pages from other web pages
  • Search Result Mining: categorizes documents using phrases in titles and snippets
• Web Structure Mining: uses interconnections between web pages to give weight to pages
• Web Usage Mining
  • General Access Pattern Tracking: understands access patterns and trends to improve structure
  • Customized Usage Tracking: analyzes access patterns of a user to improve response
Mining the World Wide Web
Web Page Content Mining (under Web Content Mining)
• Web Page Summarization
• WebOQL (Mendelzon et al. 1998), …: Web structuring query languages; can identify information within given web pages
• (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages
• ShopBot (Etzioni et al. 1997): looks for product prices within web pages
Mining the World Wide Web
Search Result Mining (under Web Content Mining)
• Search Engine Result Summarization
• Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets
Mining the World Wide Web
Web Structure Mining
• Using Links
• PageRank (Brin et al., 1998)
• CLEVER (Chakrabarti et al., 1998)
• Use interconnections between web pages to give weight to pages.
• Using Generalization
• MLDB (1994)
• Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
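To make "use interconnections between web pages to give weight to pages" concrete, here is a minimal power-iteration sketch in the spirit of PageRank. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page toy graph are all assumptions for illustration; this is not the production algorithm of any search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph the page with the most incoming links ends up with the highest weight and the page with no incoming links with the lowest, which is the intuition behind link-based ranking.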
Mining the World Wide Web
General Access Pattern Tracking (under Web Usage Mining)
• Web Log Mining (Zaïane, Xin and Han, 1998)
• Uses KDD techniques to understand general access patterns and trends. Can shed light on better structure and grouping of resource providers.
Mining the World Wide Web
Customized Usage Tracking (under Web Usage Mining)
• Adaptive Sites (Perkowitz and Etzioni, 1997)
• Analyzes access patterns of each user at a time.
• Web site restructures itself automatically by learning from user access patterns.
Web Content Mining Approaches
• Information Retrieval Approach
• Assists or improves information finding, or filters information for users, usually based on either inferred or solicited user profiles.
• Database Approach
• Models the data on the Web and integrates it so that queries more sophisticated than keyword-based search can be performed.
Web Content Mining
• IR View
  • View of Data: unstructured, semi-structured
  • Main Data: text documents, hypertext documents
  • Representation: bag of words, n-grams, terms, phrases, concepts or ontology, relational
  • Methods: machine learning, statistics
  • Applications: categorization, clustering, finding extraction rules, finding patterns in text, user modeling
• DB View
  • View of Data: semi-structured, Web site as DB
  • Main Data: hypertext documents
  • Representation: edge-labeled graph, relational
  • Methods: ILP, association rules
  • Applications: finding frequent substructures, Web site schema discovery
Issues in Web Content Mining
• Developing intelligent tools for IR
• Finding keywords & key phrases
• Discovering grammatical rules & collocations
• Hypertext classification/categorization
• Extracting key phrases from HTML documents
• Extraction of learning models/rules
• Hierarchical clustering
• Predicting word relationships
• Building Web query systems (WebOQL, XMLQL)
• Mining multimedia data
Web Structure Mining
• Discovers the link structure of hyperlinks at the inter-document level in order to build a structural summary of a website
• Direction 1: based on hyperlinks, categorizing Web pages & the generated information
• Direction 2: discovering the structure of the Web document itself
• Direction 3: discovering the nature of the hierarchy/network of hyperlinks in a particular website
Web Structure Mining
• Finding authoritative web pages
• Retrieving pages that are not only relevant, but also of high quality/authoritative on the topic
• Hyperlinks can indicate authority
• The Web also contains hyperlinks from one page to another
• Hyperlinks contain an enormous amount of human annotation
• A hyperlink that points to another page can be considered the author's endorsement of that page
Web Usage Mining
• View of Data: interactivity
• Main Data: server logs, browser logs
• Representation: relational table, graph
• Methods: machine learning, statistics, association rules
• Applications: site construction, adaptation & management; marketing; user modeling
Web Usage Mining
• Web usage mining is also called Web log mining
• A mining technique for discovering interesting usage patterns from secondary data derived from users' interactions while browsing the web
Web Usage Mining
• Applications
• Targeting potential customers for electronic products
• Enhancing the quality and delivery of Internet information services to end users
• Improving web server system performance
• Identifying potential advertising locations
• Facilitating personalization/adaptive sites
• Improving site design
• Fraud/intrusion detection
• Predicting user actions
Log Data - Simple Analysis
• Statistical analysis of users
– Length of path
– Viewing time
– Number of page views
• Statistical analysis of site
– Most common pages viewed
– Most common invalid URL
Web Log – Data Mining Applications
• Association rules
– Find pages that are often viewed together
• Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
• Classification
– Relate user attributes to patterns
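The first application, finding pages that are often viewed together, can be sketched as counting page pairs across user sessions and keeping those above a minimum support. The session data, the function name `frequent_page_pairs`, and the 0.5 support threshold are hypothetical; a full association-rule miner would also handle larger itemsets and rule confidence.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support):
    """Return page pairs whose share of sessions is at least min_support."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    n = len(sessions)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


# Hypothetical sessions reconstructed from a web log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
pairs = frequent_page_pairs(sessions, min_support=0.5)
print(pairs)
```

Pairs with high support are candidates for cross-links or shortcuts between the two pages.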
Common Log Format
• Remotehost: browser hostname or IP address
• Remote log name of user (almost always "-", meaning "unknown")
• Authuser: authenticated username
• Date: date and time of the request
• "request": exact request line from the client
• Status: the HTTP status code returned
• Bytes: the content-length of the response
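The fields above can be pulled out of a log line with one regular expression, after which the simple statistics mentioned earlier (most common pages viewed, most common invalid URL) are just counts. The sample log lines, the group names, and the helper `parse_clf` are assumptions made for this sketch.

```python
import re
from collections import Counter

# One named group per Common Log Format field.
CLF_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<logname>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up log lines for illustration.
LOG = [
    '192.0.2.1 - - [12/May/2014:10:00:00 +0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [12/May/2014:10:00:05 +0700] "GET /missing.html HTTP/1.0" 404 -',
    '192.0.2.7 - alice [12/May/2014:10:01:00 +0700] "GET /index.html HTTP/1.0" 200 2326',
]
records = [r for r in map(parse_clf, LOG) if r is not None]

# Request line is "METHOD URL PROTOCOL"; the URL is the second token.
page_views = Counter(r["request"].split()[1] for r in records if r["status"] == "200")
invalid_urls = Counter(r["request"].split()[1] for r in records if r["status"] == "404")
print(page_views.most_common(1), invalid_urls.most_common(1))
```

Grouping the parsed records by remote host and ordering them by date would give the per-user paths needed for the session-based analyses.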
Web search basics
[Figure: search engine components (The Web, Web crawler, Indexer, Indexes, Ad indexes, Search) next to a screenshot of a results page for the query "miele": Results 1 - 10 of about 7,310,000 (0.12 seconds), with Sponsored Links shown beside the organic results from www.miele.com, www.miele.co.uk, www.miele.de and www.miele.at]
Mining the Web
[Figure: a Spider collects documents from the Web into a documents source; given a Query, the IR / IE System returns Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
Data Mining: Principles and Algorithms
Search Engine
[Figure: search engine architecture. Components: Web Pages, Web Page Parser, URL Dictionary, Web Graph Constructor, Forward Link, Forward Index, Meta Data, Anchor Text Generator, Web Topology Graph, Term Dictionary (Lexicon), Indexer, Inverted Index, Backward Link (Anchor Text), Relevance Ranking (similarity based on content or text), Importance Ranking (link analysis; ranking based on link structure analysis), Rank Functions, Search]
Layout Structure
• Compared to plain text, a web page is a 2D presentation
• Rich visual effects created by different term types, formats,
separators, blank areas, colors, pictures, etc
• Different parts of a page are not equally important
Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD
…
TEXT BODY (with position and font
type): The International Atomic Energy
Agency has concluded that Iran has
secretly produced small amounts of
nuclear materials including low enriched
uranium and plutonium that could be used
to develop nuclear weapons according to a
confidential report obtained by CNN…
Hyperlink:
• URL: http://www.cnn.com/...
• Anchor Text: Al Qaeda…
Web Page Block: Better Information Unit
[Figure: a web page divided into blocks, labeled Importance = Low, Importance = Med, Importance = High]
Web Usage Mining
Applications:
Simple and Basic:
• Monitor performance, bandwidth usage
• Catch errors (404 errors- pages not found)
• Improve web site design
• (shortcuts for frequent paths, remove links not used, etc)
Advanced and Business Critical :
• eCommerce: improve conversion, sales, profit
• Fraud detection: click stream fraud, …
Web Usage Mining – Three Phases
Web Usage Mining Issues
• Identification of exact user not possible.
• Exact sequence of pages referenced by a user not
possible due to caching.
• Session not well defined
• Security, privacy, and legal issues
Systems Issues
• Web data sets can be very large
• Tens to hundreds of terabytes
• Cannot mine on a single server!
• Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets