Text and Web Mining

Published on May 2016

5/12/2014

Lecture 12

Text & Web Mining
Data Mining – Computer Science, IPB

Structured data
• So far we have been dealing with structured data:

Attribute | Value
Outlook | Sunny
Temperature | Hot
Windy | Yes
Humidity | High
Play | Yes

• Data mining generally uses data of this kind

Complex Data Types
• The growth of complex data
• Spatial data: geographic data, medical & satellite images
• Multimedia data: images, audio, & video
• Time-series data: banking data & stock exchange data
• Text data: word descriptions for objects
• World-Wide-Web: highly unstructured text & multimedia data

Text Databases
• In practice there are many text databases:
• news articles
• research papers
• books
• digital libraries
• e-mail
• web pages
• Growing rapidly in both volume and importance (80%)

Text Mining
• Text mining refers to data mining that uses text documents as its data
• Almost all text mining tasks use Information Retrieval (IR) methods to preprocess text documents.
• These methods differ somewhat from the preprocessing methods used for relational tables
• Web search also has its roots in IR

CS583, Bing Liu, UIC

Definition of Text Mining
• Discover useful and previously unknown “gems” of information in large text collections

Definition of Text Mining
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
Text Mining = Data Mining (applied to text data) + basic linguistics

Definition
• “Previously unknown”?
• Strict definition
• Information that even the author does not know
• Example: discovering a new method for hair growth that is a side effect of some procedure
• Loose definition
• Rediscovering information that the author has already written in the text
• Example: automatically extracting product names from a web page

Text Mining Tasks
• Given:
• A source of textual documents
• A well-defined, limited (text-based) query
• Find:
• Sentences with relevant information
• Extract the relevant information & ignore irrelevant information
• Link related information & output it in a predetermined format

Tasks addressed by TM
• Search and retrieval
• Semantic analysis
• Clustering
• Categorization
• Feature extraction
• Ontology building
• Dynamic focusing


DM vs TM
 | Data Mining | Text Mining
Object of investigation | Numerical and categorical data | Texts
Object structure | Relational databases | Free form texts
Goal | Predict outcomes of future situations | Retrieve relevant information, distill the meaning, categorize and target-deliver
Methods | Machine learning: SKAT, DT, NN, GA, MBR, MBA | Indexing, special neural network processing, linguistics, ontologies
Current market size | 100,000 analysts at large and midsize companies | 100,000,000 corporate workers and individual users
Maturity | Broad implementation since 1994 | Broad implementation starting 2000

“Search” vs “Discover”
 | Search (goal-oriented) | Discover (opportunistic)
Structured Data | Data Retrieval | Data Mining
Unstructured Data (Text) | Information Retrieval | Text Mining

Applications of Text Mining
• Marketing: finding groups of potential buyers based on users' text profiles
• e.g., amazon
• Industry: identifying the web sites of competitor groups
• Competitors' products and their prices
• Job search: identifying parameters in a job search
• www.flipdog.com

Applications of Text Mining
• Search engines
• Enterprise portals
• Knowledge management systems
• e-Business systems
• Vertical applications:
• e-mail categorization and routing
• Call center notes categorization
• CRM systems


[Figure: retrieval system diagram. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, INDEX, Ranking, Text Database]

Search Subsystem
[Figure: query → parse query → query tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → Boolean operations* (against the inverted file system) → retrieved document set → ranking* → ranked document set. A "relevant document set" box also appears in the diagram. * indicates an optional operation.]

Indexing Subsystem
[Figure: documents → assign document IDs → text, document numbers and *field numbers → break into tokens → tokens → stop list* → non-stoplist tokens → stemming* → stemmed terms → term weighting* → terms with weights → inverted file system. * indicates an optional operation.]
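The indexing and search steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list, the crude suffix-stripping `stem` function, the use of raw term counts as weights, and the sample documents are all hypothetical and not taken from the slides.

```python
import re
from collections import defaultdict

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def tokenize(text):
    """Break text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Very crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def build_inverted_index(documents):
    """Map each stemmed, non-stoplist term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):  # assign document IDs
        for token in tokenize(text):                     # break into tokens
            if token in STOP_WORDS:                      # stop list (optional)
                continue
            term = stem(token)                           # stemming (optional)
            # term weighting (optional): here simply the raw count
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

def boolean_and(index, query):
    """Return the IDs of documents that contain every query term."""
    terms = [stem(t) for t in tokenize(query) if t not in STOP_WORDS]
    if not terms:
        return set()
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings)

docs = [
    "Mining the Web for text",
    "Text mining uses indexing",
    "The web is a graph of pages",
]
index = build_inverted_index(docs)
print(boolean_and(index, "text mining"))
```

The query goes through the same tokenizing, stop-list and stemming steps as the documents, so query terms and index terms stay comparable.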

Text Mining
[Figure: text mining workflow. Labels: Sample Documents, Text document, Transformed, Representation models, Learning, Working, Domain specific templates/models, Visualizations]

Text characteristics: Outline
• Large textual data base
• High dimensionality
• Several input modes
• Dependency
• Ambiguity
• Noisy data
• Not well structured text

Text characteristics
• Large textual data base
• Efficiency consideration
• over 2,000,000,000 web pages
• almost all publications are also in electronic form

• High dimensionality (Sparse input)
• Consider each word/phrase as a dimension
• Several input modes
• e.g., Web mining: information about the user is generated from semantics, browsing patterns and outside knowledge bases.

Text characteristics
• Dependency
• relevant information is a complex conjunction of
words/phrases
• e.g., document categorization, pronoun disambiguation.

• Ambiguity
• Word ambiguity
• Pronouns (he, she …)
• “buy”, “purchase”

• Semantic ambiguity
• The king saw the rabbit with his glasses. (8 meanings)

Text characteristics
• Noisy data
• Example: Spelling mistakes

• Not well structured text
• Chat rooms
• “r u available ?”
• “Hey whazzzzzz up”

• Speech


Text mining process
• Text preprocessing
• Syntactic/Semantic text analysis
• Features Generation
• Bag of words
• Features Selection
• Simple counting
• Statistics
• Text/Data Mining
• Classification: supervised learning
• Clustering: unsupervised learning
• Analyzing results

Syntactic / Semantic text analysis
• Part of Speech (POS) tagging
• Find the corresponding POS for each word
• e.g., John (noun) gave (verb) the (det) ball (noun)
• ~98% accurate.

• Word sense disambiguation
• Context based or proximity based
• Very accurate

• Parsing
• Generates a parse tree (graph) for each sentence
• Each sentence is a stand alone graph

Feature Generation: Bag of words
• Text document is represented by the words it contains (and their occurrences)
• e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
• Highly efficient
• Makes learning far simpler and easier
• Order of words is not that important for certain applications
• Stemming: identifies a word by its root
• e.g., flying, flew → fly
• Reduce dimensionality
• Stop words: The most common words are unlikely to help text mining
• e.g., “the”, “a”, “an”, “you” …
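A minimal Python sketch of the bag-of-words representation with stop-word removal and stemming. The stop list contains only the four words named above, and `STEM_TABLE` is a hypothetical lookup table standing in for a real stemmer, since an irregular form such as "flew" cannot be reduced by suffix stripping alone.

```python
import re
from collections import Counter

# Stop list limited to the examples given on the slide.
STOP_WORDS = {"the", "a", "an", "you"}

# Hypothetical lookup-table "stemmer", for illustration only.
STEM_TABLE = {"flying": "fly", "flew": "fly"}

def bag_of_words(text, remove_stop_words=True, use_stemming=True):
    """Return a Counter mapping each word to its number of occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemming:
        tokens = [STEM_TABLE.get(t, t) for t in tokens]
    return Counter(tokens)

# Word order is discarded: only the words and their counts remain.
print(bag_of_words("Lord of the rings", remove_stop_words=False, use_stemming=False))
print(bag_of_words("The bird flew; birds are flying"))
```

Both stop-word removal and stemming shrink the vocabulary, which is the dimensionality reduction the slide refers to.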


Feature Generation: D2K Example

Original text:
Hi,
Here is your weekly update (that unfortunately hasn't gone out in about a month). Not much action here right now.
1) Due to the unwavering insistence of a member of the group, the ncsa.d2k.modules.core.datatype package is now completely independent of the d2k application.
2) Transformations are now handled differently in Tables. Previously, transformations were done using a TransformationModule. That module could then be added to a list that an ExampleTable kept. Now, there is an interface called Transformation and a sub-interface called ReversibleTransformation.

After lowercasing and stop-word removal:
hi, weekly update (that unfortunately gone out month). much action here right now. 1) due unwavering insistence member group, ncsa.d2k.modules.core.datatype package now completely independent d2k application. 2) transformations now handled differently tables. previously, transformations done using transformationmodule. module added list exampletable kept. now, interface called transformation sub-interface called reversibletransformation.

After stemming:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules core datatype package now complete independence d2k application 2 transformation now handle different table previous transformation do use transformationmodule module add list exampletable keep now interface call transformation sub-interface call reversibletransformation

Feature Generation: XML
• Current keyword-oriented search engines cannot handle rich queries like
• Find all books authored by “Scooby-Doo”.

XML: Extensible Markup Language
• XML documents have a nested structure in which each
element is associated with a tag.
• Tags describe the semantics of elements.

<book> <title> The making of a bad movie </title>
<author> <name> Scooby-Doo </name>
<affiliation> Cartoons </affiliation> </author>
</book>
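Because the tags carry the semantics, a query such as "find all books authored by Scooby-Doo" can be answered from the structure rather than from keywords. The sketch below uses Python's standard `xml.etree.ElementTree`; the `<library>` wrapper element and the second `<book>` entry are hypothetical additions so that the query has something to filter out.

```python
import xml.etree.ElementTree as ET

# The <library> wrapper and the second <book> are hypothetical additions.
XML_DATA = """
<library>
  <book> <title> The making of a bad movie </title>
    <author> <name> Scooby-Doo </name>
      <affiliation> Cartoons </affiliation> </author>
  </book>
  <book> <title> Another title </title>
    <author> <name> Someone Else </name>
      <affiliation> Unknown </affiliation> </author>
  </book>
</library>
"""

def titles_by_author(xml_text, author_name):
    """Return the titles of all <book> elements whose author has the given name."""
    root = ET.fromstring(xml_text.strip())
    titles = []
    for book in root.iter("book"):
        names = [n.text.strip() for n in book.findall("./author/name") if n.text]
        if author_name in names:
            titles.append(book.findtext("title", default="").strip())
    return titles

print(titles_by_author(XML_DATA, "Scooby-Doo"))
```

A keyword search for "Scooby-Doo" would also match books that merely mention the name in their title; the structured query only matches the `<author>` element.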


Feature selection
• Reduce dimensionality
• Learners have difficulty addressing tasks with high
dimensionality
• Irrelevant features
• Not all features help!
• e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”
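One simple counting approach to feature selection is document-frequency thresholding: drop terms that occur in too few documents (too rare to generalize) or in nearly all of them (no discriminative value). The sketch below is an assumed illustration of that idea; the thresholds `min_df` and `max_df_ratio` and the toy documents are hypothetical and not taken from the slides.

```python
from collections import Counter

def select_features(tokenized_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies within the given bounds."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term at most once per document
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )

# Hypothetical toy corpus: two sport and two politics documents.
docs = [
    ["match", "goal", "team", "report"],
    ["election", "vote", "report", "party"],
    ["team", "goal", "coach", "report"],
    ["vote", "party", "minister", "report"],
]
print(select_features(docs))
```

Here "report" appears in every document and is dropped, much like the noun example above: a feature present everywhere cannot separate “politics” from “sport”.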

Feature selection: D2K Example I

Features before selection:
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, 1, due, unwaver, insistence, member, group, ncsa, d2k, modules, do, core, datatype, package, complete, independence, application, 2, transformation, handle, different, table, previous, use, transformationmodule, add, list, exampletable, keep, interface, call, sub-interface, reversibletransformation

Features after selection:
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, due, insistence, member, group, ncsa, d2k, modules, do, core, datatype, package, complete, independence, application, transformation, handle, different, table, previous, use, add, list, keep, interface, call, sub-interface

Feature selection: D2K Example II

Features after the first selection step:
hi, week, update, unfortunate, go, out, month, much, action, here, right, now, due, insistence, member, group, ncsa, d2k, modules, do, core, datatype, package, complete, independence, application, transformation, handle, different, table, previous, use, add, list, keep, interface, call, sub-interface

Features after the second selection step:
hi, week, update, unfortunate, month, action, right, due, insistence, member, group, ncsa, d2k, modules, core, datatype, package, complete, independence, application, transformation, handle, different, table, previous, add, list, interface, call, sub-interface

Text Mining: Classification definition
• Given: a collection of labeled records (training set)
• Each record contains a set of features (attributes), and the true class (label)
• Find: a model for the class as a function of the values of the features
• Goal: previously unseen records should be assigned a class as accurately as possible
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
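The train/test procedure described above can be sketched with a small multinomial Naive Bayes text classifier. Naive Bayes is chosen here only as a convenient example of a learner; the slides do not prescribe one. The documents and labels are hypothetical toy data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(records):
    """records: list of (tokens, label). Returns the counts that form the model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in records:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts, "vocab": vocab}

def predict(model, tokens):
    """Return the label with the highest log-probability for the given tokens."""
    total = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n / total)                         # class prior
        for t in tokens:
            if t in model["vocab"]:                         # skip unseen words
                score += math.log((counts[t] + 1) / denom)  # Laplace smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, records):
    """Fraction of records whose predicted label equals the true label."""
    correct = sum(predict(model, tokens) == label for tokens, label in records)
    return correct / len(records)

# Hypothetical toy data, already split into training and test sets.
training_set = [
    ("football fans riot after match".split(), "sport"),
    ("team wins football championship".split(), "sport"),
    ("parliament passes new budget law".split(), "politics"),
    ("minister wins election vote".split(), "politics"),
]
test_set = [
    ("football match tonight".split(), "sport"),
    ("election law debate".split(), "politics"),
]
model = train_naive_bayes(training_set)   # build the model on the training set
print(accuracy(model, test_set))          # validate it on unseen records
```

The test records are never shown to `train_naive_bayes`, so the reported accuracy estimates how the model behaves on previously unseen documents.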


Text Mining: Clustering definition
• Given: a set of documents and a similarity measure among documents
• Find: clusters such that:
• Documents in one cluster are more similar to one another
• Documents in separate clusters are less similar to one another

• Goal:
• Finding a correct set of documents

Similarity Measures:
• Euclidean Distance if attributes are continuous
• Other Problem-specific Measures
• e.g., how many words are common in these documents
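The two kinds of measure listed above can be written down directly. `euclidean_distance` and `common_words` follow the slide; `jaccard_similarity` is an added, commonly used normalization of the common-word count and is an assumption, not something the slides mention.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def common_words(doc_a, doc_b):
    """Number of distinct words that occur in both documents."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

def jaccard_similarity(doc_a, doc_b):
    """Shared words divided by all distinct words (an assumed normalization)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(euclidean_distance([0, 0], [3, 4]))
print(common_words("text mining on the web", "web mining"))
print(jaccard_similarity("text mining on the web", "web mining"))
```

Note the direction: a smaller Euclidean distance means more similar, while a larger word overlap means more similar.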

Example
GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.
I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital.
The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out. …

Summary:
Feature1: picture
Positive: 12
• The pictures coming out of this camera are amazing.
• Overall this is a good camera with a really good picture clarity.
Negative: 2
• The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
• Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.
Feature2: battery life

CS583, Bing Liu, UIC

Visual Comparison
[Figure: two charts of positive (+) and negative (_) opinions per feature (Picture, Battery, Zoom, Size, Weight): a summary of reviews of Digital camera 1, and a comparison of reviews of Digital camera 1 and Digital camera 2]

CS583, Bing Liu, UIC

Information Extraction
Posting from Newsgroup
Telecommunications. Solaris Systems
Administrator. 55-60K. Immediate need.
3P is a leading telecommunications firm
in need of a energetic individual to
fill the following position in the
Atlanta office:
SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia no relocation
assistance provided

FILLED TEMPLATE
job title: SOLARIS SYSTEM ADMINISTRATOR
salary: 55-60K
city: Atlanta
state: Georgia
platform: SOLARIS
area: Telecommunications
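A hedged sketch of how such a template could be filled with hand-written regular expressions. The patterns are assumptions tailored to this one posting (the job title is taken to be the first all-capitals line, the salary the first range ending in K), and only four of the six slots are attempted; real information extraction systems typically learn their extraction rules. Note that the posting mentions both 55-60K and 50-60K, and the first match happens to agree with the filled template.

```python
import re

POSTING = """Telecommunications. Solaris Systems
Administrator. 55-60K. Immediate need.
3P is a leading telecommunications firm
in need of a energetic individual to
fill the following position in the
Atlanta office:
SOLARIS SYSTEM ADMINISTRATOR
Salary: 50-60K with full benefits
Location: Atlanta, Georgia no relocation
assistance provided"""

def fill_template(text):
    """Fill a few template slots using simple, posting-specific patterns."""
    template = {}
    # Assumption: the job title is the first line written entirely in capitals.
    title = re.search(r"^[A-Z][A-Z ]+$", text, flags=re.MULTILINE)
    if title:
        template["job title"] = title.group(0).strip()
    # Assumption: the salary is the first "<number>-<number>K" range in the text.
    salary = re.search(r"\b(\d+-\d+K)\b", text)
    if salary:
        template["salary"] = salary.group(1)
    # Assumption: the location is written as "Location: <City>, <State>".
    location = re.search(r"Location:\s*([A-Z][a-z]+),\s*([A-Z][a-z]+)", text)
    if location:
        template["city"] = location.group(1)
        template["state"] = location.group(2)
    return template

print(fill_template(POSTING))
```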


Classification: An Example
[Figure: a training set table of 10 records with columns Ex#, Country, Marital Status, Income and Hooligan (Yes/No), e.g. record 3: England, Single, 70K, Yes; record 4: Italy, Married, 40K, No; record 5: USA, Divorced, 95K, No. A test set table with the same attributes and unknown Hooligan labels (?), e.g. England, Single, 75K, ?. Flow: Training Set → Learn Classifier → Model, which is then applied to the Test Set.]

Text Classification: An Example
[Figure: a training set of short texts labeled Hooligan = Yes/No, including: "An English football fan", "During a game in Italy", "England has been beating France …", "Italian football fans were cheering …", "An average USA salesman earns 75K", "The game in London was horrific", "Manchester city is likely to win the championship", "Rome is taking the lead in the football league". Test set texts with unknown labels (?): "A Danish football fan", "Turkey is playing vs. France. The Turkish fans …". Flow: Training Set → Learn Classifier → Model, which is then applied to the Test Set.]

Web Mining
Data Mining – Computer Science, IPB

Web Mining
[Figure: labels WWW and Knowledge]

Example: Web data extraction
[Figure: a web page annotated with Data region1, which contains several data records, and Data region2]

CS583, Bing Liu, UIC

Align and extract data items (e.g., region1)
image1 | EN7410 17-inch LCD Monitor Black/Dark charcoal | | $299.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image2 | 17-inch LCD Monitor | | $249.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image3 | AL1714 17-inch LCD Monitor, Black | | $269.99 | | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare
image4 | SyncMaster 712n 17-inch LCD Monitor, Black | Was: $369.99 | $299.99 | Save $70 After: $70 mail-in-rebate(s) | Add to Cart | (Delivery / Pick-Up) | Penny Shopping | Compare

CS583, Bing Liu, UIC

Ads vs. search results

Reproduced from Ullman & Rajaraman with permission

Ads vs. search results
Search advertising is the revenue model
• Multi-billion-dollar industry
• Advertisers pay for clicks on their ads

Interesting problems
• How to pick the top 10 results for a search from
2,230,000 matching pages?
• What ads to show for a search?
• If I’m an advertiser, which search terms should I bid on
and how much to bid?
Reproduced from Ullman & Rajaraman with permission


What’s Web Mining?
Discovering interesting and useful information from Web content and usage
• Web search: Google, Yahoo, MSN, Ask, …
• Specialized search: e.g. Froogle (comparison shopping), job ads (Flipdog)
• eCommerce:
• Recommendations: e.g. Netflix, Amazon
• improving conversion rate: next best product to offer
• Advertising, e.g. Google Adsense
• Fraud detection: click fraud detection, …
• Improving Web site design and performance

Web Mining
• Web mining: the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996).
• Web mining research integrates research from several research communities (Kosala and Blockeel, July 2000) such as:
• Database (DB)
• Information retrieval (IR)
• The sub-areas of machine learning (ML)
• Natural language processing (NLP)

Web Mining
• The World Wide Web may have more opportunities for data mining than any other area
• However, there are serious challenges:
• It is huge
• Complexity of Web pages is greater than that of any traditional text document collection
• It is highly dynamic
• It has a broad diversity of users
• Only a tiny portion of the information is truly useful

How big is the Web?
• Technically, infinite
• Because of dynamically generated content
• Lots of duplication (30-40%)
• Number of pages: the best estimate of “unique” static HTML pages comes from search engine claims
• Google = 8 billion, Yahoo = 20 billion
• Lots of marketing hype

Reproduced from Ullman & Rajaraman with permission


Why Mine the Web?
• Enormous wealth of textual information on the Web.
• Book/CD/Video stores (e.g., Amazon)
• Restaurant information (e.g., Zagats)
• Car prices (e.g., Carpoint)

• Lots of data on user access patterns
• Web logs contain sequence of URLs accessed by users

• Possible to retrieve “previously unknown” information
• People who ski also frequently break their leg.
• Restaurants that serve sea food in California are likely to be outside San-Francisco

In May 2014: 975,262,468 sites, 16 million more than the previous month

http://news.netcraft.com/archives/category/web-server-survey/


Unique Features of the Web
• The Web is a huge collection of documents, many of which contain:
• Hyper-link information
• Access and usage information
• The Web is very dynamic
• Web pages are constantly being generated (and removed)

Challenge: Develop new Web mining algorithms to . . .
• Exploit hyper-links and access patterns.
• Be adaptable to its document sources

Web Mining vs Data Mining

Structure
• The Web is not a relation
• Textual information and linkage structure

Scale
• Usage data is huge and growing rapidly
• Data generated per day is comparable to the largest conventional data warehouses

Speed
• Often need to react to evolving usage patterns in real-time (e.g., merchandising)
• No human in the loop

Web Mining Taxonomy
• Web Content Mining
– Web Page Content Mining: identify information within given web pages; distinguish personal home pages from other web pages
– Search Result Mining: categorizes documents using phrases in titles and snippets
• Web Structure Mining
– Uses interconnections between web pages to give weight to pages
• Web Usage Mining
– General Access Pattern Tracking: understand access patterns and trends to improve structure
– Customized Usage Tracking: analyzes access patterns of a user to improve response

Mining the World Wide Web
Web Content Mining: Web Page Content Mining
• Web Page Summarization
– WebOQL (Mendelzon et al. 1998) …: Web structuring query languages; can identify information within given web pages
– (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages
– ShopBot (Etzioni et al. 1997): looks for product prices within web pages

Mining the World Wide Web
Web Content Mining: Search Result Mining
• Search Engine Result Summarization
– Clustering Search Result (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets

Mining the World Wide Web
Web Structure Mining
• Using Links
– PageRank (Brin et al., 1998)
– CLEVER (Chakrabarti et al., 1998)
– Use interconnections between web pages to give weight to pages.
• Using Generalization
– MLDB (1994)
– Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.

Mining the World Wide Web
Web Usage Mining: General Access Pattern Tracking
• Web Log Mining (Zaïane, Xin and Han, 1998)
– Uses KDD techniques to understand general access patterns and trends.
– Can shed light on better structure and grouping of resource providers.

Mining the World Wide Web
Web Usage Mining: Customized Usage Tracking
• Adaptive Sites (Perkowitz and Etzioni, 1997)
– Analyzes access patterns of each user at a time.
– Web site restructures itself automatically by learning from user access patterns.

Web Content Mining Approaches
• Information Retrieval Approach
• To assist or improve information finding, or to filter information for users, usually based on either inferred or solicited user profiles.
• Database Approach
• To model the data on the Web and to integrate them so that queries more sophisticated than keyword-based ones can be performed.

Web Content Mining
 | IR View | DB View
View of Data | Unstructured; Semi-structured | Semi-structured; Web site as DB
Main Data | Text documents; Hypertext documents | Hypertext documents
Representation | Bag of words, n-grams; Terms, phrases; Concepts or ontology; Relational | Edge-labeled graph; Relational
Methods | Machine Learning; Statistics | ILP; Association rules
Applications | Categorization; Clustering; Finding extraction rules; Finding patterns in text; User modeling | Finding frequent substructures; Web site schema discovery

Issues in Web Content Mining
• Developing intelligent tools for IR
• Finding keywords & key phrases
• Discovering grammatical rules & collocations
• Hypertext classification/categorization
• Extracting key phrases from HTML documents
• Learning extraction models/rules
• Hierarchical clustering
• Predicting word relationships
• Building Web query systems (WebOQL, XMLQL)
• Mining multimedia data

Web Structure Mining
View of Data | Links structure
Main Data | Links structure
Representation | Graph
Methods | Proprietary algorithms
Applications | Categorization; Clustering

Web Structure Mining
• To discover the link structure of hyperlinks at the inter-document level in order to build a structural summary of a Web site
• Direction 1: based on hyperlinks, categorizing Web pages & the generated information
• Direction 2: discovering the structure of the Web document itself
• Direction 3: discovering the nature of the hierarchy/network of hyperlinks in a particular Web site

Web Structure Mining
• Finding authoritative Web pages
• Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
• Hyperlinks can indicate authority
• The Web also contains hyperlinks from one page to another
• Hyperlinks contain a large amount of human annotation
• A hyperlink pointing to another page can be considered the author's endorsement of that page
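The idea that each hyperlink acts as an endorsement can be sketched with a simplified PageRank computed by power iteration. This is an illustration, not the exact algorithm of any search engine: the damping factor of 0.85 is the conventionally quoted value, the even redistribution of rank from pages without outgoing links is one common convention, and the four-page link graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page passes its rank on in equal shares to the pages it links to
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # dangling page (no outgoing links): spread its rank over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C, which receives the most endorsements, ends up with the highest rank, and A benefits from being linked by the highly ranked C.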


Web Usage Mining
View of Data | Interactivity
Main Data | Server logs; Browser logs
Representation | Relational table; Graph
Methods | Machine learning; Statistics; Association rules
Applications | Site construction, adaptation & management; Marketing; User modeling

Web Usage Mining
• Web usage mining is also called Web log mining
• Mining techniques for discovering interesting usage patterns from the secondary data derived from users' interactions while browsing the Web

Web Usage Mining
• Applications
• Targeting potential customers for electronic products
• Enhancing the quality and delivery of Internet Information Services to end users
• Improving Web server system performance
• Identifying potential advertisement locations
• Facilitating personalization/adaptive sites
• Improving site design
• Fraud/intrusion detection
• Predicting user actions

Log Data - Simple Analysis
• Statistical analysis of users

– Length of path
– Viewing time
– Number of page views
• Statistical analysis of site

– Most common pages viewed
– Most common invalid URL


Web Log – Data Mining Applications
• Association rules

– Find pages that are often viewed together
• Clustering

– Cluster users based on browsing patterns
– Cluster pages based on content
• Classification

– Relate user attributes to patterns
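As a sketch of the association-rule application, the code below counts pairs of pages viewed in the same session and reports, for each frequent pair, its support (fraction of sessions containing both pages) and confidence (fraction of sessions with the first page that also contain the second). It is restricted to page pairs, which is a simplification of general association rule mining; the sessions and the `min_support` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support=0.5):
    """Return sorted (antecedent, consequent, support, confidence) tuples."""
    n = len(sessions)
    page_counts, pair_counts = Counter(), Counter()
    for session in sessions:
        pages = sorted(set(session))                 # ignore repeat views in a session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / page_counts[a]))  # a → b
            rules.append((b, a, support, count / page_counts[b]))  # b → a
    return sorted(rules)

# Hypothetical sessions, each a list of URLs viewed by one visitor.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
for a, b, support, confidence in frequent_page_pairs(sessions):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every session that contains /cart also contains /products, so the rule /cart → /products has confidence 1.0 even though its support is only 0.5.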

Common Log Format
• Remotehost: browser hostname or IP address
• Remote log name of user (almost always "-", meaning "unknown")
• Authuser: authenticated username
• Date: date and time of the request
• "request": the exact request line from the client
• Status: the HTTP status code returned
• Bytes: the content-length of the response


SERVER LOGS


Fields
• Client IP: 128.101.228.20
• Authenticated User ID: -
• Time/Date: [10/Nov/1999:10:16:39 -0600]
• Request: "GET / HTTP/1.0"
• Status: 200
• Bytes:
• Referrer: "-"
• Agent: "Mozilla/4.61 [en] (WinNT; I)"
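A hedged sketch of parsing such a log entry with a regular expression. The sample line is assembled from the field values listed above; because the byte count is not given there, "-" (no value) is assumed in its place. The optional referrer and agent fields follow the widely used "combined" extension of the Common Log Format.

```python
import re

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Assembled from the fields above; the byte count is unknown, so "-" is assumed.
LINE = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] '
        '"GET / HTTP/1.0" 200 - "-" "Mozilla/4.61 [en] (WinNT; I)"')

def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # "-" means the server did not record a size for this response
    entry["bytes"] = None if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

entry = parse_log_line(LINE)
print(entry["host"], entry["time"], entry["request"], entry["status"])
```

Once each line is a dictionary, the simple statistics mentioned earlier (most common pages, most common invalid URLs) reduce to counting over the `request` and `status` fields.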


Searching the Web
[Figure: labels The Web, Content aggregators, Content consumers]

Reproduced from Ullman & Rajaraman with permission

Web search basics
[Figure: search engine diagram. Labels: User, Web, Web crawler, Indexer, Search, The Web, Indexes, Ad indexes. The example screenshot shows "Results 1 - 10 of about 7,310,000 for miele. (0.12 seconds)", with organic results (www.miele.com, www.miele.co.uk, www.miele.de, www.miele.at) next to Sponsored Links (www.cgappliance.com, www.vacuums.com, www.best-vacuum.com).]

Reproduced from Ullman & Rajaraman with permission


Mining the Web
[Figure: a Spider collects pages from the Web into a Documents source; an IR / IE System takes a Query and the Documents source and returns Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]

Data Mining: Principles and Algorithms

Search Engine
[Figure: search engine architecture. Components: Web Pages, Web Page Parser, URL Dictionary, Forward Index, Forward Link, Meta Data, Anchor Text Generator, Web Graph Constructor, Web Topology Graph, Backward Link (Anchor Text), Indexer, Inverted Index, Term Dictionary (Lexicon), Rank Functions, Relevance Ranking (similarity based on content or text), Importance Ranking (Link Analysis; ranking based on link structure analysis), Search]

Layout Structure
• Compared to plain text, a web page is a 2D presentation
• Rich visual effects created by different term types, formats, separators, blank areas, colors, pictures, etc.
• Different parts of a page are not equally important

Title: CNN.com International
H1: IAEA: Iran had secret nuke agenda
H3: EXPLOSIONS ROCK BAGHDAD
TEXT BODY (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials including low enriched uranium and plutonium that could be used to develop nuclear weapons according to a confidential report obtained by CNN…
Hyperlink:
• URL: http://www.cnn.com/...
• Anchor Text: Al Qaeda…
Image:
• URL: http://www.cnn.com/image/...
• Alt & Caption: Iran nuclear …
Anchor Text: CNN Homepage News …

Web Page Block—Better Information Unit
Web Page Blocks

Importance = Low

Importance = Med

Importance = High


Web Usage Mining
Applications:
Simple and Basic:
• Monitor performance, bandwidth usage
• Catch errors (404 errors- pages not found)
• Improve web site design
• (shortcuts for frequent paths, remove links not used, etc)

Advanced and Business Critical :
• eCommerce: improve conversion, sales, profit
• Fraud detection: click stream fraud, …

Web Usage Mining – Three Phases


Web Usage Mining Issues
• Identification of the exact user is not possible.
• The exact sequence of pages referenced by a user cannot be recovered, due to caching.
• Session not well defined
• Security, privacy, and legal issues
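Because sessions are not well defined in raw logs, they are usually approximated with a heuristic. The sketch below groups requests by client IP and starts a new session after a gap of inactivity; both choices are assumptions made for illustration. The 30-minute timeout is a commonly used default rather than something the slides specify, and an IP address only approximates a user, which is exactly the identification issue listed above.

```python
def sessionize(requests, timeout_seconds=1800):
    """requests: iterable of (client_ip, unix_timestamp, url).

    Returns a list of (client_ip, [urls]) sessions. A new session starts when
    the gap since the client's previous request exceeds the timeout.
    """
    sessions = []
    last_seen = {}  # client_ip -> (timestamp of last request, index into sessions)
    for ip, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(ip)
        if previous is None or ts - previous[0] > timeout_seconds:
            sessions.append((ip, [url]))        # start a new session
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index][1].append(url)      # continue the current session
        last_seen[ip] = (ts, index)
    return sessions

# Hypothetical requests: (client IP, seconds since some start time, URL).
requests = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/a"),
    ("10.0.0.2", 30, "/"),
    ("10.0.0.1", 4000, "/b"),
]
for ip, urls in sessionize(requests):
    print(ip, urls)
```

Pages served from a browser or proxy cache never reach the server log, so even a well-chosen timeout cannot recover the exact sequence of pages a user saw.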

Systems Issues
• Web data sets can be very large
– Tens to hundreds of terabytes
• Cannot mine on a single server!
– Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
– Without breaking the bank!

Ontology Learning
[Figure: an is-a hierarchy under "root" with the concepts furnishing, event, area (region, ..., city) and accomodation (hotel, ..., youth hostel), with wellness hotel below hotel]

Association Rule Mining
Derived concept pairs:
(wellness hotel, area)
(hotel, area)
(accomodation, area)

Generalized Conceptual Relation:
hasLocation(accomodation,area)

[Mädche, Staab: ECAI 2000]

Semantic Web Structure/Content Mining

Ontology
[Figure: concepts Organization, Hotel, GolfCourse; relations cooperatesWith, belongsTo; attribute name]
FORALL X, Y
Y: Hotel[cooperatesWith ->> X] <- X:ProjectHotel[cooperatesWith ->> Y].

Knowledge base
Hotel: Wellnesshotel
GolfCourse: Seaview
belongsTo(Seaview, Wellnesshotel)
...

ILP Based Association Rule Mining, e.g. [Dehaspe, Toivonen, J. DMKD 1998]
Hotel(x), GolfCourse(y), belongsTo(y,x) → hasStars(x,5)
support = 0.4 %
confidence = 89 %

Complex Data Types Summary
• Emerging areas of mining complex data types:
• Text mining can be done quite effectively, especially if the documents are semi-structured
• Web mining is more difficult due to lack of such structure
• Data includes text documents, hypertext documents, link structure, and logs
• Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification
