Business Intelligence & Data Mining-11


Text Mining

Motivation for Text Mining
• Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation).
• Information-intensive business processes demand that we transcend from data mining to "knowledge" discovery in text.

[Chart: structured numerical or coded information (10%) vs. unstructured or semi-structured information (90%)]

Text Databases
• Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases
• Properties:
  - Unstructured in general (some semi-structured, e.g., XML)
  - Semantics, not only syntax, is important
  - Non-numeric in nature

Challenges of Text Mining
• Very high number of possible "dimensions"
  - All possible word and phrase types in the language!
• Unlike data mining:
  - records (= documents) are not structurally identical
  - records are not statistically independent
• Complex and subtle relationships between concepts in text
  - "AOL merges with Time-Warner"
  - "Time-Warner is bought by AOL"
• Ambiguity and context sensitivity
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)

Technological Advances in Text Mining
• Advances in text-processing technology
  - Natural Language Processing (NLP)
  - New algorithms
• Cheap hardware!
  - CPU
  - Disk
  - Network

Data Mining vs. Text Mining: "Search" versus "Discover"

                            Search (goal-oriented)    Discover
  Structured data           Data retrieval            Data mining
  Unstructured data (text)  Document retrieval        Text mining

Text Databases and Information Retrieval

• Information / document retrieval
  - Traditional study of how to retrieve information from text documents
  - Information is organized into (a large number of) documents
  - Information retrieval problem: locating relevant documents based on user input, such as keywords, queries, or example documents

Text Databases and Information Retrieval

• Typical IR systems
  - Online library catalogs
  - Online document management systems
• Information retrieval vs. database systems
  - Some DB problems are not present in IR, e.g., updates, transaction management, complex objects
  - Some IR problems are not addressed well in DBMSs, e.g., unstructured documents, approximate search using keywords and relevance

Example of a Text Mining Task (Swanson and Smalheiser, 1997)

• Extract pieces of evidence from article titles in the biomedical literature:
  - "stress is associated with migraines"
  - "stress can lead to loss of magnesium"
  - "calcium channel-blockers prevent some migraines"
  - "magnesium is a natural calcium channel-blocker"
• Induce a new hypothesis, not present in the literature, by combining the culled text fragments with human medical expertise:
  - Magnesium deficiency may play a role in some kinds of migraine headaches

What is Text Mining?

• Text mining is the process of synthesizing information by analyzing the relations, patterns, and rules within textual data: semi-structured or unstructured text.
• The "tool box":
  - Data mining algorithms
  - Machine learning techniques
  - Document / information retrieval concepts
  - Statistical techniques
  - Natural-language processing

Potential Applications
• Customer comment analysis
• Trend analysis
• Information filtering and routing
• Event tracking
• News story classification
• Web search
• Sentiment analysis
• …

Basic Measures: Precision & Recall

• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):

    precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:

    recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

Precision Decreases as Recall Increases

[Figure: precision-recall trade-off curves for SVM, Decision Tree, SOM, and Logistic Regression classifiers, after Dumais (1998)]

F-measure

• F-measure = weighted harmonic mean of precision and recall (at some fixed operating threshold for the classifier):

    F1 = 1 / (0.5/P + 0.5/R) = 2PR / (P + R)

• Useful as a simple single summary measure of performance
• Sometimes the "breakeven" F1 is used, i.e., F1 measured at the threshold where P = R
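
As a concrete illustration, here is a minimal Python sketch of the three measures; the function name and the sets of document IDs are made up, not from the slides:

```python
# A minimal sketch of the precision / recall / F1 definitions above.

def precision_recall_f1(relevant, retrieved):
    """Compute precision, recall, and F1 from sets of document IDs."""
    hits = len(relevant & retrieved)              # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# 3 of the 4 retrieved documents are relevant, out of 5 relevant overall:
p, r, f1 = precision_recall_f1(relevant={1, 2, 3, 4, 5},
                               retrieved={2, 3, 5, 9})
print(p, r, f1)   # 0.75  0.6  0.666...
```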

Challenges in text mining
• Semantics:
  - Synonymy: the keyword T does not appear anywhere in document d, even though d is closely related to T (i.e., synonyms of T have been used in d)
  - Polysemy: the same keyword may mean different things in different contexts, e.g., green (the colour) vs. green initiatives

Challenges in text mining
• Data pre-processing:
  - Stop list: a set of words deemed "irrelevant", even though they may appear frequently (see the sketch after this list)
    - E.g., a, the, of, for, with, etc.
    - Stop lists may vary as the document set varies
  - Word stem: several words are small syntactic variants of each other, since they share a common word stem
    - E.g., drug, drugs, drugged
  - Phrases: sometimes it is better to view a group of words as a single unit (like a noun phrase)
    - E.g., data mining, decision support
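
A tiny sketch of the stop-list step; the stop list itself is an illustrative assumption (real stop lists are longer and, as noted above, collection-dependent):

```python
# Stop-list filtering: drop words deemed non-informative for retrieval.
stop_list = {"a", "the", "of", "for", "with", "and", "is", "more"}  # toy list

tokens = "the effective retrieval of text is difficult".split()
content_words = [t for t in tokens if t not in stop_list]
print(content_words)   # ['effective', 'retrieval', 'text', 'difficult']
```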

Feature Extraction

Task: extract a good subset of words / phrases to represent documents.

    Document collection → all unique words/phrases → Feature Extraction → all good words/phrases

Feature Extraction: Example

    "While more and more textual information is available online, effective retrieval is difficult without good indexing of text content."

    14 words: while-more-textual-information-available-online-effective-retrieval-difficult-without-good-indexing-text-content

    → Feature Extraction →

    5 words (with frequencies): text (2), information (1), online (1), retrieval (1), index (1)

Stemming
• We want to reduce all morphological variants of a word to a single index term
  - E.g., a document containing the words fish and fisher may not be retrieved by a query containing fishing (the word fishing is not explicitly contained in the document)
• Stemming: reduce words to their root form
  - E.g., fish becomes a new index term
• Porter stemming algorithm (1980)
  - Relies on a preconstructed suffix list with associated rules
    - E.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    - BINARIZATION => BINARIZE
  - Not always desirable: e.g., {university, universal} -> univers (in Porter's)
• WordNet: dictionary-based approach
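
For illustration, a short sketch of Porter stemming, assuming the NLTK library (not mentioned in the slides) is available:

```python
# Porter stemming via NLTK (assumed installed: pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fishing", "drugs", "drugged", "university", "universal"]:
    print(word, "->", stemmer.stem(word))
# "fishing" reduces to "fish"; "university" and "universal" both collapse
# to "univers", the over-stemming case noted above.
```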

Feature Extraction Pipeline

Training documents
    ↓
Identification of all unique words
    ↓
Removal of stop words
  - Non-informative words, e.g., {the, and, when, more}
    ↓
Word stemming
  - Removal of suffixes to generate word stems
  - Groups words, increasing relevance; e.g., {walker, walking} → walk
    ↓
Term weighting
  - Term frequency
  - Importance of the term in the document
    ↓
Document representation

Data Representation
• Document vector / frequency matrix / bag of words (BOW)
  - Each document is represented by a vector
  - Each dimension of the vector is associated with a word/term
  - For each document, the value of each dimension is the frequency of that word in the document (see the sketch below)
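
A minimal plain-Python sketch of the bag-of-words representation; the documents (and hence the vocabulary) are made up:

```python
# Build term-frequency vectors: one dimension per vocabulary term.
from collections import Counter

docs = ["data mining finds patterns in data",
        "text mining mines text"]                       # toy documents

tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})   # all unique terms

vectors = [[Counter(doc)[t] for t in vocab] for doc in tokenized]
for vec in vectors:
    print(dict(zip(vocab, vec)))
# "data" has value 2 in the first document because it occurs twice there.
```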

Term Frequency

• tf - term frequency weighting:

    w_ij = Freq_ij = the number of times the jth term occurs in document D_i

✗ Drawback: does not reflect the importance of the term for document discrimination.

• Ex.  D1 = ABRTSAQWAXAO,  D2 = RTABBAXAQSAK

          A  B  K  O  Q  R  S  T  W  X
    D1    4  1  0  1  1  1  1  1  1  1
    D2    4  2  1  0  1  1  1  1  0  1

Example of a document-term matrix

          database   SQL   index   regression   likelihood   linear
    d1       24       21     9          0            0           3
    d2       32       10     5          0            3           0
    d3       12       16     5          0            0           0
    d4        6        7     2          0            0           0
    d5       43       31    20          0            3           0
    d6        2        0     0         18            7          16
    d7        0        0     1         32           12           0
    d8        3        0     0         22            4           2
    d9        1        0     0         34           27          25
    d10       6        0     0         17            4          23

TF-IDF

• tf×idf - term frequency × inverse document frequency weighting:

    w_ij = Freq_ij * log(N / DocFreq_j)

    N = the number of documents in the training document collection
    DocFreq_j = the number of documents in the training collection where the jth term occurs

✓ Advantage: reflects an importance factor for document discrimination.

    Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection.

• Ex. (same D1 and D2 as above, log base 10):

          A   B   K    O    Q   R   S   T   W    X
    D1    0   0   0   0.3   0   0   0   0   0.3  0
    D2    0   0   0.3  0    0   0   0   0   0    0
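
A direct sketch of the weighting, applied to the D1/D2 counts from the term-frequency example; the slides' worked values use log base 10, so the code does too:

```python
# tf-idf: w_ij = Freq_ij * log10(N / DocFreq_j), on the D1/D2 counts above.
import math

terms = list("ABKOQRSTWX")
counts = [[4, 1, 0, 1, 1, 1, 1, 1, 1, 1],   # D1
          [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]]   # D2
N = len(counts)

doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

tfidf = [[row[j] * math.log10(N / doc_freq[j]) for j in range(len(terms))]
         for row in counts]
for row in tfidf:
    print([round(w, 1) for w in row])
# Terms appearing in both documents get weight 0 (log10(2/2) = 0); terms
# unique to one document (K, O, W) get 0.3 = log10(2), as in the table above.
```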

Term Entropy

    w_ij = log(Freq_ij + 1.0) * (1 + entropy(w_j))

where

    entropy(w_j) = (1 / log N) * Σ_{i=1..N} (Freq_ij / DocFreq_j) * log(Freq_ij / DocFreq_j)

is the average entropy of the jth term over the N documents; it evaluates to:
  - -1 if the word occurs once in every document
  - 0 if the word occurs once in only one document
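
A sketch of this weighting on a toy count matrix, following the formula above literally (the data and helper names are assumptions); the two columns hit exactly the two boundary cases just listed:

```python
# Entropy weighting: w_ij = log(Freq_ij + 1) * (1 + entropy(term j)).
import math

counts = [[1, 1],      # rows = documents; term 0 occurs once in every doc,
          [1, 0],      # term 1 occurs once in only one doc
          [1, 0]]
N = len(counts)

def entropy(j):
    df = sum(1 for row in counts if row[j] > 0)        # DocFreq_j
    s = sum((row[j] / df) * math.log(row[j] / df)
            for row in counts if row[j] > 0)
    return s / math.log(N)

print(entropy(0), entropy(1))                          # -1.0  0.0
weights = [[math.log(row[j] + 1.0) * (1 + entropy(j)) for j in (0, 1)]
           for row in counts]
print(weights)   # term 0's weights are damped to 0; term 1 keeps log(2)
```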

Similarity Measures

• For various tasks, we need a measure of similarity between documents
• Euclidean distance is a common measure
• Cosine similarity (the normalized dot product) is another measure used in text mining:

    sim(v1, v2) = (v1 · v2) / (|v1| |v2|)

  - Focuses on the co-occurrence of words
  - Corresponds to the cosine of the angle between the two vectors
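
A minimal sketch of cosine similarity over term-frequency vectors; the vectors are made up:

```python
# Cosine similarity: sim(v1, v2) = (v1 . v2) / (|v1| |v2|).
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine([3, 1, 0, 1], [3, 2, 1, 0]))   # heavy term overlap -> ~0.89
print(cosine([1, 0, 0, 0], [0, 0, 0, 1]))   # no co-occurring terms -> 0.0
```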

Text Mining Tasks

Text Categorization

• Task: assignment of one or more predefined categories (topics / themes) to a document.

Text Categorization: Architecture

    Training documents and a new document d → preprocessing (term weighting, feature selection) → classifier (trained on the predefined categories) → category(ies) assigned to d

Text Categorization Classifier Algorithms
§ SOM-based Classifier
§ k-Nearest Neighbor Classifier
§ Decision Tree Classifier
§ Logistic Regression
§ Neural Net Classifier
§ Bayesian Classifier
§ Support Vector Machines (SVM) Classifier
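
The slides do not tie this architecture to any library; below is a hedged sketch of the whole pipeline (weighting, feature selection, classifier) using scikit-learn with an SVM, one of the classifiers listed above. The training documents and categories are made up:

```python
# Text categorization: tf-idf weighting + a linear SVM classifier,
# using scikit-learn (an assumed dependency).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = ["the database uses SQL indexes",            # toy corpus
              "regression fits a linear likelihood model",
              "SQL queries hit the database index",
              "linear regression maximizes the likelihood"]
train_cats = ["databases", "statistics", "databases", "statistics"]

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(train_docs, train_cats)

# Assign a category to a new document d:
print(clf.predict(["an index speeds up SQL lookups"]))    # ['databases']
```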

Document Clustering
• Task: group all documents so that documents in the same group are more similar to each other than to documents in other groups.
• Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.

Document Clustering Algorithms
• k-means (see the sketch below)
• Hierarchical Agglomerative Clustering (HAC)
• Association Rule Hypergraph Partitioning (ARHP)
• SOM / ESOM based clustering
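
A small k-means sketch, again assuming scikit-learn and toy documents:

```python
# k-means clustering of tf-idf document vectors.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["SQL database index", "database SQL query",
        "linear regression likelihood", "regression likelihood model"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # e.g. [0 0 1 1]: same-topic documents share a cluster
```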

Latent Semantic Indexing
• Weaknesses of keyword-based techniques:
  - Lack of semantics
  - Cannot identify similar words/concepts without help
• Observations:
  - Words/phrases that represent similar concepts are usually grouped together
  - The most important unit of information in a document may not be the word, but the concept

Latent Semantic Indexing

• Latent Semantic Indexing (LSI) is an attempt to capture this concept-level information
• Start with the term-frequency matrix M
• Apply singular value decomposition to M:

    M = U * S * V^T

  - S = a diagonal matrix of singular values
  - U = a matrix whose entries relate documents to the latent concepts
  - V = a matrix whose entries relate terms to the latent concepts
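
A minimal LSI sketch with NumPy's SVD (an assumed tool, not named in the slides), run on a few rows of the document-term matrix shown earlier:

```python
# LSI: decompose the term-frequency matrix M and keep the top-k concepts.
import numpy as np

# Rows = documents, columns = terms (database, SQL, index, regression,
# likelihood, linear), taken from the example matrix above (d1, d2, d6, d7).
M = np.array([[24, 21, 9, 0, 0, 3],
              [32, 10, 5, 0, 3, 0],
              [2, 0, 0, 18, 7, 16],
              [0, 0, 1, 32, 12, 0]], dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                                  # number of latent concepts to keep
doc_coords = U[:, :k] * S[:k]          # documents in concept space
print(doc_coords)
# The two "database" documents land close together, as do the two
# "regression" documents, even where they share few exact terms.
```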
