Approximately 90% of the world’s data is held in
unstructured formats (source: Oracle Corporation)
Information-intensive business processes demand that we move beyond data mining to "knowledge" discovery in text.
[Figure: pie chart showing 10% structured numerical or coded information vs. 90% unstructured or semi-structured information]
Text Databases
• Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, and library databases
Properties
• Unstructured in general (some semi-structured, e.g., XML)
• Semantics, not only syntax, is important
• Non-numeric in nature
Challenges of Text Mining
• Very high number of possible "dimensions"
  • All possible word and phrase types in the language!
• Unlike data mining:
  • records (= documents) are not structurally identical
  • records are not statistically independent
• Complex and subtle relationships between concepts in text
  • "AOL merges with Time-Warner"
  • "Time-Warner is bought by AOL"
• Ambiguity and context sensitivity
  • automobile = car = vehicle = Toyota
  • Apple (the company) or apple (the fruit)
Technological Advances in Text Mining
• Advances in text processing technology
  • Natural Language Processing (NLP)
  • New algorithms
  • Cheap hardware!
    • CPU
    • Disk
    • Network
Data Mining vs. Text Mining: "Search" versus "Discover"

                              Search (goal-oriented)    Discover
  Structured data             Data retrieval            Data mining
  Unstructured data (text)    Document retrieval        Text mining
Text Databases and Information Retrieval
• Information / document retrieval
  • Traditional study of how to retrieve information from text documents
  • Information is organized into (a large number of) documents
  • Information retrieval problem: locating relevant documents based on user input, such as keywords, queries, or example documents
Text Databases and Information Retrieval
• Typical IR systems
  • Online library catalogs
  • Online document management systems
• Information retrieval vs. database systems
  • Some DB problems are not present in IR, e.g., updates, transaction management, complex objects
  • Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
Example of a Text Mining Task (Swanson and Smalheiser, 1997)
• Extract pieces of evidence from article titles in the biomedical literature:
  • "stress is associated with migraines"
  • "stress can lead to loss of magnesium"
  • "calcium channel-blockers prevent some migraines"
  • "magnesium is a natural calcium channel-blocker"
• Induce a new hypothesis not in the literature by combining the culled text fragments with human medical expertise:
  • Magnesium deficiency may play a role in some kinds of migraine headaches
What is Text Mining?
• Text mining is the process of synthesizing information by analyzing the relations, patterns, and rules within textual data: semi-structured or unstructured text.
• The "tool box":
  • Data mining algorithms
  • Machine learning techniques
  • Document / information retrieval concepts
  • Statistical techniques
  • Natural-language processing
• F-measure = weighted harmonic mean of precision and recall (at some fixed operating threshold for the classifier):

  F1 = 1 / (0.5/P + 0.5/R) = 2PR / (P + R)

• Useful as a simple single summary measure of performance
• Sometimes the "breakeven" F1 is used, i.e., F1 measured when P = R
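As a quick illustration, here is a minimal Python sketch of the F1 computation above; the counts (true positives, false positives, false negatives) are made-up example values:

```python
# Minimal sketch: precision, recall, and F1 from raw classifier counts.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 80 true positives, 20 false positives, 40 false negatives.
# P = 0.8, R = 2/3, so F1 = 2*0.8*(2/3) / (0.8 + 2/3) ~= 0.727
print(f1_score(80, 20, 40))
```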
Challenges in text mining
• Semantics:
  • Synonymy: the keyword T does not appear anywhere in document d, even though d is closely related to T (i.e., synonyms of T have been used in d)
  • Polysemy: the same keyword may mean different things in different contexts, e.g., green (colour) vs. green initiatives
Challenges in text mining
• Data pre-processing
  • Stop list: set of words that are deemed "irrelevant", even though they may appear frequently
    • e.g., a, the, of, for, with, etc.
    • Stop lists may vary when the document set varies
  • Word stems: several words are small syntactic variants of each other since they share a common word stem
    • e.g., drug, drugs, drugged
  • Phrases: sometimes it is better to view a group of words as a single unit (like a noun phrase)
    • e.g., data mining, decision support
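A minimal Python sketch of stop-list pre-processing follows; the stop list here is a toy example, since real systems use larger, collection-specific lists:

```python
# Tokenize, lowercase, strip punctuation, and drop stop-list words.

STOP_LIST = {"a", "the", "of", "for", "with", "and", "is", "to"}

def preprocess(text: str) -> list[str]:
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOP_LIST]

print(preprocess("The effects of the drug, and stress, for patients"))
# -> ['effects', 'drug', 'stress', 'patients']
```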
Feature Extraction
• Task: extract a good subset of words / phrases to represent documents

  Document collection → all unique words/phrases → [Feature Extraction] → all good words/phrases
Feature Extraction: Example

  "While more and more textual information is available online, effective retrieval is difficult without good indexing of text content."

  14 words: while-more-textual-information-available-online-effective-retrieval-difficult-without-good-indexing-text-content

  → Feature Extraction →

  5 words: text-information-online-retrieval-index (with frequencies 2, 1, 1, 1, 1)
Stemming
• Want to reduce all morphological variants of a word to a single index term
  • e.g., a document containing the words fish and fisher may not be retrieved by a query containing fishing (the word fishing is not explicitly contained in the document)
• Stemming: reduce words to their root form
  • e.g., fish becomes the new index term
• Porter stemming algorithm (1980)
  • relies on a preconstructed suffix list with associated rules
    • e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    • BINARIZATION => BINARIZE
  • Not always desirable: e.g., {university, universal} -> univers (in Porter's)
• WordNet: dictionary-based approach
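The sketch below implements only the single IZATION -> IZE rule quoted above, to show the flavor of suffix-rule stemming; the real Porter algorithm applies many such rules in ordered steps (NLTK's nltk.stem.PorterStemmer is a full implementation):

```python
import re

def stem_ization(word: str) -> str:
    """Apply one Porter-style rule: IZATION -> IZE, if the prefix
    contains at least one vowel followed by a consonant."""
    word = word.lower()
    if word.endswith("ization"):
        prefix = word[: -len("ization")]
        if re.search(r"[aeiou][^aeiou]", prefix):
            return prefix + "ize"
    return word

print(stem_ization("BINARIZATION"))  # -> binarize
```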
Feature Extraction

Training documents
  ↓
Identification of all unique words
  ↓
Removal of stop words
  • non-informative words, e.g., {the, and, when, more}
  ↓
Word stemming
  • removal of suffixes to generate word stems
  • groups words, increasing relevance, e.g., {walker, walking} → walk
  ↓
Term weighting
  • term frequency; importance of the term in the document
  ↓
Document representation
Data Representation
• Document vector / frequency matrix / bag of words (BOW)
  • Each document is represented by a vector
  • Each dimension of the vector is associated with a word/term
  • For each document, the value of each dimension is the frequency of that word in the document
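A minimal bag-of-words sketch using scikit-learn's CountVectorizer (the library choice is ours, not the slides'); each row of the resulting matrix is a document vector and each column a term:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "text information online retrieval",
    "text retrieval index text",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # vocabulary (column labels)
print(X.toarray())                          # term frequencies per document
```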
Term Frequency
• tf - term frequency weighting:

  w_ij = Freq_ij = the number of times the jth term occurs in document D_i

× Drawback: does not reflect the importance of the term for document discrimination.
• Ex.: D1 = ABRTSAQWAXAO, D2 = RTABBAXAQSAK

        A  B  K  O  Q  R  S  T  W  X
  D1    4  1  0  1  1  1  1  1  1  1
  D2    4  2  1  0  1  1  1  1  0  1
Example of a document-term matrix

        database  SQL  index  regression  likelihood  linear
  d1        24     21      9           0           0       3
  d2        32     10      5           0           3       0
  d3        12     16      5           0           0       0
  d4         6      7      2           0           0       0
  d5        43     31     20           0           3       0
  d6         2      0      0          18           7      16
  d7         0      0      1          32          12       0
  d8         3      0      0          22           4       2
  d9         1      0      0          34          27      25
  d10        6      0      0          17           4      23
TF-IDF
• tf × idf - inverse document frequency weighting:

  w_ij = Freq_ij × log(N / DocFreq_j)

  N = the number of documents in the training document collection
  DocFreq_j = the number of documents in the training collection in which the jth term occurs

✓ Advantage: reflects the importance of a term for document discrimination.
  Assumption: terms with a low DocFreq are better discriminators than terms with a high DocFreq in the document collection.
• Ex. (term frequencies from the example above; N = 2, log base 10):

        A  B    K    O  Q  R  S  T    W  X
  D1    0  0    0  0.3  0  0  0  0  0.3  0
  D2    0  0  0.3    0  0  0  0  0    0  0
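A sketch of this exact weighting scheme in Python (note that library implementations such as scikit-learn's TfidfVectorizer use slightly different smoothing and normalization); it assumes every term occurs in at least one document:

```python
import math

def tfidf(freq_matrix: list[list[int]]) -> list[list[float]]:
    """w_ij = Freq_ij * log10(N / DocFreq_j), per the slide's formula."""
    n_docs = len(freq_matrix)
    n_terms = len(freq_matrix[0])
    # DocFreq_j: number of documents in which term j occurs.
    doc_freq = [sum(1 for doc in freq_matrix if doc[j] > 0)
                for j in range(n_terms)]
    return [[round(doc[j] * math.log10(n_docs / doc_freq[j]), 2)
             for j in range(n_terms)] for doc in freq_matrix]

# Term frequencies for D1 and D2 from the example above (terms A..X).
tf = [[4, 1, 0, 1, 1, 1, 1, 1, 1, 1],
      [4, 2, 1, 0, 1, 1, 1, 1, 0, 1]]
print(tfidf(tf))  # only K, O, W (each in a single document) get ~0.3
```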
Term Entropy

  w_ij = log(Freq_ij + 1.0) × (1 + entropy(t_j))

where entropy(t_j) is the average entropy of the jth term over the collection,

  entropy(t_j) = (1 / log n) × Σ_i p_ij log p_ij,  with p_ij = Freq_ij / Σ_k Freq_kj

It evaluates to:
  −1: if the term occurs once in every document
  0: if the term occurs once in only one document
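A Python sketch of this log-entropy weighting, assuming more than one document and that every term occurs somewhere in the collection:

```python
import math

def log_entropy(freq_matrix: list[list[int]]) -> list[list[float]]:
    """w_ij = log(Freq_ij + 1) * (1 + entropy_j); entropy_j ranges from
    -1 (term spread evenly over all docs) to 0 (term in one doc only)."""
    n_docs = len(freq_matrix)
    n_terms = len(freq_matrix[0])
    weights = [[0.0] * n_terms for _ in range(n_docs)]
    for j in range(n_terms):
        global_freq = sum(doc[j] for doc in freq_matrix)
        entropy = sum(
            (doc[j] / global_freq) * math.log(doc[j] / global_freq)
            for doc in freq_matrix if doc[j] > 0
        ) / math.log(n_docs)
        for i, doc in enumerate(freq_matrix):
            weights[i][j] = math.log(doc[j] + 1.0) * (1 + entropy)
    return weights
```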
Similarity Measures
• For various tasks, we need a measure of similarity between documents
• Euclidean distance is a common measure
• Cosine similarity, the normalized dot product, is another measure used in text mining:

  sim(v1, v2) = (v1 · v2) / (|v1| |v2|)

  • Focuses on the co-occurrence of words
  • Corresponds to the cosine of the angle between the two vectors
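A minimal sketch of the formula above over toy term-frequency vectors:

```python
import numpy as np

def cosine_sim(v1: np.ndarray, v2: np.ndarray) -> float:
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

d1 = np.array([3, 1, 0, 1])  # toy term-frequency vectors
d2 = np.array([3, 2, 1, 0])
print(cosine_sim(d1, d2))    # 1.0 = same direction, 0.0 = no shared terms
```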
Text Mining Tasks

Text Categorization
• Task: assignment of one or more predefined categories (topics / themes) to a document.

Text Categorization: Architecture
[Diagram: training documents feed the classifier-learning process]
Text Categorization Classifier Algorithms
§ SOM-based Classifier
§ k-Nearest Neighbor Classifier
§ Decision Tree Classifier
§ Logistic Regression
§ Neural Net Classifier
§ Bayesian Classifier
§ Support Vector Machines (SVM) Classifier
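As a sketch of one algorithm from the list above, here is a Bayesian classifier (scikit-learn's MultinomialNB) trained on bag-of-words features; the training texts and labels are toy examples:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["database query index SQL",
               "regression likelihood linear model",
               "SQL database transaction",
               "linear regression fit"]
train_labels = ["databases", "statistics", "databases", "statistics"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

X_new = vectorizer.transform(["index a database with SQL"])
print(clf.predict(X_new))  # expected: ['databases']
```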
Document Clustering
• Task: group all documents so that documents in the same group are more similar to each other than to documents in other groups.
• Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.

Document Clustering Algorithms
• k-means
• Hierarchical Agglomerative Clustering (HAC)
• Association Rule Hypergraph Partitioning (ARHP)
• SOM / ESOM based clustering
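A minimal sketch of the first algorithm in the list, k-means over tf-idf vectors with scikit-learn; the documents and k = 2 are toy choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["database SQL index",
        "SQL database query",
        "regression likelihood linear",
        "linear regression model"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g., [0 0 1 1]: database docs vs. statistics docs
```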
Latent Semantic Indexing
• Weaknesses of keyword-based techniques
  • Lack of semantics
  • Cannot identify similar words/concepts without help
• Observation
  • Words/phrases that represent similar concepts are usually grouped together
  • The most important unit of information in a document may not be the word, but the concept
Latent Semantic Indexing
• Latent Semantic Indexing is an attempt to produce such information
• Start with the term-frequency matrix M
• Apply singular value decomposition to M:
  • M = U * S * V^T
  • S = a diagonal matrix of singular values
  • U = a matrix whose entries capture similarities between documents
  • V = a matrix whose entries capture similarities between terms
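A minimal LSI sketch using numpy's SVD on a few rows of the document-term matrix example above; keeping the top k singular values projects documents into a k-dimensional "concept" space:

```python
import numpy as np

# Rows d1, d2, d3, d6, d7 of the earlier document-term matrix
# (terms: database, SQL, index, regression, likelihood, linear).
M = np.array([[24, 21, 9,  0,  0,  3],
              [32, 10, 5,  0,  3,  0],
              [12, 16, 5,  0,  0,  0],
              [ 2,  0, 0, 18,  7, 16],
              [ 0,  0, 1, 32, 12,  0]], dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)  # M = U @ diag(S) @ Vt
k = 2                            # number of latent "concepts" to keep
doc_coords = U[:, :k] * S[:k]    # document coordinates in concept space
print(doc_coords.round(2))       # database-like vs. statistics-like docs separate
```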