An Overview of Text Mining
Rebecca Hwa
4/25/2002
References
M. Hearst, “Untangling Text Data Mining,” in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
E. Riloff and R. Jones, “Learning Dictionaries for Information Extraction Using Multi-level Bootstrapping,” in Proceedings of AAAI-99, 1999.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell, “Text Classification from Labeled and Unlabeled Documents using EM,” in Machine Learning, 2000.
M. Grobelnik, D. Mladenic, and N. Milic-Frayling, “Text Mining as Integration of Several Related Research Areas: Report on KDD’2000 Workshop on Text Mining,” 2000.

What Is Text Mining?
“The objective of Text Mining is to exploit information
contained in textual documents in various ways,
including …discovery of patterns and trends in data,
associations among entities, predictive rules, etc.”
(Grobelnik et al., 2001)
“Another way to view text data mining is as a process of
exploratory data analysis that leads to heretofore
unknown information, or to answers for questions for
which the answer is not currently known.” (Hearst,
1999)

Text Mining
• How does it relate to data mining in general?
• How does it relate to computational linguistics?
• How does it relate to information retrieval?
Finding Patterns
• Non-textual data: general data mining
• Textual data: computational linguistics

Finding “Nuggets”
• Novel: exploratory data analysis
• Non-novel: database queries (non-textual data), information retrieval (textual data)

Challenges in Text Mining
• Data collection is “free text”
– Data is not well-organized
• Semi-structured or unstructured

– Natural language text contains ambiguities on
many levels
• Lexical, syntactic, semantic, and pragmatic

– Learning techniques for processing text
typically need annotated training examples
• Consider bootstrapping techniques

Text Mining Tasks
• Exploratory Data Analysis
– Using text to form hypotheses about diseases (Swanson
and Smalheiser, 1997).

• Information Extraction
– (Semi)automatically create (domain specific)
knowledge bases, and then use standard data-mining
techniques.
• Bootstrapping methods (Riloff and Jones, 1999).

• Text Classification
– Useful intermediary step for information extraction
• Bootstrapping method using EM (Nigam et al., 2000).

Biomedical Data Exploration
(Swanson and Smalheiser, 1997)
• Extract pieces of evidence from article titles in the
biomedical literature
– “stress is associated with migraines”
– “stress can lead to loss of magnesium”
– “calcium channel blockers prevent some migraines”
– “magnesium is a natural calcium channel blocker”

• Induce a new hypothesis not in the literature by
combining culled text fragments with human
medical expertise
• Magnesium deficiency may play a role in some kinds of
migraine headache
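
The linking step above can be illustrated with a minimal sketch. It assumes the four title fragments have already been reduced by hand to (term, relation, term) triples; the triple format, the relation names, and the function `candidate_links` are hypothetical and are not taken from Swanson and Smalheiser. The sketch only surfaces intermediate terms that connect two concepts; judging whether a link is medically meaningful remains the expert's job.

```python
from collections import defaultdict

# Hypothetical triples, hand-coded from the four title fragments above.
# Automatic extraction from titles is not shown.
FACTS = [
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("calcium channel blocker", "prevents", "migraine"),
    ("magnesium", "is_a", "calcium channel blocker"),
]

def candidate_links(facts, start, goal):
    """Return intermediate terms B such that start-B and B-goal both occur."""
    neighbors = defaultdict(set)
    for a, _, b in facts:
        # Treat every relation as an undirected association (an assumption).
        neighbors[a].add(b)
        neighbors[b].add(a)
    shared = neighbors[start] & neighbors[goal]
    return sorted(shared - {start, goal})

bridges = candidate_links(FACTS, "magnesium", "migraine")
print(bridges)
```

With only four facts the search is trivial; with a full literature database the number of such paths grows very quickly, which is the combinatorial problem raised on the next slide.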

Challenges in Data Exploration
• How can valid inference links be found
without succumbing to combinatorial
explosion of possibilities?
– Need better models of lexical relationships and
semantic constraints (very hard)

• How should the information be presented to
the human experts to facilitate their
exploration?

Information Extraction (IE)
• Extract domain-specific information from natural
language text
– Need a dictionary of extraction patterns (e.g.,
“traveled to <x>” or “presidents of <x>”)
• Constructed by hand
• Automatically learned from hand-annotated training data

– Need a semantic lexicon (dictionary of words with
semantic category labels)
• Typically constructed by hand
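
As a rough illustration of what a dictionary of extraction patterns might look like, the sketch below encodes the two example patterns as regular expressions. The category labels, the capitalized-phrase heuristic for the slot filler <x>, and the sample sentence are all assumptions made for the example; real IE systems typically rely on syntactic analysis rather than plain regular expressions.

```python
import re

# Hypothetical extraction patterns in the spirit of "traveled to <x>".
# Each captures a run of capitalized words as the slot filler <x>.
FILLER = r"((?:[A-Z]\w+)(?: [A-Z]\w+)*)"
PATTERNS = {
    "location": re.compile(r"traveled to " + FILLER),
    "organization": re.compile(r"presidents? of " + FILLER),
}

def extract(text):
    """Return (category, filler) pairs for every pattern match in text."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((category, match.group(1)))
    return found

sample = "The delegation traveled to Costa Rica. She is president of Acme Corp."
results = extract(sample)
print(results)
```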

Challenges in IE
• Automatic learning methods are typically
supervised (i.e., need labeled examples)
• But annotating training data is a time-consuming and expensive task.
• Can we develop better unsupervised
algorithms?
• Can we make better use of a small set of
labeled examples?

Learning Dictionaries for IE via
Bootstrapping (Riloff and Jones, 1999)
• Simultaneously learn extraction patterns and
domain-specific semantic lexicons
• Input requires a small set of seed words (for the
semantic categories) and a large collection of text
• Mutual bootstrapping
– Learns extraction patterns from seed words
– Uses extraction patterns to identify new words to add to
the semantic categories
– Meta-bootstrapping to reduce noise
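
The mutual bootstrapping loop can be sketched as follows, under strong simplifying assumptions: the corpus is already reduced to (pattern, word) pairs, one pattern is accepted per iteration, and patterns are ranked by a simple heuristic (the fraction of a pattern's extractions already in the lexicon, times a log of their count). The scoring function, the data, and the name `mutual_bootstrap` are illustrative only; the paper's exact scoring and the meta-bootstrapping layer are not reproduced here.

```python
import math
from collections import defaultdict

def mutual_bootstrap(extractions, seeds, iterations=2):
    """extractions: iterable of (pattern, word) pairs found in a corpus."""
    by_pattern = defaultdict(set)
    for pattern, word in extractions:
        by_pattern[pattern].add(word)

    lexicon = set(seeds)
    chosen = []
    for _ in range(iterations):
        best, best_score = None, 0.0
        for pattern, words in by_pattern.items():
            if pattern in chosen:
                continue
            hits = len(words & lexicon)  # extractions already known
            if hits == 0:
                continue
            # Illustrative score: reliability times log of frequency.
            score = (hits / len(words)) * math.log2(hits + 1)
            if score > best_score:
                best, best_score = pattern, score
        if best is None:
            break
        chosen.append(best)
        # Every word the accepted pattern extracts joins the lexicon.
        lexicon |= by_pattern[best]
    return chosen, lexicon

# Hypothetical pre-extracted (pattern, word) pairs.
EXTRACTIONS = [
    ("traveled to <x>", "paris"), ("traveled to <x>", "boston"),
    ("traveled to <x>", "tokyo"),
    ("mayor of <x>", "boston"), ("mayor of <x>", "tokyo"),
    ("mayor of <x>", "osaka"),
    ("ate <x>", "lunch"),
]
patterns, lexicon = mutual_bootstrap(EXTRACTIONS, {"paris", "boston"})
print(patterns, sorted(lexicon))
```

Note how "tokyo", learned in the first round, is what lets "mayor of <x>" score well in the second. The same mechanism lets one bad extraction pull in further bad patterns, which is the noise problem meta-bootstrapping is meant to reduce.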

Text classification (TC)
• Tag a document as belonging to one of a set
of pre-defined classes
– “This does not lead to discovery of new
information…” (Hearst, 1999).
– Many practical uses
• Group documents into different domains (useful for
domain specific information extraction)
• Learn reading interests of users
• Automatically sort e-mail
• On-line New Event Detection

Challenges in TC
• Like IE, TC also needs many labeled examples
as training data
– After a user had labeled 1,000 Usenet news
articles, the system was right only ~50% of the
time at selecting articles interesting to the user.

• What other sources of information can
reduce the need for labeled examples?

TC from Labeled and Unlabeled Documents
using EM (Nigam et al., 2000)
• Expectation-Maximization
– Iterative algorithm for maximum likelihood estimation (MLE) in parametric estimation
problems with missing data (e.g., the labels of the examples)

• Nigam et al. combined the EM algorithm with a Naïve
Bayes classifier, using both labeled and unlabeled
data as input
– Dynamically adjust strength of unlabeled data’s
contribution to parameter estimation in EM
– Reduce the bias of naïve Bayes by modeling each class
with multiple mixture components

Probabilistic Framework for TC
• Assumption #1: Documents are produced by a mixture model
– Generate docs according to the probability distribution defined by
the model parameters $\theta$

• Assumption #2: Each class is modeled by one mixture
component: $C = \{c_1, \ldots, c_{|C|}\}$

Prob. of model generating doc $d_i$ is:

\[ P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \]

Naïve Bayes Model
• Assumes words in the document are generated
independently (no context)
• Assumes all texts have the same length

\[ P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \]

\[ P(d_i \mid c_j; \theta) = P(w_1, \ldots, w_{|d_i|} \mid c_j; \theta) = \prod_{k=1}^{|d_i|} P(w_k \mid c_j; \theta) \]

• Model parameters:

\[ \theta = \{\theta_{w|c}, \theta_c\} \]

Using a Trained Model
• What class should a new document d be
assigned to?
\[ P(\mathrm{Label}(d) = c \mid d; \theta) = \frac{P(c \mid \theta)\, P(d \mid c; \theta)}{P(d \mid \theta)} \]

• Pick the class with the highest probability
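
A small sketch of this decision rule, again with hypothetical toy parameters: compute the joint probability of the document with each class, normalize by their sum to obtain the posterior, and return the class with the largest posterior.

```python
import math

# Hypothetical toy parameters (class priors and per-class word probabilities).
CLASS_PRIOR = {"sports": 0.5, "politics": 0.5}
WORD_GIVEN_CLASS = {
    "sports": {"game": 0.6, "vote": 0.1, "team": 0.3},
    "politics": {"game": 0.1, "vote": 0.7, "team": 0.2},
}

def posterior(words):
    """Return P(Label(d) = c | d) for every class c, via Bayes' rule."""
    joint = {
        c: CLASS_PRIOR[c] * math.prod(WORD_GIVEN_CLASS[c][w] for w in words)
        for c in CLASS_PRIOR
    }
    evidence = sum(joint.values())  # P(d), the normalizing constant
    return {c: p / evidence for c, p in joint.items()}

def classify(words):
    """Pick the class with the highest posterior probability."""
    post = posterior(words)
    return max(post, key=post.get)

label = classify(["vote", "team"])
print(label, posterior(["vote", "team"]))
```

Because the denominator is the same for every class, the argmax could be taken over the joint probabilities directly; the normalization matters only when the posterior values themselves are needed, as in the E-step of EM.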

Parameter Estimation with
Labeled Documents
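• Estimating model parameters:

\[ \theta = \{\theta_{w|c}, \theta_c\} \]

\[ \theta_{w|c} = P(w \mid c; \theta) = \frac{\sum_{d \in D} \#(w \in d)\, \mathrm{Ind}(\mathrm{Label}(d) = c)}{\sum_{w' \in V} \sum_{d' \in D} \#(w' \in d')\, \mathrm{Ind}(\mathrm{Label}(d') = c)} \]

\[ \theta_c = P(c \mid \theta) = \frac{\sum_{d \in D} \mathrm{Ind}(\mathrm{Label}(d) = c)}{|D|} \]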
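
These estimates are relative frequencies, which the following sketch computes directly from counts. The function name `estimate` and the three toy documents are made up for illustration. The estimates are left unsmoothed to match the formulas above; in practice some smoothing (for example, add-one) is usually applied so that unseen words do not receive zero probability.

```python
from collections import Counter, defaultdict

def estimate(labeled_docs):
    """labeled_docs: list of (list_of_words, class_label) pairs."""
    word_counts = defaultdict(Counter)  # per-class word occurrence counts
    class_counts = Counter()            # number of documents per class
    for words, label in labeled_docs:
        word_counts[label].update(words)
        class_counts[label] += 1

    # theta_c: fraction of documents carrying each label.
    theta_c = {c: n / len(labeled_docs) for c, n in class_counts.items()}

    # theta_{w|c}: share of class c's word occurrences that are w.
    theta_wc = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        theta_wc[c] = {w: n / total for w, n in counts.items()}
    return theta_wc, theta_c

DOCS = [
    (["game", "team", "game"], "sports"),
    (["vote", "game"], "politics"),
    (["vote", "vote"], "politics"),
]
theta_wc, theta_c = estimate(DOCS)
print(theta_wc, theta_c)
```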

Parameter Estimation with
Unlabeled Documents
• EM: for “incomplete data” problems
• Maximize prob. of model generating observed data
• Build initial classifier (initialize the parameters to
“reasonable” starting values)
• Repeat until convergence
– E-Step: Use current classifier params, $\theta_t$, to estimate $P(c \mid d; \theta_t)$
for all $d$ in $D_u$
– M-Step: Re-estimate the classifier, $\theta_{t+1}$, using the expected
counts from the E-Step
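
The loop above might be sketched as follows. This is a simplified illustration rather than Nigam et al.'s implementation: it runs a fixed number of iterations instead of testing for convergence, uses add-one smoothing on word counts (an assumption, added so logarithms stay finite), and keeps hard 0/1 memberships for labeled documents. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def m_step(docs, resp, vocab, classes):
    """Re-estimate parameters from (possibly fractional) class memberships."""
    prior = {c: sum(r[c] for r in resp) / len(docs) for c in classes}
    word_prob = {}
    for c in classes:
        counts = defaultdict(float)
        for words, r in zip(docs, resp):
            for w in words:
                counts[w] += r[c]  # expected count of w in class c
        total = sum(counts.values()) + len(vocab)  # add-one smoothing (assumption)
        word_prob[c] = {w: (counts[w] + 1.0) / total for w in vocab}
    return prior, word_prob

def e_step(words, prior, word_prob, classes):
    """Return P(c | d) for one document, computed in log space."""
    log_joint = {
        c: math.log(prior[c]) + sum(math.log(word_prob[c][w]) for w in words)
        for c in classes
    }
    top = max(log_joint.values())
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

def em_naive_bayes(labeled, unlabeled, iterations=10):
    classes = sorted({label for _, label in labeled})
    docs_l = [words for words, _ in labeled]
    vocab = sorted({w for d in docs_l + unlabeled for w in d})
    # Labeled documents keep hard (0/1) memberships throughout.
    resp_l = [{c: float(c == label) for c in classes} for _, label in labeled]
    # Initial classifier: trained on the labeled documents only.
    prior, word_prob = m_step(docs_l, resp_l, vocab, classes)
    for _ in range(iterations):
        resp_u = [e_step(d, prior, word_prob, classes) for d in unlabeled]
        prior, word_prob = m_step(docs_l + unlabeled, resp_l + resp_u, vocab, classes)
    return prior, word_prob, classes

LABELED = [(["game", "team"], "sports"), (["vote", "law"], "politics")]
UNLABELED = [["game", "score"], ["team", "score"],
             ["vote", "senate"], ["law", "senate"]]
prior, word_prob, classes = em_naive_bayes(LABELED, UNLABELED)
print(e_step(["score"], prior, word_prob, classes))
```

In this toy run the words "score" and "senate" never appear in a labeled document, yet they become associated with a class because they co-occur with labeled words in the unlabeled documents.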

Augmented EM
• Weight the unlabeled data
– Otherwise, unlabeled data overwhelms the small
amount of labeled data
– Modify M-step to multiply expected counts with a
weight factor

• Relax the one-class, one-mixture-component
assumption
– Allow labeled data to fall into “topics” within a class
– Modify E-step to allow each labeled document to
probabilistically belong to sub-topics
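
The weighting idea can be illustrated with a short sketch of the count accumulation inside a modified M-step. The function name, the `weight` parameter, and the toy data are hypothetical; the sketch covers only the word counts for a single class and leaves out the multiple-component extension.

```python
from collections import defaultdict

def weighted_word_counts(labeled, unlabeled_with_resp, c, weight):
    """Expected word counts for class c, down-weighting unlabeled documents.

    labeled: list of (words, label) pairs.
    unlabeled_with_resp: list of (words, {class: P(class | doc)}) pairs,
        with the posteriors coming from the E-step.
    weight: factor in [0, 1] applied to counts from unlabeled documents.
    """
    counts = defaultdict(float)
    for words, label in labeled:
        if label == c:
            for w in words:
                counts[w] += 1.0  # labeled documents count in full
    for words, resp in unlabeled_with_resp:
        for w in words:
            counts[w] += weight * resp[c]  # scaled expected count
    return dict(counts)

labeled = [(["game"], "sports")]
unlabeled = [(["game", "score"], {"sports": 0.8, "politics": 0.2})]
counts = weighted_word_counts(labeled, unlabeled, "sports", weight=0.5)
print(counts)
```

Setting `weight` to 0 ignores the unlabeled data entirely, and setting it to 1 recovers the unweighted M-step; intermediate values limit how much a large unlabeled set can outweigh a few labeled documents.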
