
Effective Pattern Discovery
for Text Mining
Ning Zhong, Yuefeng Li, and Sheng-Tang Wu
Abstract—Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
Index Terms—Text mining, text classification, pattern mining, pattern evolving, information filtering.

1 INTRODUCTION

Due to the rapid growth of digital data made available in
recent years, knowledge discovery and data mining
have attracted a great deal of attention with an imminent
need for turning such data into useful information and
knowledge. Many applications, such as market analysis and
business management, can benefit from the use of the
information and knowledge extracted from a large amount
of data. Knowledge discovery can be viewed as the process
of nontrivial extraction of information from large databases,
information that is implicitly presented in the data,
previously unknown and potentially useful for users. Data
mining is therefore an essential step in the process of
knowledge discovery in databases.
In the past decade, a significant number of data mining
techniques have been presented in order to perform
different knowledge tasks. These techniques include association rule mining, frequent itemset mining, sequential
pattern mining, maximum pattern mining, and closed
pattern mining. Most of them are proposed for the purpose
of developing efficient mining algorithms to find particular
patterns within a reasonable and acceptable time frame.
With a large number of patterns generated by using data
mining approaches, how to effectively use and update these
patterns is still an open research issue. In this paper, we focus on the development of a knowledge discovery model to effectively use and update the discovered patterns and apply it to the field of text mining.

. N. Zhong is with the Department of Life Science and Informatics, Maebashi Institute of Technology, Maebashi 371-0816, Japan, and the International WIC Institute, Beijing University of Technology, Beijing 100124, P.R. China. E-mail: [email protected].
. Y. Li is with the Discipline of Computer Science, Queensland University of Technology, 126 Margaret St., Brisbane, QLD 4001, Australia. E-mail: [email protected].
. S.-T. Wu is with the Department of Applied Informatics and Multimedia, Asia University, 500 Liufeng Road, Wufeng, Taichung 41354, Taiwan. E-mail: [email protected].
Text mining is the discovery of interesting knowledge in
text documents. It is a challenging issue to find accurate
knowledge (or features) in text documents to help users to
find what they want. In the beginning, Information
Retrieval (IR) provided many term-based methods to solve
this challenge, such as Rocchio and probabilistic models [4],
rough set models [23], BM25 and support vector machine
(SVM) [34] based filtering models. The advantages of term-based methods include efficient computational performance as well as mature theories for term weighting, which have emerged over the last couple of decades from the IR and machine learning communities. However, term-based methods suffer from the problems of polysemy and synonymy, where polysemy means a word has multiple meanings, and synonymy means multiple words have the same meaning. The semantic meaning of many discovered terms is uncertain with respect to answering what users want.
Over the years, people have often held the hypothesis that phrase-based approaches could perform better than term-based ones, as phrases may carry more semantic information. This hypothesis has not fared too well in the history of IR [19], [40], [41]. Although phrases are less ambiguous and more discriminative than individual terms, the likely reasons for the discouraging performance include: 1) phrases have inferior statistical properties to terms, 2) they have a low frequency of occurrence, and 3) there are large numbers of redundant and noisy phrases among them [41].
In the presence of these setbacks, sequential patterns used in the data mining community have turned out to be a promising alternative to phrases [13], [50] because sequential patterns enjoy good statistical properties like terms. To overcome the disadvantages of phrase-based approaches, pattern mining-based approaches (or pattern taxonomy models (PTM) [50], [51]) have been proposed, which adopt the concept of closed sequential patterns and prune nonclosed patterns. These pattern mining-based approaches have shown improvements in effectiveness to a certain extent. However, the paradox is that people believe pattern-based approaches could be a significant alternative, yet so far less significant improvements in effectiveness have been made compared with term-based methods.
There are two fundamental issues regarding the effectiveness of pattern-based approaches: low frequency and misinterpretation. Given a specified topic, a highly frequent pattern (normally a short pattern with a large support) is usually a general pattern, while a specific pattern is usually of low frequency. If we decrease the minimum support, many noisy patterns are discovered. Misinterpretation means that the measures used in pattern mining (e.g., "support" and "confidence") turn out to be unsuitable for using discovered patterns to answer what users want. The difficult problem is hence how to use discovered patterns to accurately evaluate the weights of useful features (knowledge) in text documents.
Over the years, IR has developed many mature techniques which demonstrated that terms were important
features in text documents. However, many terms with
larger weights (e.g., the term frequency and inverse
document frequency (tf*idf) weighting scheme) are general
terms because they can be frequently used in both relevant
and irrelevant information. For example, the term "LIB" may have a larger weight than "JDK" in a certain data collection, but we believe that the term "JDK" is more specific than the term "LIB" for describing "Java Programming Language," and the term "LIB" is more general than "JDK" because "LIB" is also frequently used in C and C++. Therefore, it is not adequate to evaluate the weights of terms based only on their distributions in documents for a given topic, although this evaluation method has been frequently used in developing IR models.
In order to solve the above paradox, this paper presents an effective pattern discovery technique, which first calculates the specificities of discovered patterns and then evaluates term weights according to the distribution of terms in the discovered patterns, rather than the distribution in documents, thereby solving the misinterpretation problem. It also considers the influence of patterns from the negative training examples to find ambiguous (noisy) patterns and tries to reduce their influence to address the low-frequency problem. The process of updating ambiguous patterns can be referred to as pattern evolution. The proposed approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents.
We also conduct numerous experiments on the latest
data collection, Reuters Corpus Volume 1 (RCV1) and Text
Retrieval Conference (TREC) filtering topics, to evaluate the
proposed technique. The results show that the proposed
technique outperforms up-to-date data mining-based methods, concept-based models, and state-of-the-art term-based methods.
The rest of this paper is structured as follows: Section 2
discusses related work. Section 3 provides some definitions
about closed patterns, PTM and closed sequential patterns.
Sections 4 and 5 propose the techniques of pattern
deploying and inner pattern evolution (IPE) in PTM,
respectively. Section 6 presents the experimental setting and
results for evaluating the proposed approach. Finally,
Section 7 gives concluding remarks.

2 RELATED WORK

Many types of text representations have been proposed in
the past. A well-known one is the bag of words that uses
keywords (terms) as elements in the vector of the feature
space. In [21], the tf*idf weighting scheme is used for text
representation in Rocchio classifiers. In addition to TFIDF,
the global IDF and entropy weighting scheme is proposed
in [9] and improves performance by an average of
30 percent. Various weighting schemes for the bag of words
representation approach were given in [1], [14], [38]. The
problem of the bag of words approach is how to select a
limited number of features among an enormous set of
words or terms in order to increase the system’s efficiency
and avoid overfitting [41]. In order to reduce the number of
features, many dimensionality reduction approaches have been developed based on feature selection techniques,
such as Information Gain, Mutual Information, Chi-Square,
Odds ratio, and so on. Details of these selection functions
were stated in [19], [41].
The choice of a representation depends on what one
regards as the meaningful units of text and the meaningful
natural language rules for the combination of these units [41].
With respect to the representation of the content of documents, some research works have used phrases rather than
individual words. In [7], the combination of unigram and
bigrams was chosen for document indexing in text categorization (TC) and evaluated on a variety of feature evaluation
functions (FEF). A phrase-based text representation for Web
document management was also proposed in [44].
In [3], data mining techniques have been used for text
analysis by extracting cooccurring terms as descriptive
phrases from document collections. However, the effectiveness of the text mining systems using phrases as text
representation showed no significant improvement. The
likely reason was that a phrase-based method had “lower
consistency of assignment and lower document frequency
for terms” as mentioned in [18].
Term-based ontology mining methods have also offered some insights into text representation. For example, hierarchical clustering [28], [29] was used to determine synonymy and hyponymy relations between keywords. Also, the
pattern evolution technique was introduced in [25] in order to
improve the performance of term-based ontology mining.
Pattern mining has been extensively studied in data
mining communities for many years. A variety of efficient
algorithms such as Apriori-like algorithms [2], [31], [49],
PrefixSpan [32], [53], FP-tree [10], [11], SPADE [56], SLPMiner
[42], and GST [12] have been proposed. These research works
have mainly focused on developing efficient mining algorithms for discovering patterns from a large data collection.
However, searching for useful and interesting patterns and
rules remains an open problem [22], [24], [52]. In the field of
text mining, pattern mining techniques can be used to find
various text patterns, such as sequential patterns, frequent
itemsets, cooccurring terms and multiple grams, for building
up a representation with these new types of features.
Nevertheless, the challenging issue is how to effectively deal
with the large amount of discovered patterns.
For the challenging issue, closed sequential patterns have
been used for text mining in [51], which proposed that the
concept of closed patterns in text mining was useful and
had the potential for improving the performance of text
mining. Pattern taxonomy model was also developed in [50]


and [51] to improve the effectiveness by effectively using
closed patterns in text mining. In addition, a two-stage
model that used both term-based methods and pattern-based methods was introduced in [26] to significantly
improve the performance of information filtering.
Natural language processing (NLP) is a modern computational technology that can help people to understand the meaning of text documents. For a long time, NLP struggled to deal with the uncertainties of human language. Recently, a new concept-based model [45], [46] was presented to bridge the gap between NLP and text mining, which analyzed terms on the sentence and document levels. This model included three components. The first component analyzed the semantic structure of sentences; the second component constructed a conceptual ontological graph (COG) to describe the semantic structures; and the last
component extracted top concepts based on the first two
components to build feature vectors using the standard
vector space model. The advantage of the concept-based
model is that it can effectively discriminate between
nonimportant terms and meaningful terms which describe
a sentence meaning. Compared with the above methods,
the concept-based model usually relies upon its employed
NLP techniques.

3 PATTERN TAXONOMY MODEL

In this paper, we assume that all documents are split into paragraphs. So a given document $d$ yields a set of paragraphs $PS(d)$. Let $D$ be a training set of documents, which consists of a set of positive documents, $D^+$, and a set of negative documents, $D^-$. Let $T = \{t_1, t_2, \ldots, t_m\}$ be a set of terms (or keywords) which can be extracted from the set of positive documents, $D^+$.

3.1 Frequent and Closed Patterns
Given a termset $X$ in document $d$, $[X]$ is used to denote the covering set of $X$ for $d$, which includes all paragraphs $dp \in PS(d)$ such that $X \subseteq dp$, i.e., $[X] = \{dp \mid dp \in PS(d), X \subseteq dp\}$. Its absolute support is the number of occurrences of $X$ in $PS(d)$, that is, $sup_a(X) = |[X]|$. Its relative support is the fraction of the paragraphs that contain the pattern, that is, $sup_r(X) = |[X]| / |PS(d)|$. A termset $X$ is called a frequent pattern if its $sup_r$ (or $sup_a$) $\geq$ min_sup, a minimum support.

TABLE 1. A Set of Paragraphs
TABLE 2. Frequent Patterns and Covering Sets

Table 1 lists a set of paragraphs for a given document $d$, where $PS(d) = \{dp_1, dp_2, \ldots, dp_6\}$, and duplicate terms were removed. Let min_sup = 50%; we can obtain ten frequent patterns from Table 1 using the above definitions. Table 2 illustrates the ten frequent patterns and their covering sets.

Not all frequent patterns in Table 2 are useful. For example, pattern $\{t_3, t_4\}$ always occurs with term $t_6$ in paragraphs, i.e., the shorter pattern, $\{t_3, t_4\}$, is always a part of the larger pattern, $\{t_3, t_4, t_6\}$, in all of the paragraphs. Hence, we believe that the shorter one, $\{t_3, t_4\}$, is a noise pattern and expect to keep the larger pattern, $\{t_3, t_4, t_6\}$, only.

Given a termset $X$, its covering set $[X]$ is a subset of paragraphs. Similarly, given a set of paragraphs $Y \subseteq PS(d)$, we can define its termset, which satisfies

$$termset(Y) = \{t \mid \forall dp \in Y \Rightarrow t \in dp\}.$$

The closure of $X$ is defined as follows:

$$Cls(X) = termset([X]).$$

A pattern $X$ (also a termset) is called closed if and only if $X = Cls(X)$.

Let $X$ be a closed pattern. We can prove that

$$sup_a(X_1) < sup_a(X), \quad (1)$$

for all patterns $X_1 \supset X$; otherwise, if $sup_a(X_1) = sup_a(X)$, we have $[X_1] = [X]$, where $sup_a(X_1)$ and $sup_a(X)$ are the absolute supports of patterns $X_1$ and $X$, respectively. We also have

$$Cls(X) = termset([X]) = termset([X_1]) \supseteq X_1 \supset X,$$

that is, $Cls(X) \neq X$.
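To make these definitions concrete, the following Python sketch computes covering sets, supports, and closed patterns by brute force. The paragraph set is hypothetical (Table 1 itself is not reproduced in this text); it is merely chosen so that, with min_sup = 50%, it yields ten frequent patterns and the three closed patterns discussed later, matching the counts reported for the running example.

```python
from itertools import combinations

# A toy paragraph set PS(d); each paragraph is a set of terms.
# Illustrative only: chosen to be consistent with the counts in the text.
PS = [
    {"t1", "t2"},
    {"t3", "t4", "t6"},
    {"t3", "t4", "t5", "t6"},
    {"t3", "t4", "t6", "t7"},
    {"t1", "t2", "t6", "t7"},
    {"t1", "t2", "t6"},
]
MIN_SUP = 0.5  # relative minimum support

def covering_set(X, paragraphs):
    """[X]: all paragraphs dp with X a subset of dp."""
    return [dp for dp in paragraphs if X <= dp]

def sup_r(X, paragraphs):
    return len(covering_set(X, paragraphs)) / len(paragraphs)

# Enumerate candidate termsets and keep the frequent ones.
all_terms = sorted(set().union(*PS))
frequent = []
for size in range(1, len(all_terms) + 1):
    for combo in combinations(all_terms, size):
        X = frozenset(combo)
        if sup_r(X, PS) >= MIN_SUP:
            frequent.append(X)

def closure(X, paragraphs):
    """Cls(X) = termset([X]): terms shared by every paragraph covering X."""
    cover = covering_set(X, paragraphs)
    result = set(cover[0])
    for dp in cover[1:]:
        result &= dp
    return frozenset(result)

closed = [X for X in frequent if closure(X, PS) == X]
print("frequent:", [sorted(X) for X in frequent])  # ten frequent patterns
print("closed:", [sorted(X) for X in closed])      # {t6}, {t1,t2}, {t3,t4,t6}
```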

3.2 Pattern Taxonomy
Patterns can be structured into a taxonomy by using the
is-a (or subset) relation. Consider the example of Table 1, where we have illustrated a set of paragraphs of a document, together with the ten frequent patterns in Table 2 discovered assuming min_sup = 50%. There are, however, only three closed patterns in this example. They are $\langle t_3, t_4, t_6 \rangle$, $\langle t_1, t_2 \rangle$, and $\langle t_6 \rangle$.

Fig. 1. Pattern taxonomy.
Fig. 1 illustrates an example of the pattern taxonomy for the frequent patterns in Table 2, where the nodes represent frequent patterns and their covering sets, nonclosed patterns can be pruned, and the edges represent the "is-a" relation. After pruning, some direct "is-a" relations may change; for example, pattern $\{t_6\}$ would become a direct subpattern of $\{t_3, t_4, t_6\}$ after pruning nonclosed patterns.
Smaller patterns in the taxonomy, for example pattern $\{t_6\}$ (see Fig. 1), are usually more general because they could be used frequently in both positive and negative documents; and larger patterns, for example pattern $\{t_3, t_4, t_6\}$, are usually more specific since they may be used only in positive documents. This semantic information will be used in the pattern taxonomy to improve the performance of using closed patterns in text mining, which will be further discussed in the next section.


3.3 Closed Sequential Patterns
A sequential pattern $s = \langle t_1, \ldots, t_r \rangle$ ($t_i \in T$) is an ordered list of terms. A sequence $s_1 = \langle x_1, \ldots, x_i \rangle$ is a subsequence of another sequence $s_2 = \langle y_1, \ldots, y_j \rangle$, denoted by $s_1 \sqsubseteq s_2$, iff there exist $j_1, \ldots, j_i$ such that $1 \leq j_1 < j_2 < \cdots < j_i \leq j$ and $x_1 = y_{j_1}, x_2 = y_{j_2}, \ldots, x_i = y_{j_i}$. Given $s_1 \sqsubseteq s_2$, we usually say $s_1$ is a subpattern of $s_2$, and $s_2$ is a superpattern of $s_1$. In the following, we simply say patterns for sequential patterns.
Given a pattern (an ordered termset) $X$ in document $d$, $[X]$ is still used to denote the covering set of $X$, which includes all paragraphs $ps \in PS(d)$ such that $X \sqsubseteq ps$, i.e., $[X] = \{ps \mid ps \in PS(d), X \sqsubseteq ps\}$. Its absolute support is the number of occurrences of $X$ in $PS(d)$, that is, $sup_a(X) = |[X]|$. Its relative support is the fraction of the paragraphs that contain the pattern, that is, $sup_r(X) = |[X]| / |PS(d)|$.
A sequential pattern $X$ is called a frequent pattern if its relative support (or absolute support) $\geq$ min_sup, a minimum support. The property of closed patterns (see eq. (1)) can be used to define closed sequential patterns: a frequent sequential pattern $X$ is called closed if there exists no superpattern $X_1$ of $X$ such that $sup_a(X_1) = sup_a(X)$.
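A minimal Python sketch of the subsequence relation and the closed-pattern check just defined; the ordered paragraphs below are illustrative, chosen to be consistent with the supports of the running example (the real Table 1 is not reproduced here).

```python
def is_subsequence(s1, s2):
    """s1 is a subsequence of s2: its terms appear in s2 in the same order."""
    it = iter(s2)
    return all(term in it for term in s1)

def sup_a(pattern, paragraphs):
    """Absolute support: number of paragraphs containing the pattern."""
    return sum(1 for ps in paragraphs if is_subsequence(pattern, ps))

def closed_sequential(frequent_patterns, paragraphs):
    """Keep a frequent sequential pattern X only if no frequent superpattern
    of X has the same absolute support."""
    kept = []
    for x in frequent_patterns:
        sx = sup_a(x, paragraphs)
        has_equal_super = any(
            x != x1 and is_subsequence(x, x1) and sup_a(x1, paragraphs) == sx
            for x1 in frequent_patterns
        )
        if not has_equal_super:
            kept.append(x)
    return kept

# Illustrative paragraphs (ordered term lists) and candidate patterns.
paragraphs = [
    ["t1", "t2"], ["t3", "t4", "t6"], ["t3", "t4", "t5", "t6"],
    ["t3", "t4", "t6", "t7"], ["t1", "t2", "t6", "t7"], ["t1", "t2", "t6"],
]
candidates = [("t3", "t4"), ("t3", "t4", "t6"), ("t1", "t2"), ("t6",)]
print(closed_sequential(candidates, paragraphs))
# [('t3', 't4', 't6'), ('t1', 't2'), ('t6',)] -- ('t3', 't4') is pruned
# because its superpattern ('t3', 't4', 't6') has the same support.
```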

4 PATTERN DEPLOYING METHOD

In order to use the semantic information in the pattern taxonomy to improve the performance of closed patterns in text mining, we need to interpret discovered patterns by summarizing them as d-patterns (see the definition below) in order to accurately evaluate term weights (supports). The rationale behind this motivation is that d-patterns include more semantic meaning than terms that are selected based on a term-based technique (e.g., tf*idf). As a result, a term with a higher tf*idf value could be meaningless if it is not cited by some d-patterns (i.e., some important parts of documents). The evaluation of term weights (supports) differs from that of the normal term-based approaches. In term-based approaches, the evaluation of term weights is based on the distribution of terms in documents. In this research, terms are weighted according to their appearances in discovered closed patterns.
4.1 Representations of Closed Patterns
It is complicated to derive a method to apply discovered
patterns in text documents for information filtering systems. To simplify this process, we first review the
composition operation $\oplus$ defined in [25].
Let $p_1$ and $p_2$ be sets of term-number pairs. $p_1 \oplus p_2$ is called the composition of $p_1$ and $p_2$, which satisfies

$$p_1 \oplus p_2 = \{(t, x_1 + x_2) \mid (t, x_1) \in p_1, (t, x_2) \in p_2\} \cup \{(t, x) \mid (t, x) \in p_1 \cup p_2, \neg((t, \_) \in p_1 \cap p_2)\},$$

where $\_$ is the wild card that matches any number.
For the special case we have $p \oplus \emptyset = p$, and the operands of the composition operation are interchangeable. The result of the composition is still a set of term-number pairs.
For example,

$$\{(t_1, 1), (t_2, 2), (t_3, 3)\} \oplus \{(t_2, 4)\} = \{(t_1, 1), (t_2, 6), (t_3, 3)\},$$

or

$$\{(t_1, 2\%), (t_2, 5\%), (t_3, 9\%)\} \oplus \{(t_1, 1\%), (t_2, 3\%)\} = \{(t_1, 3\%), (t_2, 8\%), (t_3, 9\%)\}.$$
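A small Python sketch of the composition operation just defined, with sets of term-number pairs represented as dictionaries; it reproduces the first numeric example above.

```python
def compose(p1, p2):
    """Composition p1 (+) p2 of two sets of term-number pairs:
    shared terms have their numbers added; unshared pairs are kept as-is."""
    result = dict(p1)
    for term, value in p2.items():
        result[term] = result.get(term, 0) + value
    return result

p1 = {"t1": 1, "t2": 2, "t3": 3}
p2 = {"t2": 4}
print(compose(p1, p2))   # {'t1': 1, 't2': 6, 't3': 3}
print(compose(p1, {}))   # composition with the empty set returns p1 unchanged
```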
Formally, for all positive documents $d_i \in D^+$, we first deploy its closed patterns on a common set of terms $T$ in order to obtain the following d-patterns (deployed patterns, nonsequential weighted patterns):

$$\hat{d}_i = \{(t_{i1}, n_{i1}), (t_{i2}, n_{i2}), \ldots, (t_{im}, n_{im})\}, \quad (2)$$

where $t_{ij}$ in pair $(t_{ij}, n_{ij})$ denotes a single term and $n_{ij}$ is its support in $d_i$, which is the total absolute support given by closed patterns that contain $t_{ij}$; or, simply in this paper, $n_{ij}$ is the total number of closed patterns that contain $t_{ij}$.
For example, using Fig. 1 and Table 1, we have $sup_a(\langle t_3, t_4, t_6 \rangle) = 3$, $sup_a(\langle t_1, t_2 \rangle) = 3$, $sup_a(\langle t_6 \rangle) = 5$, and

$$\hat{d} = \{(t_1, 3), (t_2, 3), (t_3, 3), (t_4, 3), (t_6, 8)\}.$$
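The d-pattern of eq. (2) can be obtained by summing, for each term, the absolute supports of the closed patterns that contain it. The sketch below reproduces the d-pattern computed above; it illustrates only this deployment step, not the full Algorithm 1.

```python
def deploy(closed_patterns):
    """Build a d-pattern: each term's support is the total absolute support
    contributed by the closed patterns that contain it (eq. (2))."""
    d_pattern = {}
    for terms, support in closed_patterns:
        for t in terms:
            d_pattern[t] = d_pattern.get(t, 0) + support
    return d_pattern

# Closed patterns of the running example with their absolute supports.
closed = [(("t3", "t4", "t6"), 3), (("t1", "t2"), 3), (("t6",), 5)]
print(deploy(closed))
# {'t3': 3, 't4': 3, 't6': 8, 't1': 3, 't2': 3} -- matches the d-pattern above
```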
The process of calculating d-patterns can be easily described by using the $\oplus$ operation in Algorithm 1 (PTM) shown in Fig. 2 that will be described in the next section, where a term's support is the total number of closed patterns that contain the term.
Table 3 illustrates a real example of pattern taxonomy for
a set of positive documents.
We can also obtain the d-patterns of the five sample documents in Table 3, which are expressed as follows:

Fig. 2. Algorithm 1: PTM ($D^+$, min_sup).

$$\hat{d}_1 = \{(carbon, 2), (emiss, 1), (air, 1), (pollut, 1)\},$$
$$\hat{d}_2 = \{(greenhous, 1), (global, 2), (emiss, 1)\},$$
$$\hat{d}_3 = \{(greenhous, 1), (global, 1), (emiss, 1)\},$$
$$\hat{d}_4 = \{(carbon, 1), (air, 2), (antarct, 1)\},$$
$$\hat{d}_5 = \{(emiss, 1), (global, 1), (pollut, 1)\}.$$

Let $DP$ be a set of d-patterns in $D^+$, and $p \in DP$ be a d-pattern. We call $p(t)$ the absolute support of term $t$, which is the number of patterns that contain $t$ in the corresponding pattern taxonomies. In order to effectively deploy patterns in different taxonomies from the different positive documents, d-patterns will be normalized using the following assignment:

$$p(t) \leftarrow p(t) \times \frac{1}{\sum_{t' \in T} p(t')}.$$

Actually, the relationship between d-patterns and terms can be explicitly described by the following association mapping [25], a set-valued function

$$\beta : DP \rightarrow 2^{T \times [0,1]},$$

such that

$$\beta(p_i) = \{(t_1, w_1), (t_2, w_2), \ldots, (t_k, w_k)\}, \quad (3)$$

for all $p_i \in DP$, where $p_i = \{(t_1, f_1), (t_2, f_2), \ldots, (t_k, f_k)\} \in DP$,

$$w_i = \frac{f_i}{\sum_{j=1}^{k} f_j},$$

and $T = \{t \mid (t, f) \in p, p \in DP\}$.
$\beta(p_i)$ is called the normal form (or normalized d-pattern) of d-pattern $p_i$ in this paper, and $termset(p_i) = \{t_1, t_2, \ldots, t_k\}$.
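The normalization and the association mapping amount to dividing each term's support by the total support in its d-pattern. A minimal sketch, applied to two of the sample d-patterns listed above:

```python
def normal_form(d_pattern):
    """beta(p): normalize a d-pattern so that its term weights sum to 1."""
    total = sum(d_pattern.values())
    return {t: n / total for t, n in d_pattern.items()}

d1 = {"carbon": 2, "emiss": 1, "air": 1, "pollut": 1}
d2 = {"greenhous": 1, "global": 2, "emiss": 1}
for d in (d1, d2):
    print(normal_form(d))
# {'carbon': 0.4, 'emiss': 0.2, 'air': 0.2, 'pollut': 0.2}
# {'greenhous': 0.25, 'global': 0.5, 'emiss': 0.25}
```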

4.2 D-Pattern Mining Algorithm
To improve the efficiency of the pattern taxonomy mining,
an algorithm, SPMining, was proposed in [50] to find all closed sequential patterns, which used the well-known Apriori property in order to reduce the search space.
Algorithm 1 (PTM) shown in Fig. 2 describes the
training process of finding the set of d-patterns. For every
positive document, the SPMining algorithm is first called
in step 4 giving rise to a set of closed sequential patterns
SP . The main focus of this paper is the deploying process,
which consists of the d-pattern discovery and term
support evaluation. In Algorithm 1 (Fig. 2), all discovered
patterns in a positive document are composed into a d-pattern, giving rise to a set of d-patterns $DP$, in steps 6 to 9. Thereafter, from steps 12 to 19, term supports are calculated based on the normal forms for all terms in d-patterns.

TABLE 3. Example of a Set of Positive Documents Consisting of Pattern Taxonomies. (The number beside each sequential pattern indicates the absolute support of the pattern.)

Let $m = |T|$ be the number of terms in $T$, $n = |D^+|$ be the number of positive documents in a training set, $K$ be the average number of discovered patterns in a positive document, and $k$ be the average number of terms in a discovered pattern. We also assume that the basic operation is a comparison between two terms.
The time complexity of the d-pattern discovery (from steps 6 to 9) is $O(Kk^2 n)$. Step 10 takes $O(mn)$. Step 12 also gets all terms from d-patterns and takes $O(m^2 n^2)$. Steps 13 to 15 initialize the support function and take $O(m)$, and steps 16 to 20 take $O(mn)$. Therefore, the time complexity of pattern deploying is

$$O(Kk^2 n + mn + m^2 n^2 + m + mn) = O(Kk^2 n + m^2 n^2).$$

After the supports of terms have been computed from the training set, the following weight will be assigned to all incoming documents $d$ for deciding their relevance:

$$weight(d) = \sum_{t \in T} support(t) \cdot \tau(t, d), \quad (4)$$

where $support(t)$ is defined in Algorithm 1 (Fig. 2), and $\tau(t, d) = 1$ if $t \in d$; otherwise $\tau(t, d) = 0$.
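Once $support(t)$ has been learned, eq. (4) reduces to summing the supports of an incoming document's terms that occur in $T$ (the indicator selects them). A minimal sketch; the support values here are illustrative only, not taken from the experiments.

```python
def weight(document_terms, support):
    """weight(d) = sum of support(t) over terms t of d that occur in T (eq. (4))."""
    return sum(support[t] for t in document_terms if t in support)

# Illustrative term supports learned from a training set.
support = {"carbon": 0.5, "emiss": 0.25, "global": 0.125, "air": 0.0625}
doc = {"carbon", "tax", "emiss"}
print(weight(doc, support))  # 0.75
```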

5 INNER PATTERN EVOLUTION

In this section, we discuss how to reshuffle supports of
terms within normal forms of d-patterns based on negative
documents in the training set. The technique will be useful
to reduce the side effects of noisy patterns because of the
low-frequency problem. This technique is called inner
pattern evolution here, because it only changes a pattern’s
term supports within the pattern.
A threshold is usually used to classify documents into
relevant or irrelevant categories. Using the d-patterns, the
threshold can be defined naturally as follows:

$$Threshold(DP) = \min_{p \in DP} \Big( \sum_{(t,w) \in \beta(p)} support(t) \Big). \quad (5)$$

A noise negative document $nd$ in $D^-$ is a negative document that the system falsely identified as a positive one, that is, $weight(nd) \geq Threshold(DP)$. In order to reduce the noise, we need to track which d-patterns have been used to give rise to such an error. We call these patterns offenders of $nd$.
An offender of $nd$ is a d-pattern that has at least one term in $nd$. The set of offenders of $nd$ is defined by

$$\Delta(nd) = \{p \in DP \mid termset(p) \cap nd \neq \emptyset\}. \quad (6)$$

There are two types of offenders: 1) a complete conflict offender, which is a subset of $nd$; and 2) a partial conflict offender, which contains part of the terms of $nd$.
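A sketch of eqs. (5) and (6): the threshold is the minimum summed support over the normal forms of the d-patterns, a noise negative document is one scoring at or above it, and its offenders are classified as complete or partial conflict offenders. The normal forms and supports below are illustrative, not the paper's data.

```python
def weight(doc_terms, support):
    return sum(w for t, w in support.items() if t in doc_terms)

def threshold(normal_forms, support):
    """Threshold(DP): minimum, over d-patterns, of the summed supports
    of the terms in each normal form (eq. (5))."""
    return min(sum(support[t] for t in nf) for nf in normal_forms)

def offenders(nd_terms, normal_forms):
    """Offenders of a noise negative document nd (eq. (6)), tagged as
    'complete' (termset within nd) or 'partial' conflict offenders."""
    result = []
    for nf in normal_forms:
        terms = set(nf)
        if terms & nd_terms:
            kind = "complete" if terms <= nd_terms else "partial"
            result.append((kind, sorted(terms)))
    return result

# Illustrative normal forms (term -> normalized support) and global supports.
normal_forms = [{"t1": 0.5, "t2": 0.5}, {"t3": 0.3, "t4": 0.3, "t6": 0.4}]
support = {"t1": 0.2, "t2": 0.2, "t3": 0.1, "t4": 0.1, "t6": 0.3}
thr = threshold(normal_forms, support)   # min(0.4, 0.5) = 0.4
nd = {"t1", "t2", "t6", "t9"}
if weight(nd, support) >= thr:           # 0.2 + 0.2 + 0.3 >= 0.4, so nd is noise
    print(offenders(nd, normal_forms))
# [('complete', ['t1', 't2']), ('partial', ['t3', 't4', 't6'])]
```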
The basic idea of updating patterns is explained as follows:
complete conflict offenders are removed from d-patterns
first. For partial conflict offenders, their term supports are
reshuffled in order to reduce the effects of noise documents.
The main process of inner pattern evolution is implemented by the algorithm IPEvolving (see Algorithm 2 in
Fig. 3). The inputs of this algorithm are a set of d-patterns
$DP$ and a training set $D = D^+ \cup D^-$. The output is a composed d-pattern. Step 2 in IPEvolving is used to estimate the threshold for finding the noise negative documents. Steps 3 to 10 revise term supports by using all noise negative documents. Step 4 is to find the noise documents and the corresponding offenders. Step 5 gets the normal forms $NDP$ of the d-patterns. Step 6 calls algorithm Shuffling (see Algorithm 3 in Fig. 4) to update $NDP$ according to the noise documents. Steps 7 to 9 compose the updated normal forms together.
The time complexity of Algorithm 2 in Fig. 3 is decided by step 2, the number of calls to the Shuffling algorithm, and the number of uses of the $\oplus$ operation. Step 2 takes $O(nm)$. For each noise negative document $nd$, the algorithm gets its offenders, which takes $O(nm\,|nd|)$ in step 4, and then calls Shuffling once. After that, it performs $n$ $\oplus$ operations, which takes $O(nmm) = O(nm^2)$.
The task of algorithm Shuffling is to tune the support distribution of terms within a d-pattern. A different strategy is adopted in this algorithm for each type of offender. As stated in step 2 of algorithm Shuffling, complete conflict offenders (d-patterns) are removed, since all elements within these d-patterns are held by the negative documents, indicating that they can be discarded to prevent interference from these possible "noises."
The parameter offering is used in step 4 for the purpose of temporarily storing the reduced supports of some terms in a partial conflict offender. The offering is part of the sum of the supports of the terms in a d-pattern that also appear in the noise document. The algorithm calculates the base in step 5, which is certainly not zero since $termset(p) - nd \neq \emptyset$, and then updates the support distributions of the terms in step 6.
Fig. 3. Algorithm 2: IPEvolving ($D^+$, $D^-$, $DP$, $\mu$).
Fig. 4. Algorithm 3: Shuffling ($nd$, $\Delta(nd)$, $NDP$, $\mu$).

For example, consider the following d-pattern:

$$\hat{d} = \{(t_1, 3), (t_2, 3), (t_3, 3), (t_4, 3), (t_6, 8)\}.$$

Its normal form is

$$\{(t_1, 3/20), (t_2, 3/20), (t_3, 3/20), (t_4, 3/20), (t_6, 2/5)\}.$$

Assume $nd = \{t_1, t_2, t_6, t_9\}$; $\hat{d}$ will be a partial conflict offender since

$$termset(\hat{d}) \cap nd = \{t_1, t_2, t_6\} \neq \emptyset.$$

Let $\mu = 2$; then $offering = \frac{1}{2}\big(\frac{3}{20} + \frac{3}{20} + \frac{2}{5}\big) = \frac{7}{20}$ and $base = \frac{3}{20} + \frac{3}{20} = \frac{3}{10}$. Hence, we can get the following updated normal form by using algorithm Shuffling:

$$\{(t_1, 3/40), (t_2, 3/40), (t_3, 13/40), (t_4, 13/40), (t_6, 1/5)\}.$$
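The reshuffling step for a partial conflict offender, as described above, can be sketched as follows: supports of the terms shared with the noise document are divided by $\mu$, and an offering of $1/\mu$ of their summed support is redistributed over the remaining terms in proportion to their supports (whose sum is the base). This is a reading of the worked example rather than the authors' exact Algorithm 3; with $\mu = 2$ the redistributed offering equals the mass removed, so the normal form still sums to 1.

```python
from fractions import Fraction as F

def shuffle_partial(normal_form, nd_terms, mu):
    """Reshuffle a partial conflict offender: divide the supports of terms
    shared with the noise document by mu, and redistribute an offering of
    (1/mu) of their summed support over the remaining terms, proportionally
    to their supports (whose sum is the base)."""
    shared = {t for t in normal_form if t in nd_terms}
    offering = sum(normal_form[t] for t in shared) / mu
    base = sum(w for t, w in normal_form.items() if t not in shared)
    return {t: (w / mu if t in shared else w + offering * w / base)
            for t, w in normal_form.items()}

nf = {"t1": F(3, 20), "t2": F(3, 20), "t3": F(3, 20), "t4": F(3, 20), "t6": F(2, 5)}
print(shuffle_partial(nf, nd_terms={"t1", "t2", "t6", "t9"}, mu=2))
# t1, t2 -> 3/40; t3, t4 -> 13/40; t6 -> 1/5, matching the worked example.
```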

Let $m = |T|$, $n = |D^+|$ be the number of positive documents in a training set, and $q$ be the number of noise negative documents in $D^-$. The time complexity of algorithm Shuffling is decided by steps 6 to 9. For a given noise negative document $nd$, its time complexity is $O(nm^2)$ if we let $nd = nd \cap T$, where $T = \{t \in termset(p) \mid p \in DP\}$. Hence, the time complexity of algorithm Shuffling is $O(nm^2)$ for a given noise negative document.
Based on the above analysis of Algorithms 2 and 3, the total time complexity of the inner pattern evolution is

$$O(nm + q(nm\,|nd| + nm^2) + nm^2) = O(qnm^2),$$

considering that the noise negative document $nd$ can be replaced by $nd \cap T$ before conducting the pattern evolution.

The proposed model includes two phases: the training phase and the testing phase. In the training phase, the proposed model first calls Algorithm PTM ($D^+$, min_sup) to find d-patterns in the positive documents ($D^+$) based on a min_sup, and evaluates term supports by deploying d-patterns to terms. It also calls Algorithm IPEvolving ($D^+$, $D^-$, $DP$, $\mu$) to revise term supports using noise negative documents in $D^-$ based on an experimental coefficient $\mu$. In the testing phase, it evaluates the weights of all incoming documents using eq. (4). The incoming documents can then be sorted based on these weights.

6 EVALUATION AND DISCUSSION

In this study, Reuters text collection is used to evaluate the
proposed approach. Term stemming and stopword removal
techniques are used in the prior stage of text preprocessing.
Several common measures are then applied for performance evaluation and our results are compared with the
state-of-the-art approaches in data mining, concept-based, and
term-based methods.

6.1 Experimental Data Set
The most widely used data set currently is RCV1, which includes 806,791 news articles for the period between 20 August 1996 and 19 August 1997. These documents were formatted using a structured XML schema. The TREC filtering track has developed and provided two groups of topics (100 in total) for RCV1 [37]. The first group includes 50 topics that were composed by human assessors, and the second group includes 50 topics that were constructed artificially from intersections of topics. Each topic divides the documents into two parts: the training set and the testing set. The training set has a total of 5,127 articles and the testing set contains 37,556 articles. Documents in both sets are labeled either positive or negative, where "positive" means the document is relevant to the assigned topic and "negative" means it is not.
All experimental models use the "title" and "text" of the XML documents only. The content in "title" is treated as a paragraph, like those in "text," which consists of paragraphs. For dimensionality reduction, stopword removal is applied and the Porter algorithm [33] is selected for suffix stripping. Terms with term frequency equal to one are discarded.
6.2 Measures
Several standard measures based on precision and recall are
used. The precision is the fraction of retrieved documents
that are relevant to the topic, and the recall is the fraction of
relevant documents that have been retrieved.
The precision of the first K returned documents (top-K) is also adopted in this paper. The value of K we use in the experiments is 20. In addition, the breakeven point (b/p) is used to provide another measurement for performance evaluation. It indicates the point where the value of precision equals the value of recall for a topic. The higher the figure of b/p, the more effective the system is. The b/p measure has been frequently used in common information retrieval evaluations.
In order to assess the effect involving both precision and
recall, another criterion that can be used for experimental evaluation is the F-measure [20], which combines precision and recall and can be defined by the following equation:

$$F\text{-}measure = \frac{(\beta^2 + 1) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}, \quad (7)$$

where $\beta$ is a parameter giving weights to precision and recall and can be viewed as the relative degree of importance attributed to precision and recall [41]. A value $\beta = 1$ is adopted in our experiments, meaning that it attributes equal importance to precision and recall. When $\beta = 1$, the measure is expressed as

$$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}. \quad (8)$$

The value of $F_{\beta=1}$ is equivalent to the b/p when precision equals recall. However, the b/p cannot be compared directly to the $F_{\beta=1}$ value since the latter is given a higher score than the former [54]. It has also been stated in [30] that the $F_{\beta=1}$ measure is greater than or equal to the value of b/p.
Both the b/p and the F-measure are single-valued measures in that they use only one figure to reflect the performance over all the documents. However, we need more figures to evaluate the system as a whole. Hence, another measure, Interpolated Average Precision (IAP), is introduced; it has been adopted before in several research works [17], [43], [54]. This measure is used to compare the performance of different systems by averaging precisions at 11 standard recall levels (i.e., recall = 0.0, 0.1, ..., 1.0). The 11-points measure used in our comparison tables indicates the first value of the 11 points, where recall equals zero. Moreover, Mean Average Precision (MAP) is used in our evaluation, which is calculated by measuring precision at each relevant document first and then averaging precisions over all topics.
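The measures above can be computed from a ranked list and a set of relevance judgments as in the following sketch (toy data, not the evaluation scripts used for the experiments); f_beta implements eq. (7), and beta = 1 gives the F1 of eq. (8).

```python
def precision_recall_at(ranked, relevant, k):
    retrieved = ranked[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / k, hits / len(relevant)

def f_beta(precision, recall, beta=1.0):
    """F-measure of eq. (7); beta = 1 gives the F1 of eq. (8)."""
    if precision + recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

def interpolated_average_precision(ranked, relevant, levels=11):
    """IAP: average, over the 11 standard recall levels 0.0, 0.1, ..., 1.0,
    of the best precision achievable at or beyond each level."""
    points = [precision_recall_at(ranked, relevant, k)
              for k in range(1, len(ranked) + 1)]
    iap = []
    for i in range(levels):
        level = i / (levels - 1)
        best = max((p for p, r in points if r >= level), default=0.0)
        iap.append(best)
    return sum(iap) / levels

ranked = ["d1", "d2", "d3", "d4", "d5"]   # system ranking (toy)
relevant = {"d1", "d3"}                   # relevance judgments (toy)
p, r = precision_recall_at(ranked, relevant, k=2)
print(p, r, f_beta(p, r))                 # 0.5 0.5 0.5
print(interpolated_average_precision(ranked, relevant))
```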

6.3 Baseline Models
In order to make a comprehensive evaluation, we choose
three classes of models as the baseline models. The first
class includes several data mining-based methods that we
have introduced in Section 3. In the following, we
introduce the other two classes: the concept-based model and
term-based methods.
6.3.1 Concept-Based Models
A new concept-based model was presented in [45] and [46],
which analyzed terms on both sentence and document levels.
This model used a verb-argument structure which split a
sentence into verbs and their arguments. For example, “John
hits the ball,” where “hits” is a verb, and “John” or “the ball”
are the arguments of “hits.” Arguments can be further
assigned labels such as subjects or objects (or theme).
Therefore, a term can be extended to be either an
argument or a verb, and a concept is a labeled term.
For a document d, tfðcÞ is the number of occurrences of
concept c in d; and ctfðcÞ is called the conceptual term
frequency of concept c in a sentence s, which is the number
of occurrences of concept c in the verb-argument structure
of sentence s. Given a concept c, its tf and ctf can be
normalized as $tf_{weight}(c)$ and $ctf_{weight}(c)$, and its weight can be evaluated as follows:

$$weight(c) = tf_{weight}(c) + ctf_{weight}(c).$$
To have a uniform representation, in this paper we call a concept a concept-pattern, which is a set of terms. For example, the verb "hits" is denoted as {hits} and its argument "the ball" is denoted as {the, ball}.
It is complicated to construct a COG. Also, up to now, we have not found any work on constructing a COG that describes the semantic structures of a set of documents, rather than of an individual document, for information filtering. In order to give a comprehensive evaluation for comparing the proposed model with the concept-based model, in this paper we design a concept-based model (CBM) for describing the features in a set of positive documents, which consists of two steps. The first step is to find all of the concepts in the positive documents of the training set, where verbs are extracted from the PropBank data set at http://verbs.colorado.edu/verb-index/propbank-1.0.tar.gz. The second step is to use the deploying approach to evaluate the weights of terms based on their appearances in these discovered concepts. Unlike the proposed model, which uses 4,000 features at most, the concept-based model uses all features for each topic. Let $CP_i$ be the set of concepts in $d_i \in D^+$. To synthesize both tf and ctf of concepts in all positive documents, we use the following equation to evaluate term weights:

$$W(t) = \sum_{i=1}^{|D^+|} \frac{|\{c \mid c \in CP_i, t \in c\}|}{\sum_{c \in CP_i} |c|}, \quad (9)$$

for all $t \in T$.
We also designed another kind of concept-based model, called CBM Pattern Matching, which evaluates a document $d$'s relevance by accumulating the weights of the concepts that appear in $d$ as follows:

$$weight(d) = \sum_{c \in d} weight(c). \quad (10)$$

6.3.2 Term-Based Methods
There are many classic term-based approaches. The Rocchio algorithm [36], which has been widely adopted in information retrieval, can build a text representation of a training set using a centroid $\vec{c}$ as follows:

$$\vec{c} = \alpha \frac{1}{|D^+|} \sum_{\vec{d} \in D^+} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|D^-|} \sum_{\vec{d} \in D^-} \frac{\vec{d}}{\|\vec{d}\|}, \quad (11)$$

where $\alpha$ and $\beta$ are empirical parameters; $D^+$ and $D^-$ are the sets of positive and negative documents, respectively; and $\vec{d}$ denotes a document.
Probabilistic methods (Prob) are well-known term-based approaches. The following is the best one:

$$W(t) = \log\left( \frac{r + 0.5}{R - r + 0.5} \Big/ \frac{n - r + 0.5}{(N - n) - (R - r) + 0.5} \right), \quad (12)$$

where $N$ and $R$ are the total number of documents and the number of positive documents in the training set, respectively; $n$ is the number of documents which contain $t$; and $r$ is the number of positive documents which contain $t$.
In addition, TFIDF is also widely used. The term $t$ can be weighted by $W(t) = TF(d, t) \cdot IDF(t)$, where the term frequency $TF(d, t)$ is the number of times that term $t$ occurs in document $d$ ($d \in D$, where $D$ is the set of documents in the data set); $DF(t)$ is the document frequency, which is the number of documents that contain term $t$; and $IDF(t)$ is the inverse document frequency.
Another well-known term-based model is the BM25 approach, which is basically considered the state-of-the-art baseline in IR [35]. The weight of a term $t$ can be estimated by using the following function:

$$W(t) = \frac{TF \cdot (k_1 + 1)}{k_1 \left( (1 - b) + b \frac{DL}{AVDL} \right) + TF} \cdot \log\left( \frac{(r + 0.5)/(n - r + 0.5)}{(R - r + 0.5)/(N - n - R + r + 0.5)} \right), \quad (13)$$

where $TF$ is the term frequency; $k_1$ and $b$ are parameters; and $DL$ and $AVDL$ are the document length and average document length. The values of $k_1$ and $b$ are set as 1.2 and 0.75, respectively, according to the suggestions in [47] and [48].
The SVM model is also a well-known learning method introduced by Cortes and Vapnik [8]. Since the works of Joachims [15], [16], researchers have successfully applied SVM to many related tasks and presented some convincing results [5], [6], [27], [39], [55]. The decision function in SVM is defined as

$$h(x) = sign(W \cdot x + b) = \begin{cases} +1, & \text{if } (W \cdot x + b) > 0, \\ -1, & \text{else,} \end{cases} \quad (14)$$

where $x$ is the input space, $b \in \mathbb{R}$ is a threshold, and

$$W = \sum_{i=1}^{l} y_i \alpha_i x_i$$

for the given training data

$$(x_1, y_1), \ldots, (x_l, y_l), \quad (15)$$

where $x_i \in \mathbb{R}^n$ and $y_i$ equals $+1$ ($-1$) if document $x_i$ is labeled positive (negative); $\alpha_i \in \mathbb{R}$ is the weight of the training example $x_i$ and satisfies the following constraints:

$$\forall i: \alpha_i \geq 0 \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0. \quad (16)$$

Since all positive documents are treated equally before the process of document evaluation, the value of $\alpha_i$ is set as 1.0 for all of the positive documents, and thus the $\alpha_i$ values for the negative documents can be determined by using (16).
In document evaluation, once the concept for a topic is obtained, the similarity between a test document and the concept is estimated using the inner product. The relevance of a document $d$ to a topic can be calculated by the function $R(d) = \vec{d} \cdot \vec{c}$, where $\vec{d}$ is the term vector of $d$ and $\vec{c}$ is the concept of the topic.
For both term-based models and CBM, we use the following equation to assign weights to all incoming documents $d$ based on their corresponding $W$ functions:

$$weight(d) = \sum_{t \in T} W(t) \cdot \tau(t, d).$$
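For reference, the probabilistic weight of eq. (12) and the BM25 weight of eq. (13) can be computed as below, with the quoted parameter values $k_1 = 1.2$ and $b = 0.75$; the document and collection counts are illustrative, not RCV1 statistics.

```python
import math

def rsj_weight(N, R, n, r):
    """Probabilistic (Robertson/Sparck Jones style) weight of eq. (12)."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / ((N - n) - (R - r) + 0.5)))

def bm25_weight(tf, dl, avdl, N, R, n, r, k1=1.2, b=0.75):
    """BM25 term weight of eq. (13): a TF saturation/length-normalization
    factor multiplied by a relevance-weighted IDF component."""
    tf_part = tf * (k1 + 1) / (k1 * ((1 - b) + b * dl / avdl) + tf)
    idf_part = math.log(((r + 0.5) / (n - r + 0.5)) /
                        ((R - r + 0.5) / (N - n - R + r + 0.5)))
    return tf_part * idf_part

# Illustrative counts: 1,000 training documents, 50 positive; the term occurs
# in 100 documents overall, 30 of them positive, 3 times in a 120-word document.
print(rsj_weight(N=1000, R=50, n=100, r=30))
print(bm25_weight(tf=3, dl=120, avdl=150, N=1000, R=50, n=100, r=30))
```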


TABLE 4
The List of Methods Used for Evaluation

6.4 Hypotheses
The major objective of the experiments is to show how the proposed approach can help improve the effectiveness of pattern-based approaches. Hence, to give a comprehensive investigation of the proposed model, our experiments involve comparing the performance of different pattern-based models, concept-based models, and term-based models.
In the experiments, the proposed model is evaluated in terms of the following hypotheses:

. Hypothesis H1. The proposed model, PTM (IPE), is designed to achieve high performance for determining relevant information to answer what users want. The model would be better than other pattern-based models, concept-based models, and state-of-the-art term-based models in effectiveness.
. Hypothesis H2. The proposed deploying method has better performance for the interpretation of discovered patterns in text documents. This deploying approach is not only promising for pattern-based approaches, but also significant for the concept-based model.

In order to compare the proposed approach with others, the baseline models are grouped into three categories as mentioned above. The first category contains all data mining-based (DM) methods, such as sequential pattern mining, sequential closed pattern mining, frequent itemset mining, and frequent closed itemset mining, where min_sup = 0.2. The second category includes the concept-based model that uses the deploying method and the CBM Pattern Matching model; and the last category includes nGram, Rocchio, the Probabilistic model, TFIDF, and two state-of-the-art models, BM25 and SVM. A brief description of these methods is given in Table 4.

TABLE 5. Comparison of All Methods on the First 50 Topics

6.5 Experimental Results
This section presents the results for the evaluation of the
proposed approach PTM (IPE), inner pattern evolving in the
pattern taxonomy model. The results of overall comparisons
are presented in Table 5, and the summarized results are
described in Fig. 5. We list the result obtained based only on
the first 50 TREC topics in Table 5 since not all methods can
complete all tasks in the last 50 TREC topics. As aforementioned, itemset-based data mining methods struggle in some
topics as too many candidates are generated to be processed.
In addition, results obtained based on the first 50 TREC
topics are more practical and reliable since the judgment for
these topics is manually made by domain experts, whereas
the judgment for the last 50 TREC topics is created based on
the metadata tagged in each document.
The most important information revealed in this table is
that our proposed PTM (IPE) outperforms not only the
pattern mining-based methods, but also the term-based
methods including the state-of-the-art methods BM25 and
SVM. PTM (IPE) also outperforms CBM Pattern Matching
and CBM in the five measures. CBM outperforms all other
models for the first 50 topics. For the time complexity in the
testing phase, all models take OðjT j jdjÞ for all incoming
documents d. In our experiments, all models used 702 terms on average for each topic. Therefore, there is no significant
difference between these models on time complexity in the
testing phase.

Fig. 5. Comparison of PTM (IPE) and other major models in five measures for the 100 topics.

Fig. 7. Comparison in the number of patterns used for training by each method on the first 50 topics (r101-r150) and the rest of the topics (r151-r200).

Fig. 8. Comparison of PTM (IPE) and TFIDF in top-20 precision.

TABLE 6. Performance of Inner Pattern Evolving in PTM on All Topics

6.6 Discussion

6.6.1 PDM to IPE
Table 6 depicts the figures of the evaluation measures achieved by the inner pattern evolving method (IPE) and the pure pattern deploying method (PDM) on all RCV1 topics. As we can see from the table, the evolving method (IPE) outperforms PDM in all measures.
In order to evaluate the effectiveness of PTM (IPE), we attempt to find the correlation between the achieved improvement and the parameter Ratio, denoting the ratio of the number of negative documents with weights greater than the threshold to the number of all documents. This value can be obtained using the following equation:

$$Ratio = \frac{|\{d \mid d \in D^-, weight(d) \geq threshold(DP)\}|}{|D^+| + |D^-|}. \quad (17)$$

Fig. 6 illustrates the relationship between the improvement obtained when inner evolving is applied and the abovementioned value of Ratio. As we can see, the degree of improvement is in direct proportion to the score of Ratio. That means the more qualified negative documents are detected for concept revision, the more improvement we can achieve. In other words, the expected result can be achieved by using the proposed approach.

Fig. 6. The relationship between the proportion in number of negative documents greater than the threshold to all documents and the corresponding improvement on IPE with $\mu = 5$ on improved topics.

6.6.2 PTM (IPE) versus Other Models
The number of patterns used for training by each method is
shown in Fig. 7. The total number of patterns is estimated by
accumulating the number for each topic. As a result, the
figure shows that PTM (IPE) is the method that utilizes the fewest patterns for concept learning compared to the others. This is because an efficient scheme of pattern pruning is applied in the PTM (IPE) method. In contrast, since the classic methods such as Rocchio, Prob, and TFIDF adopt terms as patterns in the feature space, they use many more patterns than the proposed PTM (IPE) method and slightly fewer than the sequential closed pattern mining method. In particular, nGram and the concept-based models are the methods with the lowest performance, requiring more than 15,000 patterns for concept learning. In addition, the total number of patterns obtained based on the first 50 topics is almost the same as the number obtained based on the last 50 topics for all methods except PTM (IPE). The figure based on the first topics group (r101-r150) for PTM (IPE) is less than that based on the other group (r151-r200). This can be explained by the fact that a higher proportion of closed patterns is obtained by using PTM (IPE) on the first topics group.


Fig. 9. Comparing PTM (IPE) with Data Mining methods on the first
50 TREC topics.

A further investigation in the comparison of PTM (IPE)
and TFIDF in top-20 precision on all RCV1 topics is depicted
in Fig. 8. It is obvious that PTM (IPE) is superior to TFIDF, as it can be seen that positive results are distributed over all topics, especially the first 50 topics. Another observation is that the scores on the first 50 topics are better than those on the last fifty. That is because of the different ways of generating these two sets of topics, which has been mentioned before. An interesting behavior is that there are a few topics where TFIDF outperforms PTM. After further investigation, we found a similar characteristic of these topics: only a few positive examples are available in each of them. For example, topic r157, which is the worst case for PTM (IPE) compared to TFIDF, has only three positive documents available. Note that the average number of positive documents for each topic is over 12. Similar behaviors are found in topics r134 and r144.
The plotting of precisions on 11 standard points for PTM
(IPE) and pattern mining based methods on the first 50 topics
is illustrated in Fig. 9. The result supports the superiority of
the PTM (IPE) method and highlights the importance of the
adoption of proper pattern deploying and pattern evolving
methods to a pattern-based knowledge discovery system.
Comparing their performance at the first few points around
the low-recall area, it is also found that the points for pattern
mining methods drop rapidly as the recall value rises and
then keep a relatively gradual slope from the mid recall
period to the end. All four pattern mining methods achieve
similar results. However, the plotting curve for PTM (IPE) is
much smoother than those for pattern mining methods as
there is no severe fluctuation on it. Another observation on
this figure is that the pattern mining-based methods nevertheless perform well at the point where recall is close to zero, despite their overall unpromising results. Accordingly, we
can conclude that the pattern mining-based methods can
improve the performance in the low-recall situation.
Although the PTM (IPE) is equipped with the pattern
mining algorithm for discovering sequential closed patterns,
the promising results cannot be produced without the help of the successful application of the proposed d-patterns and inner pattern evolving. The proper usage of d-patterns, which has been proven previously, can overcome the misinterpretation problem and provide a feasible solution to effectively exploit the vast amount of patterns generated by data mining algorithms. Moreover, the employment of IPE provides the mechanism to utilize the information from negative examples to overcome the low-frequency problem. In conclusion, the experimental results provide evidence showing that the PTM (IPE) method is an ideal model for further developing pattern mining-based approaches.

Fig. 10. Comparing PTM (IPE) with concept-based models on all 100 TREC topics.
As mentioned in the last section, PTM (IPE) outperforms CBM Pattern Matching and CBM in all five measures, and CBM outperforms all other models for the first 50 topics. It appears that the concept-based model has promising potential for improving the performance of text mining in the future. Fig. 10 shows the plotting of precision at 11 standard points for PTM (IPE), CBM, and CBM Pattern Matching. It also shows that the deploying approach for using concepts to answer what users want is significant for the concept-based model, because CBM is much better than the CBM Pattern Matching model. In general, the PTM (IPE) method outperforms CBM in these experiments.

Fig. 11. Comparing PTM (IPE) with term-based methods on the first 50 TREC topics.
Fig. 11 presents the plotting of precisions at 11 standard
points for PTM (IPE) and term-based methods on the first 50
topics. Compared to the previous plotting in Fig. 9, the differences in performance among the methods are easier to recognize in this figure. Again, the PTM (IPE) method outperforms all other methods. Among these methods, the nGram method achieves a noticeable score of precision at the first point, where recall equals zero, meaning that the nGram method is able to promote the top relevant documents toward the front of the ranking list. As mentioned before, data mining-based methods can perform well in the low-recall area, which can explain why nGram has better results at this point. However, the scores for the nGram method drop rapidly at the following couple of points. During that period, the SVM, BM25, Rocchio, and Prob methods surpass the nGram method and keep their superiority until the last point, where recall equals 1. There is no doubt that the lowest performance is produced by the TFIDF method, which outperforms the nGram method only at the last few recall points. In addition, the Prob method is superior to the nGram method, but inferior to the Rocchio method. The overall performance of Rocchio is better than that of the Prob method, which corresponds to the finding in [50].
In summary, the proposed approach PTM (IPE) achieves an outstanding performance for text mining compared with up-to-date data mining-based methods, the concept-based models, and the well-known term-based methods, including the state-of-the-art BM25 and SVM models. The results show that the PTM (IPE) model can produce encouraging gains in effectiveness, in particular over the SVM and CBM models. These results strongly support Hypothesis H1. The promising results can be explained by the fact that the use of the deploying method is promising (Hypothesis H2 is also supported) for solving the misinterpretation problem, because it can combine the advantages of terms and discovered patterns or concepts. Moreover, the inner pattern evolving strategy provides an effective evaluation for reducing the side effects of noisy patterns, because the estimation of term weights in the term space is based not only on terms' statistical properties but also on the patterns' associations in the corresponding pattern taxonomies.

7 CONCLUSIONS

Many data mining techniques have been proposed in the last
decade. These techniques include association rule mining,
frequent itemset mining, sequential pattern mining, maximum pattern mining, and closed pattern mining. However,
using this discovered knowledge (or these patterns) in the field of text mining is difficult and ineffective. The reason is that some useful long patterns with high specificity lack support (i.e., the low-frequency problem). We argue that not all frequent short patterns are useful. Hence, misinterpretations of patterns derived from data mining techniques lead to ineffective performance. In this research work, an
to the ineffective performance. In this research work, an
effective pattern discovery technique has been proposed to


overcome the low-frequency and misinterpretation problems for text mining. The proposed technique uses two
processes, pattern deploying and pattern evolving, to refine
the discovered patterns in text documents. The experimental
results show that the proposed model outperforms not only
other pure data mining-based methods and the concept-based model, but also state-of-the-art term-based models,
such as BM25 and SVM-based models.

ACKNOWLEDGMENTS
This paper was partially supported by Beijing Natural
Science Foundation (4102007) and Grant DP0988007 from
the Australian Research Council (ARC Discovery Project).
The authors also wish to thank Prof. Peter Bruza,
Dr. Raymond Lau, Dr. Yue Xu, Dr. Xiaohui Tao, and
Dr. Xujuan Zhou for their suggestions. In addition, we
would like to thank anonymous reviewers for their
constructive comments.

REFERENCES
[1] K. Aas and L. Eikvil, "Text Categorisation: A Survey," Technical Report NR 941, Norwegian Computing Center, 1999.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 478-499, 1994.
[3] H. Ahonen, O. Heinonen, M. Klemettinen, and A.I. Verkamo, "Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections," Proc. IEEE Int'l Forum on Research and Technology Advances in Digital Libraries (ADL '98), pp. 2-11, 1998.
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[5] N. Cancedda, N. Cesa-Bianchi, A. Conconi, and C. Gentile, "Kernel Methods for Document Filtering," TREC, trec.nist.gov/pubs/trec11/papers/kermit.ps.gz, 2002.
[6] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders, "Word-Sequence Kernels," J. Machine Learning Research, vol. 3, pp. 1059-1082, 2003.
[7] M.F. Caropreso, S. Matwin, and F. Sebastiani, "Statistical Phrases in Automated Text Categorization," Technical Report IEI-B4-07-2000, Istituto di Elaborazione dell'Informazione, 2000.
[8] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[9] S.T. Dumais, "Improving the Retrieval of Information from External Sources," Behavior Research Methods, Instruments, and Computers, vol. 23, no. 2, pp. 229-236, 1991.
[10] J. Han and K.C.-C. Chang, "Data Mining for Web Intelligence," Computer, vol. 35, no. 11, pp. 64-70, Nov. 2002.
[11] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, 2000.
[12] Y. Huang and S. Lin, "Mining Sequential Patterns Using Graph Search Techniques," Proc. 27th Ann. Int'l Computer Software and Applications Conf., pp. 4-9, 2003.
[13] N. Jindal and B. Liu, "Identifying Comparative Sentences in Text Documents," Proc. 29th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '06), pp. 244-251, 2006.
[14] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with tfidf for Text Categorization," Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 143-151, 1997.
[15] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. European Conf. Machine Learning (ECML '98), pp. 137-142, 1998.
[16] T. Joachims, "Transductive Inference for Text Classification Using Support Vector Machines," Proc. 16th Int'l Conf. Machine Learning (ICML '99), pp. 200-209, 1999.
[17] W. Lam, M.E. Ruiz, and P. Srinivasan, "Automatic Text Categorization and Its Application to Text Retrieval," IEEE Trans. Knowledge and Data Eng., vol. 11, no. 6, pp. 865-879, Nov./Dec. 1999.

ZHONG ET AL.: EFFECTIVE PATTERN DISCOVERY FOR TEXT MINING

[18] D.D. Lewis, “An Evaluation of Phrasal and Clustered Representations on a Text Categorization Task,” Proc. 15th Ann. Int’l ACM
SIGIR Conf. Research and Development in Information Retrieval
(SIGIR ’92), pp. 37-50, 1992.
[19] D.D. Lewis, “Feature Selection and Feature Extraction for Text
Categorization,” Proc. Workshop Speech and Natural Language,
pp. 212-217, 1992.
[20] D.D. Lewis, “Evaluating and Optimizing Automous Text Classification Systems,” Proc. 18th Ann. Int’l ACM SIGIR Conf. Research
and Development in Information Retrieval (SIGIR ’95), pp. 246-254,
1995.
[21] X. Li and B. Liu, “Learning to Classify Texts Using Positive and
Unlabeled Data,” Proc. Int’l Joint Conf. Artificial Intelligence (IJCAI
’03), pp. 587-594, 2003.
[22] Y. Li, W. Yang, and Y. Xu, “Multi-Tier Granule Mining for
Representations of Multidimensional Association Rules,” Proc.
IEEE Sixth Int’l Conf. Data Mining (ICDM ’06), pp. 953-958, 2006.
[23] Y. Li, C. Zhang, and J.R. Swan, “An Information Filtering Model
on the Web and Its Application in Jobagent,” Knowledge-Based
Systems, vol. 13, no. 5, pp. 285-296, 2000.
[24] Y. Li and N. Zhong, “Interpretations of Association Rules by
Granular Computing,” Proc. IEEE Third Int’l Conf. Data Mining
(ICDM ’03), pp. 593-596, 2003.
[25] Y. Li and N. Zhong, “Mining Ontology for Automatically
Acquiring Web User Information Needs,” IEEE Trans. Knowledge
and Data Eng., vol. 18, no. 4, pp. 554-568, Apr. 2006.
[26] Y. Li, X. Zhou, P. Bruza, Y. Xu, and R.Y. Lau, “A Two-Stage Text
Mining Model for Information Filtering,” Proc. ACM 17th Conf.
Information and Knowledge Management (CIKM ’08), pp. 1023-1032,
2008.
[27] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C.
Watkins, “Text Classification Using String Kernels,” J. Machine
Learning Research, vol. 2, pp. 419-444, 2002.
[28] A. Maedche, Ontology Learning for the Semantic Web. Kluwer
Academic, 2003.
[29] C. Manning and H. Schu¨tze, Foundations of Statistical Natural
Language Processing. MIT Press, 1999.
[30] I. Moulinier, G. Raskinis, and J. Ganascia, “Text Categorization: A
Symbolic Approach,” Proc. Fifth Ann. Symp. Document Analysis and
Information Retrieval (SDAIR), pp. 87-99, 1996.
[31] J.S. Park, M.S. Chen, and P.S. Yu, “An Effective Hash-Based
Algorithm for Mining Association Rules,” Proc. ACM SIGMOD
Int’l Conf. Management of Data (SIGMOD ’95), pp. 175-186, 1995.
[32] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and
M. Hsu, “Prefixspan: Mining Sequential Patterns Efficiently by
Prefix-Projected Pattern Growth,” Proc. 17th Int’l Conf. Data Eng.
(ICDE ’01), pp. 215-224, 2001.
[33] M.F. Porter, “An Algorithm for Suffix Stripping,” Program, vol. 14,
no. 3, pp. 130-137, 1980.
[34] S. Robertson and I. Soboroff, “The Trec 2002 Filtering Track
Report,” TREC, 2002, trec.nist.gov/pubs/trec11/papers/OVER.
FILTERING.ps.gz.
[35] S.E. Robertson, S. Walker, and M. Hancock-Beaulieu, “Experimentation as a Way of Life: Okapi at Trec,” Information Processing
and Management, vol. 36, no. 1, pp. 95-108, 2000.
[36] J. Rocchio, Relevance Feedback in Information Retrieval. chapter 14,
Prentice-Hall, pp. 313-323, 1971.
[37] T. Rose, M. Stevenson, and M. Whitehead, “The Reuters Corpus
Volume1—From Yesterday’s News to Today’s Language Resources,” Proc. Third Int’l Conf. Language Resources and Evaluation,
pp. 29-31, 2002.
[38] G. Salton and C. Buckley, “Term-Weighting Approaches in
Automatic Text Retrieval,” Information Processing and Management:
An Int’l J., vol. 24, no. 5, pp. 513-523, 1988.
[39] M. Sassano, “Virtual Examples for Text Classification with
Support Vector Machines,” Proc. Conf. Empirical Methods in Natural
Language Processing (EMNLP ’03), pp. 208-215, 2003.
[40] S. Scott and S. Matwin, “Feature Engineering for Text Classification,” Proc. 16th Int’l Conf. Machine Learning (ICML ’99), pp. 379388, 1999.
[41] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[42] M. Seno and G. Karypis, “Slpminer: An Algorithm for Finding
Frequent Sequential Patterns Using Length-Decreasing Support
Constraint,” Proc. IEEE Second Int’l Conf. Data Mining (ICDM ’02),
pp. 418-425, 2002.

43

[43] R.E. Shapire and Y. Singer, “Boostexter: A Boosting-Based System
for Text Categorization,” Machine Learning, vol. 39, pp. 135-168,
2000.
[44] R. Sharma and S. Raman, “Phrase-Based Text Representation for
Managing the Web Document,” Proc. Int’l Conf. Information
Technology: Computers and Comm. (ITCC), pp. 165-169, 2003.
[45] S. Shehata, F. Karray, and M. Kamel, “Enhancing Text Clustering
Using Concept-Based Mining Model,” Proc. IEEE Sixth Int’l Conf.
Data Mining (ICDM ’06), pp. 1043-1048, 2006.
[46] S. Shehata, F. Karray, and M. Kamel, “A Concept-Based Model for
Enhancing Text Categorization,” Proc. 13th Int’l Conf. Knowledge
Discovery and Data Mining (KDD ’07), pp. 629-637, 2007.
[47] K. Sparck Jones, S. Walker, and S.E. Robertson, “A Probabilistic
Model of Information Retrieval: Development and Comparative
Experiments—Part 1,” Information Processing and Management,
vol. 36, no. 6, pp. 779-808, 2000.
[48] K. Sparck Jones, S. Walker, and S.E. Robertson, “A Probabilistic
Model of Information Retrieval: Development and Comparative
Experiments—Part 2,” Information Processing and Management,
vol. 36, no. 6, pp. 809-840, 2000.
[49] R. Srikant and R. Agrawal, “Mining Generalized Association
Rules,” Proc. 21th Int’l Conf. Very Large Data Bases (VLDB ’95),
pp. 407-419, 1995.
[50] S.-T. Wu, Y. Li, and Y. Xu, “Deploying Approaches for Pattern
Refinement in Text Mining,” Proc. IEEE Sixth Int’l Conf. Data
Mining (ICDM ’06), pp. 1157-1161, 2006.
[51] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, “Automatic PatternTaxonomy Extraction for Web Mining,” Proc. IEEE/WIC/ACM Int’l
Conf. Web Intelligence (WI ’04), pp. 242-248, 2004.
[52] Y. Xu and Y. Li, “Generating Concise Association Rules,” Proc.
ACM 16th Conf. Information and Knowledge Management (CIKM ’07),
pp. 781-790, 2007.
[53] X. Yan, J. Han, and R. Afshar, “Clospan: Mining Closed Sequential
Patterns in Large Datasets,” Proc. SIAM Int’l Conf. Data Mining
(SDM ’03), pp. 166-177, 2003.
[54] Y. Yang, “An Evaluation of Statistical Approaches to Text
Categorization,” Information Retrieval, vol. 1, pp. 69-90, 1999.
[55] Y. Yang and X. Liu, “A Re-Examination of Text Categorization
Methods,” Proc. 22nd Ann. Int’l ACM SIGIR Conf. Research and
Development in Information Retrieval (SIGIR ’99), pp. 42-49, 1999.
[56] M. Zaki, “Spade: An Efficient Algorithm for Mining Frequent
Sequences,” Machine Learning, vol. 40, pp. 31-60, 2001.
Ning Zhong is currently the head of the Knowledge Information Systems Laboratory and is a
professor in the Department of Life Science and
Informatics at Maebashi Institute of Technology,
Japan. He is also the director and an adjunct
professor in the International WIC Institute
(WICI), Beijing University of Technology. He
has conducted research in the areas of knowledge discovery and data mining, rough sets and
granular-soft computing, Web intelligence, intelligent agents, brain informatics, and knowledge information systems,
with more than 200 journal and conference publications and 20 books.
He was the chair of the IEEE Computer Society Technical Committee on Intelligent Informatics (TCII) from 2006 to 2009. He is
currently the editor-in-chief of Web Intelligence and Agent Systems and
serves as associate editor/editorial board for several international
journals and book series. He is also the cochair of the Web Intelligence Consortium (WIC) and the chair of the IEEE Computational Intelligence Society Task Force on Brain Informatics.


Yuefeng Li is the leader of the eDiscovery Lab
in the Institute for Creative Industries and
Innovation, and a professor in the Discipline of
Computer Science, Faculty of Science and
Technology, Queensland University of Technology, Australia. He has published more than
120 refereed papers (including 30 journal
papers). He is an associate editor of the
International Journal of Pattern Recognition
and Artificial Intelligence and an associate editor
of the IEEE Intelligent Informatics Bulletin. He is also a coauthor of one
book, and an editor of five books. He has supervised six PhD students
and four Masters by Research students to successful completion. He has
established a strong reputation internationally in the fields of Web
Intelligence, Text Mining, and Ontology Learning, and has been awarded
three Australian Research Council grants.


Sheng-Tang Wu received the MS and PhD degrees from the Faculty of Information Technology at Queensland University of Technology, Brisbane, Australia, in 2003 and 2007, respectively. He is currently an assistant professor in
the Department of Applied Informatics and
Multimedia, Asia University, Taiwan. His research interests include data mining, Web
intelligence, information retrieval, information
systems, and multimedia. He also received the Outstanding Doctoral Thesis Award at Queensland
University of Technology.

