Swiss Federal Institute of Technology Lausanne (EPFL) Artificial Intelligence Laboratory

Mini-Project for the course Natural Language and Speech Processing, academic year 1999/2000

Intelligent Email Classifier Report

Authors:

Burak Emir, Samuel Fricker, Yacine Saidji, Lionel-Gomez Sanchez

June 25, 2000

Contents

1 Introduction and overview
  1.1 Goals of the project
  1.2 Overview of this report
2 SpiceMail analysis
  2.1 Overview
  2.2 Architecture
  2.3 Preprocessing
  2.4 Language identification
  2.5 Spelling correction
  2.6 Tagging
  2.7 Lemmatisation
  2.8 Classification
  2.9 Conclusion
3 The architecture of spoon
4 Language Identification
5 Preprocessing
6 Analysis - overview
7 Scanner
8 Spelling Corrector
9 Disambiguation
  9.1 Design
10 Classifier
11 The teachspoon program
12 spoon evaluation
  12.1 Language guessing
  12.2 Theme classification
  12.3 Results
13 Conclusion
A The Hidden Markov Model (HMM)
  A.1 The Viterbi Algorithm
  A.2 The Analysis
  A.3 The Training
B Examples
  B.1 Lexicon lookup
  B.2 Disambiguation module and lemmatizing
  B.3 Evaluation script
C Code documentation
  C.1 README.txt
  C.2 files.txt
  C.3 MANUAL.txt
  C.4 Implementation of the disambiguation module
D Further reading

List of Figures

1 SpiceMail architecture
2 spoon architecture
3 Modularization in SpiceMail and spoon
4 Example for lookup in lexicon
5 Example for disambiguation and lemmatizing
6 Result of 1st validation run
7 Result of 2nd validation run
8 Result of final test run

List of Tables

1 Performance of id langue on our language test corpus
2 Most frequent smallwords
3 Email-specific tokens
4 Common written text tokens
5 Comparison between SpiceMail and spoon lang. identification
6 Corpus-specific tokens
7 Evaluation corpus

1 Introduction and overview

More than 90% of the people connected to the Internet use email as a communication medium. With the exponential growth of the Internet, the volume of email is exploding. Some users, e.g. major e-commerce companies, have a hard time handling the quantity of messages they receive. In this context, an automatic email classification (dispatching) tool can dramatically improve the response time that is so critical to customer retention. Several approaches have been tried in the past, one of them being keyword-based filtering. We propose here an email classifier based on an analysis of the whole message body.

1.1 Goals of the project

Our task is to write an e-mail classifier prototype based on natural language processing techniques. The prototype should be able to classify an incoming email by its contents, given a partition of the user's email database into predefined classes. We will use the following terms for convenience: "spoon" is the name of our prototype, "script" denotes a program that applies batch processing to a set of files, and "NLP" is used as an abbreviation for natural language processing. We were given a former approach to the same problem, SpiceMail [2], and are allowed to reuse parts of its code. The various points were made precise in a specification, which was validated on May 15th, 2000. We repeat them here in abbreviated form.

Expected Results

We were to deliver:

• a working, ready-to-run software package,
• a report on our work,
• a user manual,
• the source code,
• and, possibly, code documentation.

In addition to that, we will hold a talk presenting the software and demonstrating our approach. We discuss the results we have actually achieved in our conclusion.

Modules
Analyzing the SpiceMail software helped us divide our problem into well-defined subtasks, which we will call modules. Our specification document [1] describes each module in more detail, specifies the team members who worked on it and the amount of SpiceMail code we planned to reuse, and lists the requirements for this project.

1.2 Overview of this report

This report is divided into 3 parts:

• analysis of the SpiceMail software (section 2),
• NLP techniques used in spoon (sections 3 - 11),
• evaluation of our work and conclusion (sections 12, 13).


Figure 1: SpiceMail architecture (an incoming email passes through preprocessing, language identification, spelling correction, tagging, lemmatisation and classification, and leaves the chain as a classified email)

2 SpiceMail analysis

2.1 Overview

SpiceMail is the result of a mini-project completed in 1998. The objectives of our effort and theirs are identical, so the hope of reusing their programs with only minor adaptations, and of being able to concentrate on pertinent feature extraction, seemed well founded. This section contains the results of the analysis we performed on their software as well as the conclusions we have drawn. It should be stressed, however, that every program we discuss here is the work of the SpiceMail team.

2.2 Architecture

SpiceMail is an e-mail classifier prototype, more specifically a chain of programs that extracts features from an incoming email and then applies the naive Bayesian classification method. The knowledge about the differences between the classes has to be accessible to the program. This is achieved by supervised learning, i.e. by presenting a set of already classified patterns to the classifier module, which calculates and stores information about features and classes. The SpiceMail developer team achieved separation of levels by rigorously dividing their problem into subtasks. Every subtask corresponds to a separate program. Figure 1 (taken from the SpiceMail report [2]) makes this point clear. What is not shown there is the module that parses mail header lines to separate the message header from the message body; this is achieved by a Perl library module used by the main driver script SpiceMail.

2.3 Preprocessing

SpiceMail relies on the Perl interpreter for preprocessing. Here is the corresponding part of the source code, with our comments ("remove" means "replace with a blank"):

$line =~ s/\n/ /g;      # remove newlines
$line =~ s/\'/\' /g;    # ex: aujourd'hui -> aujourd' hui
$line =~ s/\"/ /g;      # remove "
$line =~ s/\,/ /g;      # remove ,
$line =~ s/\:/ /g;      # remove :
$line =~ s/\;/ /g;      # remove ;
$line =~ s/\(/ /g;      # remove (
$line =~ s/\)/ /g;      # remove )
$line =~ s/\?/ /g;      # remove ?
$line =~ s/\!/ /g;      # remove !
$line =~ s/\-/ /g;      # remove -
$line =~ s/\+/ /g;      # remove +
$line =~ s/\&/ /g;      # remove &
$line =~ s/\*/ /g;      # remove *
$line =~ s/\\/ /g;      # remove \
$line =~ s/\/\// /g;    # remove //
$line =~ s/\// /g;      # remove /
$line =~ s/\@/ /g;      # remove @
$line =~ s/>/ /g;       # remove >
$line =~ s/</ /g;       # remove <
$line =~ s/_/ /g;       # remove _
$line =~ s/\./\n/g;     # replace . with newline

Several of these transformations are questionable. We will see in the following sections how this choice affects the performance of SpiceMail.


2.4 Language identification

The language identification module is a separate program, which is invoked by the SpiceMail main script after preprocessing is done.

Test Corpus

We collected texts in English, French, German, Italian and Spanish from the Internet. A script tested the language identification module on this corpus. Two Spanish files exposed an implementation problem: the program does not terminate on these (larger than 500 kB) files. The results in table 1 show that the SpiceMail preprocessing contains some flaws: the program performs significantly better on original text than on preprocessed text. Unfortunately, this is all we can say, because we could not test the other modules, as we shall see below.

2.5 Spelling correction

This module is not operational. The corresponding lines in the Perl script are commented out, and the source code cannot be compiled; the clashing namespaces between the SLPtoolkit and GNU I/O libraries (type String) suggest that the SpiceMail team had to abandon this module. Furthermore, the outdated version of the SLPtoolkit library uses different names for procedures, which makes reuse impossible.

[lang]    [nfiles]  [nsucc pre]  [rate pre]  [nsucc nopre]  [rate nopre]
English   14        14           100         14             100
French    16        14           87.5        16             100
German    15        15           100         15             100
Italian   14        13           92.8        14             100
Spanish   8         6            75          6              75

Table 1: Performance of id langue on our language test corpus


2.6 Tagging

SpiceMail uses a rule-based tagger, with rules obtained by means of supervised learning. Their report does not document its performance. We could not evaluate this part because we did not have a hand-tagged French test corpus at the time.

2.7 Lemmatisation

This module cannot be compiled because it uses an old version of the SLPtoolkit library. As a consequence, making it reusable would have consumed much more of our limited resources than rewriting the routine. Testing the binary version was also out of reach, because the program contains a hard-coded reference to an SLPtoolkit lexicon file that has since been moved to another location.

2.8 Classification

This module seems to have been at an early development stage. The unknown-words problem was not addressed, the source code lacks comments, and we did not manage to compile it. We preferred not to rely on this component and decided to adapt existing code written by one of our team members.


2.9 Conclusion

We draw the following conclusions from the results discussed above:

• SpiceMail preprocessing is suboptimal. While punctuation symbols have to be treated by the parser, special characters like control codes or mathematical symbols can be handled by a lexical analyzer (commonly called a scanner).

• The ordering of modules is incorrect. While we could not explore the effects of the preprocessing routine in depth, we saw that the smallword technique showed reasonable results without any form of preprocessing.

• The language identification module is not robust: it crashes on large input files. Finding this error seemed possible, but since our resources were limited and the working environment failed to provide us with adequate tools, we chose to reimplement the module.

• The programs that cannot be compiled make reuse of code more costly than rewriting.

The last point shows that we were much too optimistic about code reuse in our specification: a complete reimplementation is necessary. The incomplete state of the syntax analysis modules for spelling correction and parsing led the SpiceMail team to use a separate (third-party) tagger and an additional lemmatizer program. All of these modules can be integrated into one syntax analysis program, which makes the streaming of information through the modules more efficient. Finally, we decided not to analyze the supervised learning and evaluation scripts, since our corpus was in a format significantly different from theirs. It should also be noted that the SpiceMail team had to construct an evaluation corpus themselves, which costs time and resources (in terms of programmers).


Figure 2: spoon architecture (an incoming email passes through language identification ("french?"), preprocessing, syntax analysis and classification to become a classified email; the teachspoon program iterates over the mail directory, processes every mail of every class, accumulates the words and teaches them to the classifier database)

3 The architecture of spoon

Our architecture is closely related to that of SpiceMail; however, there are some differences, as seen in figure 2. Figure 3 shows how the division into separate programs has changed. What is not shown is the module that parses mail header lines to separate the message header from the message body; this is achieved by a Python library module (the mimetools package).

[module]                 [SpiceMail file]   [spoon file]
Language identificator   id langue          idlang
Orthographic corrector   orthographe        reader
Tagger                   tagger             reader
Lemmatizer               lemmat             reader
Classifier               id theme           classif
Preprocessor             SpiceMail          spl
Driver (main script)     SpiceMail          spoon.py
Cutter                   n.a.               spoon.py

Figure 3: Modularization in SpiceMail and spoon
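The header/body split itself is a small piece of code. As a minimal sketch (using the modern standard email module rather than the obsolete mimetools package spoon.py relies on, and assuming a simple single-part message), it could look like this:

import email

def split_message(raw_mail):
    # Separate RFC 822 header lines from the message body.
    # spoon.py does this with the mimetools package; the email module
    # used here is only a modern stand-in for illustration.
    msg = email.message_from_string(raw_mail)
    headers = dict(msg.items())
    body = msg.get_payload()
    return headers, body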


Eng: the, and, to, of, a, in, was, his, that, I, he, as, had
Fre: de, la, le, et, des, les, du, je, en, un, que, a, qui
Ita: di, e, il, che, la, a, in, per, del, un, le, non, i
Spa: de, la, que, el, en, y, a, los, del, se, por, las, con
Ger: der, die, und, den, in, von, ist, zu, dem, des, für, mit, das

Table 2: Most frequent smallwords

4 Language Identification

As we saw in section 2.4, SpiceMail's language identification module does not terminate when given large files as input. This is worse than a misclassification and led us to include the point "debugging of id langue" in our specification. However, we encountered difficulties: the lack of a working, user-friendly debugger and the obscurity of the bug (an assignment to a GNU I/O library String object) led to a complete reimplementation of id langue. The task of this module is to decide whether a given text (the email body) is in French, English, German, Spanish or Italian. We implemented the smallword technique to do this. Rather than expressing it formally, we shall explain it in natural language. The essence of the technique is to count the occurrences of certain words from the languages among which we have to choose. We rely on the assumption that the most frequent words in European languages are rather short, which also keeps the pattern matching cheap. Each language is therefore represented by a set of small words (see table 2). Note that this table differs from the one presented in the SpiceMail report: we erased punctuation symbols, since they do not carry any information about the language.
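To illustrate the idea (the real idlang module is a generated flex scanner, see the implementation notes below), a minimal Python sketch of smallword counting, with word lists abridged from table 2, could look like this:

# Sketch of the smallword technique: count occurrences of language-typical
# function words and return the language with the highest count.
# The word lists are abridged from table 2; idlang itself is a generated
# flex scanner, not this Python code.
SMALLWORDS = {
    "english": {"the", "and", "to", "of", "a", "in", "was", "his", "that"},
    "french":  {"de", "la", "le", "et", "des", "les", "du", "je", "en"},
    "italian": {"di", "e", "il", "che", "la", "a", "in", "per", "del"},
    "spanish": {"de", "la", "que", "el", "en", "y", "a", "los", "del"},
    "german":  {"der", "die", "und", "den", "in", "von", "ist", "zu", "dem"},
}

def guess_language(text):
    tokens = text.lower().split()
    scores = {}
    for lang, words in SMALLWORDS.items():
        scores[lang] = sum(1 for t in tokens if t in words)
    # Return the language whose smallwords occur most often.
    return max(scores, key=scores.get)

Some smallwords (e.g. "de", "la", "a") belong to several languages; the technique tolerates this because only the totals are compared.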


Implementation

Since the smallword technique is fairly simple and based on plain character sequences (smallwords cannot be composed and do not depend on context), smallwords can be considered trivial regular expressions. We used Flex, a lexical analyzer generator widely used in the compiler context (see [6] for documentation), which translates a given set of rules into a finite state automaton that performs a certain action whenever it hits a token. In our case, tokens are smallwords and the action is incrementing a counter. To achieve a clean separation of program logic from program data, we created a script that transforms a collection of smallword data files into a binary executable program. Adding a new language is reduced to:

1. obtaining a smallword file for the language, and
2. adding the name of the language and a score counter to the description of the program.

Our script transforms these into a flex source, flex compiles it to C, and the GNU C compiler produces the resulting binary. Since we found that the smallword technique alone gave satisfying accuracy in significantly less time, we decided not to integrate the trigram technique. However, since trigrams also form regular expressions, there should be no problem in adapting the generator and thus replacing the smallword technique by this more accurate way of guessing languages.

5 Preprocessing

We apply some further preprocessing before scanning the input for tokens:

• detection of quotes: lines that start with "> " are called quotes; they contain the original text to which the sender responds. We separate new text from quotes by inserting paragraph delimiters.

• deletion of multiple blanks in some (correctly written) composed words, e.g. "parce   que" becomes "parce que".

The use of these treatments will become clear in the next section, when we discuss scanning and spelling correction.
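As a minimal Python sketch of these two treatments (the actual preprocessor, spl.flex, is a flex scanner; the delimiter convention below is an illustrative assumption):

import re

def preprocess(body):
    # Separate quoted lines ("> ...") from new text by inserting an empty
    # line, acting as a paragraph delimiter, at every quote boundary.
    out = []
    in_quote = False
    for line in body.splitlines():
        quoted = line.startswith(">")
        if quoted != in_quote:
            out.append("")          # paragraph delimiter
            in_quote = quoted
        out.append(line)
    text = "\n".join(out)
    # Collapse runs of blanks, e.g. "parce   que" -> "parce que".
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text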


6 Analysis - overview

We perform syntactical analysis in order to filter out the "right" features and pass them to our classifier. We shall discuss the choices we have made, along with a few examples. There are three levels:

• the scanner,
• the spelling corrector,
• the disambiguation module (tagger and lemmatizer).

We have chosen to integrate these three modules because they all rely on the SLPtoolkit lexicon, more precisely on the correct spelling of words, the morpho-syntactic categories, the probabilities and the lemmas.

7 Scanner

As we are interested in syntactical analysis, we have to define which sequences of characters should be treated as words. Surprisingly, the definition of a word is not as trivial as it might seem, since there are composed words, special forms of written language, and, in our context, even email-specific phenomena that do not correspond to a linguistic definition of a word. Handling these is the job of our scanner module. It detects tokens by means of regular expressions. Words, specified as an arbitrary sequence of characters of the French alphabet, are treated as tokens, but so are email addresses and uniform resource locators. Tables 3 and 4 show some of the tokens that are recognized. The implementation was done using Flex, which has been described earlier. Here again, separation of logic and data is important, not only to stay flexible with regard to changing token definitions, but also because the definition of non-lexical words could be useful in other contexts. Since we perform spelling correction, our definition of a word has to be coherent with the lexicon. As an example, if the lexicon considers "parce que" as one word, then a spelling correction of "parce" will return an unsatisfying result: "parce" is certainly not a French word, but there could be words similar to it that would be returned as a solution.
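For illustration, here is a rough Python equivalent of a few token rules from tables 3 and 4; the real scanner uses flex rules, and the patterns below are simplified assumptions, not the exact ones in the implementation:

import re

# Illustrative regular expressions for some email-specific tokens
# (cf. tables 3 and 4); the actual scanner uses equivalent flex rules.
TOKEN_PATTERNS = [
    ("URL",    re.compile(r"https?://\S+")),
    ("EMAIL",  re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("IP",     re.compile(r"\d{1,3}(\.\d{1,3}){3}")),
    ("SMILEY", re.compile(r"[:;8][-']?[)(DPp]")),
    ("WORD",   re.compile(r"[a-zàâçéèêëîïôùûüœ'-]+", re.IGNORECASE)),
]

def tokens(text):
    # Left-to-right scan: emit the first pattern that matches at the
    # current position, otherwise skip one character.
    pos = 0
    while pos < len(text):
        for name, pat in TOKEN_PATTERNS:
            m = pat.match(text, pos)
            if m:
                yield name, m.group(0)
                pos = m.end()
                break
        else:
            pos += 1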


URL            http://www.w3c.org/specs
email address  [email protected]fl.ch
IPv4 address   129.231.24.0
smileys        :-)  :)  8-D

Table 3: Email-specific tokens

8 Spelling Corrector

The spelling corrector is built on top of the scanner. It receives tokens and performs spelling correction only for those that could be in the lexicon. A technical detail is that tokens get "flattened" to achieve a clean interface to the disambiguation module (see below). We adapted the error-tolerant finite state recognizer engine described by Oflazer [3]. If a word is spelled correctly (or differs by only one upper-case/lower-case conversion), we return its (possibly several) meanings to the disambiguator. If not, we pass on a list of all words that are within edit distance 1. An example can be found in appendix B.
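The real corrector walks the lexicon automaton with Oflazer's error-tolerant algorithm; as a rough functional sketch of the same edit-distance-1 idea, a brute-force enumeration of candidates against a word set could look like this (the alphabet constant is an illustrative assumption):

# lexicon is assumed to be a Python set of correctly spelled words;
# the real engine traverses the SLPtoolkit lexicon automaton instead.
ALPHABET = "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü"

def edit_distance_one(word):
    # Every string reachable from `word` by one deletion, substitution
    # or insertion.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts  = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + replaces + inserts)

def correct(word, lexicon):
    # Return the word itself if it is known, otherwise all lexicon
    # entries within edit distance 1, as handed to the disambiguator.
    if word in lexicon:
        return [word]
    return sorted(edit_distance_one(word) & lexicon)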

9 Disambiguation

The goal of the disambiguation module is to eliminate ambiguous interpretations of the tokens produced by the lexical correction module. Once the disambiguation has been achieved, we perform lemmatizing and use the resulting lemmas for classification. The disambiguation is based on contextual information, which is modelled by a hidden Markov model (HMM). The context of a token is here defined as a certain number of neighbouring tokens inside the text. Using an HMM has the advantage of being well understood and of having efficient existing algorithms. The Viterbi algorithm is used for producing the disambiguation. This algorithm has linear complexity, i.e. O(n), where n is the length of the sequence of tokens to be analysed. A detailed description of the background and the algorithm can be found in appendix A.

prices          2.50
dates           1.1.2000, 07-14-2323
numbers         7, 10'000, 12.000
abbreviations   S.N.C.F., P.T.T.

Table 4: Common written text tokens

9.1 Design

The Analysis

The input to the analysis module is a sequence of ambiguous tokens. Such a sequence covers a whole paragraph. An ambiguous token is represented by

• the tag WORD together with a list of lexical entries, if the token has been found in the dictionary,
• the tag SPECIAL, if the token has not been found in the dictionary,
• the tag PAR, if the token indicates the end of a paragraph,
• the tag EOF, if the token indicates the end of the file.

Note that the definition of tokens needed to be adapted in the spelling corrector to stay consistent with the definition used in the corpus for supervised learning. As an example, "." is regarded as a word and later on tagged as a punctuation symbol. A complete paragraph, delimited by PAR or EOF, is analysed according to the Viterbi algorithm explained in appendix A. For a known word, the considered lexical categories are noun, verb, pronoun, adjective, determiner, adverb, apposition, conjunction, interjection, residual, punctuation, and unknown. For an unknown word, the HMM is used to predict its lexical category. To do that, the sequence w_{k-1} w_k is considered, where w_k is the unknown word. If the category of w_{k-1} is C_k, then the category C chosen for the unknown word w_k is the one that maximises the probability p(C | C_k). The overall probability thus becomes

    P(W^n | C^n) · P(C^n) = P(W^k | C^k) · P(C^k)
                            · prod_{i in known words}   P(w_i | C_i) · P(C_i | C_{i-k}^{i-1})
                            · prod_{i in unknown words} (1/N) · P(C_i | C_{i-k}^{i-1})

where 1/N is the probability of an unknown word given the category C_i. For simplicity, this probability is taken to be uniformly distributed, and the constant factor 1/N can be ignored when computing the score.

The Training

The lexical probabilities p(C_i | C_j) and the initial probabilities p(C_i | 0) have been estimated on a completely annotated corpus of the GRACE project (Grammars and Resources for Analysers of Corpora and their Evaluation); the training here is thus completely supervised. This corpus contains approximately 125'000 entries. The initial probabilities p(C_i | 0) have been calculated at the beginning of each sub-corpus. The probability of a word given its lexical category, p(w_k | C_i), has been estimated on a large corpus; it is furnished together with the SLPtoolkit [4] and is coded in the dictionary. Further details about the implementation can be found in appendix C.4.

10 Classifier

The classification algorithm used in this project is the Simple Bayes classifier, sometimes called Naive Bayes. During a supervised learning phase, it stores the conditional probability of each feature given a class as well as the probability of each class. In the following, an incoming message is represented as a feature vector X. To find the correct class, the algorithm ranks the alternative classes c_k with the evaluation function

    P(C = c_k | X = x) = P(X = x | C = c_k) · P(C = c_k) / P(X = x)

and assigns the message to the highest scoring class. The strong assumption of the Simple Bayes classifier is that, given a class C, each feature X_i (in our case the tokens composing the message) is independent of every other feature. This hypothesis is commonly called the "bag of words" approach. Under this assumption we have

    P(X = x | C = c_k) = prod_i P(X_i = x_i | C = c_k)

which can be computed as

    log P(X = x | C = c_k) = sum_i log P(X_i = x_i | C = c_k).

Sometimes a feature x of the message to classify is not present in a class C. In that case, the probability p(x|C) has to be estimated in a way that does not distort the computation: if we set it to zero, it nullifies the effect of all other probabilities in the product. To avoid this, we use the following hypothesis: had the classifier been trained with an infinite set of messages, the probability p(x|C) would converge to a small, non-zero value. In our implementation, we chose the value ε = 0.000001 for the probability of unknown words; see also section 12, where we discuss the evaluation.
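A compact sketch of this scoring scheme (class and method names are our own illustration; the real classif module differs in its interfaces):

import math

EPSILON = 1e-6   # probability assigned to features unseen in a class

class SimpleBayes:
    def __init__(self):
        self.doc_count = {}      # number of training mails per class
        self.feature_prob = {}   # P(X_i = x_i | C = c_k), one dict per class

    def learn(self, class_name, documents):
        # documents: a list of token lists belonging to class_name.
        counts, total = {}, 0
        for doc in documents:
            for tok in doc:
                counts[tok] = counts.get(tok, 0) + 1
                total += 1
        self.feature_prob[class_name] = {
            t: n / float(max(total, 1)) for t, n in counts.items()}
        self.doc_count[class_name] = len(documents)

    def classify(self, features):
        # Rank every class by log P(C) + sum_i log P(x_i | C).
        n_docs = float(sum(self.doc_count.values()))
        best_class, best_score = None, None
        for c, probs in self.feature_prob.items():
            score = math.log(self.doc_count[c] / n_docs)
            for f in features:
                score += math.log(probs.get(f, EPSILON))
            if best_score is None or score > best_score:
                best_class, best_score = c, score
        return best_class

The tables are filled during supervised learning by the teachspoon program described in the next section.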

11 The teachspoon program

This script performs the act of "teaching" spoon the different classes; this is where the supervised learning is done. It opens every file in a given directory (e.g. "/home/Mail") and transforms each file into a list of emails. Every list of emails is processed through the whole chain of language identification, preprocessing and syntax analysis modules, and the results are accumulated for each class. More precisely, we process every mail and append it to a temporary file, each mail separated by a line containing only a dash ("-"). Then we present the corpus of processed emails to the classifier module, with the filename as the name of the class to learn. After having done this for every class, supervised learning is finished and the classifier is ready to use. We have thus created the entire mechanism needed to automatically teach an entire mail directory to the spoon classifier. However, we adapted the routine to our evaluation corpus; to turn this prototype into a real application, one only needs to change the function that reads a mailbox file. Details are given in appendix C.
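In outline, and with process_mail and read_mailbox as placeholders for the real spoon modules, the teaching loop looks roughly like this:

import os

def teach(mail_dir, classifier, process_mail):
    # For every mailbox file in mail_dir, run each message through the
    # identification/preprocessing/analysis chain (process_mail) and
    # teach the accumulated token lists to the classifier under the
    # mailbox name.  process_mail and read_mailbox stand in for the
    # corresponding spoon modules.
    for name in os.listdir(mail_dir):
        mails = read_mailbox(os.path.join(mail_dir, name))
        documents = [process_mail(m) for m in mails]
        classifier.learn(name, documents)

def read_mailbox(path):
    # Naive mbox split: a new message starts with a "From " line.
    mails, current = [], []
    for line in open(path):
        if line.startswith("From ") and current:
            mails.append("".join(current))
            current = []
        current.append(line)
    if current:
        mails.append("".join(current))
    return mails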

12 spoon evaluation

12.1 Language guessing

We evaluated the performance of our language identification module separately, to compare it with SpiceMail's id langue. The results are given in table 5. The two Spanish files are now correctly classified. It should be noted that the results for emails cannot be directly compared to these, since

• emails can contain foreign words, and
• they can be very short and therefore may not contain any smallword at all.

We also evaluated language identification on our validation corpus, which gave a classification rate of 92 % (209 of 227 correct). Our explanation is that email messages can be very short and contain many smallwords borrowed from other languages.

12.2 Theme classification

We decided not to integrate the language identification module into our evaluation. It is clear that in practice the integration of idlang will lead to fewer correctly classified emails, and in a 100 % French environment a user benefits from disabling language guessing. As we are interested in the classification rate of spoon on these mails, we disabled the module.

Test Corpus

We have been provided a test corpus consisting of some 1400 mails in 4 classes. The mails were not uniformly distributed; the exact distribution is given in table 7. The original corpus was created by logging 4 newsgroups. To ensure the pertinence of our evaluation, we eliminated crossposts (mails that appear in several classes), thus creating a new corpus. Furthermore, we adapted our scanner, since some tokens had been removed by the providers of our corpus for privacy reasons and replaced by placeholders (table 6). The justification is that if the original tokens had been present we could have recognized them, so we can safely recognize the replacements instead. In the following, we have made the hypothesis that each of the 4 classes consists only of pertinent emails, thus neglecting emails that would not appear correctly classified to a human observer (e.g. test messages, off-topic messages etc.). Ensuring further pertinence by hand was not within reach with our limited resources.

[lang]     [nfiles]  [rate SpiceMail]  [time SpiceMail / sec]  [rate spoon]  [time spoon / sec]
English    14        100               213.20                  100           4.45
French     16        100               219.69                  100           5.24
German     15        100               219.90                  100           5.27
Italian    14        100               157.13                  100           4.69
Spanish1   6         75                69.15                   --            --
Spanish2   8         --                --                      100           3.68

Table 5: Comparison between SpiceMail and spoon lang. identification

names          NOM361
telephone no.  <telephone>
addresses      <removed>

Table 6: Corpus-specific tokens

corpus           class1  class2  class3  class4  total  perc.
original         1275    117     33      17      1442   --
pertinent        1274    110     27      16      1427   100 %
learning set     824     65      17      5       911    64 %
validation set   201     23      2       5       231    16 %
test set         249     22      8       6       285    20 %

Table 7: Evaluation corpus

Cross Validation

We decided to use cross-validation to ensure that we do not optimize our classifier in a way that is overly adapted to the information contained in the corpus; a sketch of the corpus partitioning is given below. The parameters are:

• the choice of features emitted by the syntax analysis module,
• the epsilon used for unknown words in the classifier.
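The partitioning itself is done by the createValidSet and createTestSet scripts; as a hedged sketch of the same step (proportions taken from table 7), each class can be shuffled and split as follows:

import random

def split_corpus(mails, seed=0):
    # Shuffle and partition one class into learning (64 %),
    # validation (16 %) and test (20 %) sets, roughly as in table 7.
    rng = random.Random(seed)
    mails = list(mails)
    rng.shuffle(mails)
    n = len(mails)
    n_learn = int(0.64 * n)
    n_valid = int(0.16 * n)
    return (mails[:n_learn],
            mails[n_learn:n_learn + n_valid],
            mails[n_learn + n_valid:])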

12.3 Results

We do not report all the measures obtained during testing and debugging, only that using no feature selection at all led to a catastrophic 7 %, while restricting the features to tokens of more than 3 characters nearly tripled the result to 19 %. After getting a working version of the disambiguation module we expected a dramatic change, but got only a minor gain, 25 %. This initiated a long search for the reason. After double-checking that this was not a bug, we finally discovered that our choice of ε, the score attached to unknown words in the classifier module, was too large (0.001). After setting ε to 0.000001, we arrived at a 95 % classification rate on the validation corpus. The second validation run showed a classification rate of 96.5 %, after tuning the feature extraction to reject "words" consisting of only one character. In the final performance measure, spoon performed at the same level, 96.1 %.

13 Conclusion

From our point of view, we succeeded in applying natural language processing techniques to a real-life problem. In the following, we review the points from our specification.

Software
We have delivered a working software package, as stated in our specification. One point we neglected is the segmentation feature; however, correct classification was our priority. The question of an automatic choice of ε remains open. Nevertheless, we achieved a final classification rate of 96.1 % on a real-life problem, which is rather satisfying.

Documentation
We supplied this report and a user manual (see Appendix C) in the source code package.

Source code
We have handed over a package containing all sources and our corpora, which makes it possible to reproduce the results we achieved. Documentation can be found in appendix C.


Acknowledgements
We would like to thank:

• everyone at the Programming Methods Laboratory for providing lots of coffee and a formidable working environment,
• Benjamin Barras and Roland Wuillemin (SIDI-EPFL) for helping us understand the SpiceMail Perl scripts,
• Federico Introvigne for identifying a non-pertinent (medieval) Italian text in our language test corpus,
• Romaric Besançon and Antoine Rosenknop (LIA-DI-EPFL) for their helpful comments, and finally
• our supervisors Jean-Cédric Chappelier and Martin Rajman (LIA-DI-EPFL) for providing us with this excellent challenge.

References
[1] Burak Emir, Samuel Fricker, Yacine Saidji, and Lionel-Gomez Sanchez. E-mail filter specification. May 2000.
[2] Julien Mercay, Alfonso Costa, Vojto Elias, Raphael Rossi, and Loic Samsson. SpiceMail, filtre automatique pour courrier electronique. June 1998.
[3] Kemal Oflazer. Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, 22(1), 1996.
[4] Jean-Cédric Chappelier. Introduction à la librairie SlpToolkit. http://liawww.epfl.ch/~chaps/cours-tal/IntroSlpToolkit.html
[5] James Allen. Natural Language Understanding. Benjamin/Cummings Publishing Company Inc., 1995, pp. 202 and 215.
[6] Aho, Sethi, Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986, chapter 3: Lexical Analysis.


A The Hidden Markov Model (HMM)

This appendix gives more background for section 9. We use the following notation:

    W^n   = w_1 ... w_n     a sequence of n words,
    C^n   = C_1 ... C_n     a sequence of n morpho-syntactic tags,
    W_i^n = w_i ... w_n     the subsequence of W^n starting at position i,
    C_i^n = C_i ... C_n     the subsequence of C^n starting at position i.

The HMM states that the optimal interpretation C^n for a given sentence W^n should be chosen such that P(C^n | W^n) is maximised over all C^n. According to Bayes' rule, we can write

    P(C^n | W^n) = P(W^n | C^n) · P(C^n) / P(W^n)

Since W^n is given, P(W^n) is a constant, so maximising P(C^n | W^n) amounts to maximising P(W^n | C^n) · P(C^n). In order to simplify the calculations, we make the following additional assumptions:

• Lexical conditioning is limited: P(w_i | C_1 ... C_n) = P(w_i | C_i).
• There is a horizon for syntactic dependence: P(C_i | C_1^{i-1}) = P(C_i | C_{i-k}^{i-1}).

Using these assumptions, instead of the full model we can express a hidden Markov chain of order k, which is to be maximised for a given sentence:

    P(W^n | C^n) · P(C^n) = P(W^k | C^k) · P(C^k) · prod_{i=k+1}^{n} P(w_i | C_i) · P(C_i | C_{i-k}^{i-1})

A.1 The Viterbi Algorithm

The Viterbi algorithm maximises P(C^n | W^n) using the assumptions described in the preceding paragraph. Its complexity is O(n), where n is the length of the token sequence to be analysed.


A.2 The Analysis

The inputs to the algorithm are:

• the word sequence W^n = w_1, ..., w_n,
• the lexical categories C_1, ..., C_m,
• the lexical probabilities p(C_i | C_j) and p(w | C_i).

The algorithm finds the most likely sequence of lexical categories based on bigram analysis. It uses two tables of size m · n:

• SEQSCORE(i, t), which records the probability of the best sequence up to position t that ends with a word in category C_i,
• BACKPTR(i, t), which records, for each category at each position t, the preceding category in the best sequence.

In pseudo-code, the algorithm proceeds as follows:

Initialisation
    For i = 1 to m do
        SEQSCORE(i, 1) = p(w_1 | C_i) · p(C_i | 0)
        BACKPTR(i, 1) = 0

Iteration step
    For t = 2 to n do
        For i = 1 to m do
            SEQSCORE(i, t) = max_{j=1..m} ( SEQSCORE(j, t-1) · p(C_i | C_j) ) · p(w_t | C_i)
            BACKPTR(i, t) = the j that gave the maximum above

Sequence identification step
    C(n) = the i that maximises SEQSCORE(i, n)
    For t = n-1 downto 1 do
        C(t) = BACKPTR(C(t+1), t+1)
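A direct Python transcription of this pseudo-code (the actual implementation lives in hmm.c; probability tables are represented here as nested dictionaries) might read:

def viterbi(words, categories, p_word, p_trans, p_init):
    # words:          w_1 ... w_n
    # categories:     the m lexical categories
    # p_word[c][w]    ~ p(w | C = c)
    # p_trans[cj][ci] ~ p(C_t = ci | C_{t-1} = cj)
    # p_init[c]       ~ p(c | 0)
    n, m = len(words), len(categories)
    seqscore = [[0.0] * (n + 1) for _ in range(m)]
    backptr  = [[0]   * (n + 1) for _ in range(m)]
    for i, c in enumerate(categories):                       # initialisation
        seqscore[i][1] = p_word[c].get(words[0], 0.0) * p_init.get(c, 0.0)
    for t in range(2, n + 1):                                # iteration step
        for i, ci in enumerate(categories):
            best_j, best = 0, 0.0
            for j, cj in enumerate(categories):
                s = seqscore[j][t - 1] * p_trans[cj].get(ci, 0.0)
                if s > best:
                    best_j, best = j, s
            seqscore[i][t] = best * p_word[ci].get(words[t - 1], 0.0)
            backptr[i][t] = best_j
    # sequence identification step
    tags = [0] * (n + 1)
    tags[n] = max(range(m), key=lambda i: seqscore[i][n])
    for t in range(n - 1, 0, -1):
        tags[t] = backptr[tags[t + 1]][t + 1]
    return [categories[tags[t]] for t in range(1, n + 1)]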


A.3 The Training

The training consists of estimating

• the lexical transition probabilities p(C_i | C_j),
• the initial probabilities p(C_i | 0),
• the probability of a word given a lexical category, p(w_k | C_i).

These probabilities can be estimated with the help of a large annotated corpus.
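As an illustration, relative-frequency estimation of the initial and transition probabilities from a tagged corpus (the supervised counterpart of teach_hmm.c; the data layout below is an assumption) can be sketched as:

def estimate_transitions(tagged_paragraphs):
    # tagged_paragraphs: list of paragraphs, each a list of (word, tag) pairs.
    # Returns (p_init, p_trans) as relative frequencies.
    init_counts, trans_counts, prev_counts = {}, {}, {}
    for para in tagged_paragraphs:
        if not para:
            continue
        first_tag = para[0][1]
        init_counts[first_tag] = init_counts.get(first_tag, 0) + 1
        for (_, prev_tag), (_, tag) in zip(para, para[1:]):
            trans_counts.setdefault(prev_tag, {})
            trans_counts[prev_tag][tag] = trans_counts[prev_tag].get(tag, 0) + 1
            prev_counts[prev_tag] = prev_counts.get(prev_tag, 0) + 1
    total_paras = float(sum(init_counts.values()))
    p_init = {t: c / total_paras for t, c in init_counts.items()}
    p_trans = {pt: {t: c / float(prev_counts[pt]) for t, c in counts.items()}
               for pt, counts in trans_counts.items()}
    return p_init, p_trans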


B Examples

This section shows a few examples of the modules working in isolation.

B.1 Lexicon lookup

Figure 4 shows the result of looking up the possible meanings of the words in the French sentence "le chat mange la souris". As we see, every word is found in the lexicon, but for "le", "mange", "la" and "souris" we have more than one possibility, one for each morpho-syntactic category to which the word can belong. The number denotes an index into the SLPtoolkit lexicon.

[141]in1sun18-bemir> echo " le chat mange la souris. " | justortho2
[le(297854),le(297853)]
chat
[mange(309989),mange(309988),mange(309987),mange(309986),mange(309985)]
[la(294337),la(294336),la(294335),la(294334)]
[souris(472354),souris(472353),souris(472352),souris(472351),souris(472350),souris(472349),souris(472348)]
.

Figure 4: Example for lookup in lexicon

B.2 Disambiguation module and lemmatizing

The disambiguation module chooses the right solution and finds the canonical form of a word, if possible. The result can be seen in figure 5. Note the correct lemmatisation of "souris", which could also have been a form of "sourire".

[142]in1sun18-bemir% echo " le chat mange la souris. " | reader
chat manger souris

Figure 5: Example for disambiguation and lemmatizing


B.3 Evaluation script

Figure 6 shows the output of the first successful run of our evaluation script, on June 23rd at 22:37, after hand-tuning the epsilon value to 0.000001.

[280]in1sun18-bemir% ../eval/evalspoon
This is evalspoon. We will evaluate several classification strategies
[reading Corpus in dir ../Valid_Corpus]
[loading file: ent.petites-annonces.immobilier done (197 mails)]
[loading file: ent.rh.stagiaire done (2 mails)]
[loading file: ent.rec.betisier done (5 mails)]
[loading file: ent.comp.os.windows done (23 mails)]
[evaluating performance (227 msgs) bad result for tarifs at position 0 done]
[evaluating 227 mails took 1602.633396 sec.]
classification rate: 0.951542 ( 216 of 227 correct )
how many times "Classification_Failed" : 0.000000 (0 of 227)
bye.

Figure 6: Result of 1st validation run

This is evalspoon. We will evaluate several classification strategies
[reading Corpus in dir ../Valid_Corpus]
[loading file: ent.petites-annonces.immobilier done (197 mails)]
[loading file: ent.rh.stagiaire done (2 mails)]
[loading file: ent.rec.betisier done (5 mails)]
[loading file: ent.comp.os.windows done (23 mails)]
[evaluating performance (227 msgs) done]
[evaluating 227 mails took 1110.756479 sec.]cla
how many times "Classification_Failed" : 0.000000 (0 of 227)
bye.

Figure 7: Result of 2nd validation run


# --------------------------------------------------- FINAL MEASURE
# ------------------------------- evaluating on test set
../eval/evalspoon ../Corpus/Final_Corpus
This is evalspoon. We will evaluate several classification strategies
[reading Corpus in dir ../Corpus/Final_Corpus]
[loading file: ent.petites-annonces.immobilier done (248 mails)]
[loading file: ent.rh.stagiaire done (8 mails)]
[loading file: ent.comp.os.windows done (22 mails)]
[loading file: ent.rec.betisier done (6 mails)]
[evaluating performance (284 msgs) done]
[evaluating 284 mails took 1511.455146 sec.]cla
how many times "Classification_Failed" : 0.000000 (0 of 284)
bye.

Figure 8: Result of final test run


C Code documentation

C.1 README.txt

spoon
-----
This file (README.txt) contains instructions to build and test spoon.
For the user manual: have a look at doc/MANUAL.txt

How to build:
-------------
type:
  >gmake
in this directory. All modules will be built and moved to the bin directory.
  >cd bin

To run a demonstration of the testing framework, run
  >gmake small
This will train spoon on Corpus_small_learn and evaluate its performance on
Corpus_small_valid. This says nothing about the classification rate; it just
shows that our program and the evaluation framework actually work.

To reproduce our results from the report, type:
  >gmake eval
This will train spoon on Learn_Corpus and evaluate on Valid_Corpus.

To evaluate performance on the test set, type
  >gmake finalset


C.2 files.txt

This file contains only a list of files and directories; please see README
for instructions for building and testing.

this directory -------------------------------------------------------------
files.txt            - this file
Makefile             - our main Makefile
bin                  - contains binaries
share                - shared programs
main.src             - sources for main script, teaching script
idlang.src           - sources for idlang
preproc.src          - sources for preproc module
reader.src           - sources for syntax analyzer
classifier           - sources for classifier
data.smallwords.new  - contains data needed to build idlang module
eval                 - contains helper scripts for evaluation
test.data            - some testing, debugging data (emails)

Corpus/ directory ------------------------------------------------------------
Learn_Corpus         - LEARNING SET
Valid_Corpus         - VALIDATION SET
Final_Corpus         - TEST set
Eval_Corpus          - LEARN+TRAIN set

./data.smallwords.new: -------------------------------------------------------
eng.sw               - smallwords for English
fre.sw
ger.sw
ita.sw
spa.sw

./eval: ----------------------------------------------------------------------
createTestSet        - create Test and LearnValid Set
createValidSet       - Learning and Validation Set
evalspoon            - script that computes classification rate
evallang             - "                       for language identification
justload.py          - helper script to load corpus file
killcrossp.py        - helper script to eliminate crossposts

./idlang.src: ----------------------------------------------------------------
idlang_builder.py      - script that builds the flex source file
idlang_builder_conf.py - contains the description of the flex program
Makefile               - makefile

./main.src: ------------------------------------------------------------------
Makefile             - for testing only
readMailbox.py       - helper script that reads one mailbox file
spoon.py             - our main script
teachspoon           - symbolic link to teachspoon.py
teachspoon.py        - the training script

./preproc.src: ---------------------------------------------------------------
Makefile             - Makefile
spl.flex             - our preprocessor module

./reader.src: ----------------------------------------------------------------
Makefile             - Makefile
__old                - contains old versions of programs
hmm.c                - hmm routines
teach_hmm.c          - teacher HMM source
tokendef.h           - definitions of tokens (not all are used)
token.c              - token representation, mem. management
token.h              - "                    header
slp_corr2.c          - sources for orthocorrection
slp_corr2.h          - "                   header
edit-dist-inc.c      - functions to calculate Levenshtein distance
hmmpp.flex           - scanner source
main.c               - main source
corpus-grace.txt     - tagged corpus
table-cms            - contains names of tags

./share: ---------------------------------------------------------------------
child_process.py     - UNIX named pipes
crossvalid.py        - partition
nlpmail.py           - internal representation of one email message
shuffle.py           - shuffles a list

./test.data: <several test messages>


C.3 MANUAL.txt

spoon user manual
-----------------
spoon can only be used in the DI environment at EPFL, since it relies on the
SLPtoolkit installation. In the following, we assume that spoon has been built
following the instructions given in the appendix of our report.

Teaching spoon
--------------
Before spoon can be used, it needs to be trained*.

If your Mail directory is /home/username/Mail, type:

> teachspoon /home/username/Mail

If you have three mailbox folders "myMailboxfile", "loveletters", and "flame",
you should get output like this:

This is teachspoon. I will teach the spoon program how to classify your mail.
[loading file: myMailboxfile done (2 mails)]
teaching class "myMailboxfile"
[start processing (2 mails)]
[processing msg bodies done]
[passing words to classifier module]
[loading file: loveletters done (1 mails)]
teaching class "loveletters"
[start processing (2 mails)]
[processing msg bodies done]
[passing words to classifier module]
[loading file: flame done (1 mails)]
teaching class "flame"
[start processing (2 mails)]
[processing msg bodies done]
[passing words to classifier module]
[teaching 6 mails took 35 sec.]
your biggest class is myMailboxfile1 with 2 mails ( p(C)= 0.333333 )
done with teaching. Enjoy your new email filter !

Using spoon
-----------
Now add this line to your "/home/username/.procmailrc" file; it will cause
incoming mail to be processed by spoon:

|/home/username/spoondir/bin/spoon.py

Voila ! Your email should now be automatically classified. If at any time you
wish to change your classes, make the changes to the files in your mail
directory and be sure to rerun teachspoon so that spoon learns the new
classification of your mail.

REMARK: * = an input filter specific to the mail client in use has to be
written. This means spoon is not operational yet.

C.4 Implementation of the disambiguation module

The disambiguation module is implemented in C and closely linked to the scanner and lexical corrector. It consists of the following files:

main.c
• This module initialises the scanner, the lexical corrector and the disambiguation module.
• It collects the sequence of words given on the standard input and processed by the scanner and lexical corrector. The sequence for a complete paragraph is then transmitted to the HMM module. From the disambiguated sequence of tokens, only nouns, verbs and adjectives are selected and written to the standard output. After that, the collection of words repeats until the end of the file to be processed is encountered.
• Finally, the three sub-modules are finalised.

hmm.c
• This module defines the data structures for the HMM (called the knowledge base) and for the Viterbi analysis, together with the functions for manipulating them.
• The function teach_model estimates the lexical probabilities from a fully annotated corpus.
• The function analyse_text implements the Viterbi algorithm for disambiguating a paragraph.

D Further reading

We have found many pointers to documents related to our report. Here is a list of them:

• G. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61:268-278, March 1973.
• Stan Kulikowski. Using short words: a language identification algorithm. Unpublished technical report, 1991.
• Gregory Grefenstette. Comparing two language identification schemes. Proceedings of the 3rd International Conference on the Statistical Analysis of Textual Data (JADT'95), Rome, Italy, Dec. 1995.
• Richard Duda, Peter Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
• D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, 1992.
• Irving John Good. The Estimation of Probabilities: An Essay on Modern Bayesian Methods. M.I.T. Press, 1965.
• Eric Brill. A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing. Association for Computational Linguistics, 1992.
• Adda, Blache, Mariani, Paroubek, Rajman. Action GRACE: mise en place du paradigme d'évaluation, application au domaine de l'analyse morpho-syntaxique. In Actes de la Conférence sur le Traitement du Langage Naturel (TALN'95), 1995.

