Undergraduate Thesis




Open Domain Factoid Question
Answering System
By Amiya Patanaik

Thesis submitted in partial fulfilment of the
Requirements for the degree of Bachelor of Technology (Honours)

Under the supervision of

Dr. Sudeshna Sarkar
Professor, Department of Computer Science

KHARAGPUR – 721302

MAY 2009


Department of Electrical Engineering
Indian Institute of Technology


This is to certify that the thesis entitled Open Domain Factoid Question
Answering System is a bonafide record of authentic work carried out by Mr. Amiya
Patanaik under my supervision and guidance for the fulfilment of the requirement for the
award of the degree of Bachelor of Technology (Honours) at the Indian Institute of
Technology, Kharagpur. The work incorporated in this thesis has not been, to the best of my
knowledge, submitted to any other University or Institute for the award of any degree or
diploma.

Dr. Sudeshna Sarkar (Guide)
Professor, Department of Computer Science Date :
Indian Institute of Technology – Kharagpur Place : Kharagpur

Dr. S K Das (Co-guide)
Professor, Department of Electrical Engineering Date :
Indian Institute of Technology – Kharagpur Place : Kharagpur


I express my sincere gratitude and indebtedness to my guide, Dr. Sudeshna Sarkar,
under whose esteemed guidance and supervision this work has been completed. This
project work would have been impossible to carry out without her advice and support.

I would also like to express my heartfelt gratitude to my co-guide Dr. S. K. Das and all
the professors of Electrical and Computer Science Engineering Department for all the
guidance, education and necessary skill set they have endowed me with, throughout my
years of graduation.

Last but not least, I would like to thank my friends for their help during the course
of my work.


Amiya Patanaik
Department of Electrical Engineering
IIT Kharagpur - 721302


Dedicated to
my parents and friends


A question answering (QA) system provides direct answers to user questions by
consulting its knowledge base. Since the early days of artificial intelligence in the 1960s,
researchers have been fascinated with answering natural language questions. However,
the difficulty of natural language processing (NLP) has limited the scope of QA to
domain-specific expert systems. In recent years, the combination of web growth,
improvements in information technology, and the explosive demand for better
information access has reignited the interest in QA systems. The wealth of information
on the web makes it an attractive resource for seeking quick answers to simple, factual
questions such as “who was the first American in space?” or “what is the second tallest
mountain in the world?” Yet today’s most advanced web search services (e.g., Google,
Yahoo, MSN live search and AskJeeves) make it surprisingly tedious to locate answers to
such questions. Question answering aims to develop techniques that can go beyond the
retrieval of relevant documents in order to return exact answers to natural language
factoid questions, such as “Who is the first woman to be in space?”, “Which is the largest
city in India?”, and “When was the First World War fought?”. Answering natural language
questions requires more complex processing of text than employed by current
information retrieval systems.

This thesis investigates a number of techniques for performing open-domain factoid
question answering. We have developed an architecture that augments existing search
engines so that they support natural language question answering, and that can also
use a local corpus as a knowledge base. Our system currently supports document
retrieval from Google and Yahoo via their public search engine application
programming interfaces (APIs). We assumed that all the information required to
produce an answer exists in a single sentence and followed a pipelined approach
to the problem. The stages in the pipeline include automatically constructed
question type analysers based on various classifier models, document retrieval, passage
extraction, phrase extraction, and sentence and answer ranking. We developed and analyzed
different sentence and answer ranking algorithms, starting with simple ones that
employ surface matching text patterns and moving to more complicated ones using root
words, part of speech (POS) tags and sense similarity metrics. The thesis also presents a
feasibility analysis of using our system in real time QA applications.


Chapter 1: Introduction
1.1 History of Question Answering Systems
1.2 Architecture
1.3 Question answering methods
1.3.1 Shallow
1.3.2 Deep
1.4 Issues
1.4.1 Question classes
1.4.2 Question processing
1.4.3 Context and QA
1.4.4 Data sources for QA
1.4.5 Answer extraction
1.4.6 Answer formulation
1.4.7 Real time question answering
1.4.8 Multi-lingual (or cross-lingual) question answering
1.4.9 Interactive QA
1.4.10 Advanced reasoning for QA
1.4.11 User profiling for QA
1.5 A generic framework for QA
1.6 Evaluating QA Systems
1.6.1 End-to-End Evaluation
1.6.2 Mean Reciprocal Rank
1.6.3 Confidence Weighted Score
1.6.4 Accuracy and coverage
1.6.5 Traditional Metrics – Recall and Precision
Chapter 2: Question Analysis
2.1 Determining the Expected Answer Type
2.1.1 Question Classes
2.1.2 Manually Constructed rules for question classification
2.1.3 Fully Automatically Constructed Classifiers
2.1.4 Support Vector Machines
2.1.5 Kernel Trick
2.1.6 Naive Bayes Classifier
2.1.7 Datasets
2.1.8 Features
2.1.9 Entropy and Weighted Feature Vector
2.1.10 Experiment Results
2.2 Query Formulation
2.2.1 Stop word for IR query formulation
Chapter 3: Document Retrieval
3.1 Retrieval from local corpus
3.1.1 Ranking function
3.1.2 Okapi BM25
3.1.3 IDF Information Theoretic Interpretation
3.2 Information retrieval from the web
3.2.1 How many documents to retrieve?
Chapter 4: Answer Extraction
4.1 Sentence Ranking
4.1.1 WordNet
4.1.2 Sense/Semantic Similarity between words
4.1.2 Sense Net ranking algorithm
Chapter 5: Implementation and Results
5.1 Results
5.2 Comparisons with other Web Based QA Systems
5.3 Feasibility of the system to be used in real time environment
5.4 Conclusion

APPENDIX A : Web Based Question Set
APPENDIX B : Implementation Details


List of figures and tables
Figures

Fig.1.1: A generic framework for question answering
Fig.1.2: Sections of a document collection as used for IR evaluation.
Fig.2.1: The kernel trick
Fig.2.2: Various feature sets extracted from a given question and its
corresponding part of speech tags.
Fig.2.3: Question type classifier performance
Fig.2.4: JAVA Question Classifier
Fig.3.1: Document retrieval framework
Fig.3.2: %coverage vs. rank
Fig.3.3: %coverage vs. average processing time
Fig.4.1: Fragment of WordNet taxonomy
Fig.4.2: A sense network formed between a sentence and a query
Fig.4.3: A sample run for the question “Who performed the first human heart
Fig.5.1: Various modules of the QA system along with each one's basic task
Fig.9.2: Comparison with other web based QA systems
Fig.9.3: Time distribution of each module involved in QA

Tables

Table 1.1: Coarse and fine grained question categories.
Table 2.1: Performance of various query expansion modules implemented on
Table 3.1: %coverage and average processing time at different ranks
Table 5.1: Performance of the system on the web question set

Chapter1. Introduction
In information retrieval, question answering (QA) is the task of automatically answering
a question posed in natural language. To find the answer to a question, a QA computer
program may use either a pre-structured database or a collection of natural language
documents (a text corpus such as the World Wide Web or some local collection).
QA research attempts to deal with a wide range of question types including: fact, list,
definition, How, Why, hypothetical, semantically-constrained, and cross-lingual
questions. Search collections vary from small local document collections, to internal
organization documents, to compiled newswire reports, to the World Wide Web.
* Closed-domain question answering deals with questions under a specific domain
(for example, medicine or automotive maintenance), and can be seen as an easier task
because NLP systems can exploit domain-specific knowledge frequently formalized in
ontologies.
* Open-domain question answering deals with questions about nearly everything, and
can only rely on general ontologies and world knowledge. On the other hand, these
systems usually have much more data available from which to extract the answer.
(Alternatively, closed-domain might refer to a situation where only limited types of
questions are accepted, such as questions asking for descriptive rather than procedural
information.)
QA is regarded as requiring more complex natural language processing (NLP)
techniques than other types of information retrieval such as document retrieval, thus
natural language search engines are sometimes regarded as the next step beyond
current search engines.
1.1 History of Question Answering Systems
Some of the early AI systems were question answering systems. Two of the most famous
QA systems of that time are BASEBALL and LUNAR, both of which were developed in
the 1960s. BASEBALL answered questions about the US baseball league over a period of
one year. LUNAR, in turn, answered questions about the geological analysis of rocks
returned by the Apollo moon missions. Both QA systems were very effective in their
chosen domains. In fact, LUNAR was demonstrated at a lunar science convention in
1971 and it was able to answer 90% of the questions in its domain posed by people
untrained on the system. Further restricted-domain QA systems were developed in the
following years. The common feature of all these systems is that they had a core
database or knowledge system that was hand-written by experts of the chosen domain.
Other early AI systems also included question-answering abilities. Two of the most
famous are SHRDLU and ELIZA. SHRDLU simulated the operation of a
robot in a toy world (the "blocks world"), and it offered the possibility to ask the robot
questions about the state of the world. Again, the strength of this system was the choice
of a very specific domain and a very simple world with rules of physics that were easy
to encode in a computer program. ELIZA, in contrast, simulated a conversation with a
psychologist. ELIZA was able to converse on any topic by resorting to very simple rules
that detected important words in the person's input. It had a very rudimentary way to
answer questions, and on its own it led to a series of chatterbots such as the ones that
participate in the annual Loebner prize.
The 1970s and 1980s saw the development of comprehensive theories in computational
linguistics, which led to the development of ambitious projects in text comprehension
and question answering. One example of such a system was the Unix Consultant (UC), a
system that answered questions pertaining to the Unix operating system. The system
had a comprehensive hand-crafted knowledge base of its domain, and it aimed at
phrasing the answer to accommodate various types of users. Another project was
LILOG, a text-understanding system that operated on the domain of tourism
information in a German city. The systems developed in the UC and LILOG projects
never went past the stage of simple demonstrations, but they helped the development
of theories on computational linguistics and reasoning.
In the late 1990s the annual Text Retrieval Conference (TREC) included a question-
answering track which has been running until the present. Systems participating in this
competition were expected to answer questions on any topic by searching a corpus of
text that varied from year to year. This competition fostered research and development
in open-domain text-based question answering. The best system of the 2004
competition answered 77% of the fact-based questions correctly.
In 2007 the annual TREC included a blog data corpus for question answering. The blog
data corpus contained both "clean" English and noisy text that includes badly
formed English and spam. The introduction of noisy text moved question answering
to a more realistic setting. Real-life data is inherently noisy as people are less careful
when writing in spontaneous media like blogs. In earlier years the TREC data corpus
consisted of only newswire data that was very clean.
An increasing number of systems include the World Wide Web as one more corpus of
text. Currently there is an increasing interest in the integration of question answering
with web search. Ask.com is an early example of such a system, and Google and
Microsoft have started to integrate question-answering facilities in their search engines.
One can only expect to see an even tighter integration in the near future.
1.2 Architecture
The first QA systems were developed in the 1960s and they were basically natural-
language interfaces to expert systems that were tailored to specific domains. In
contrast, current QA systems use text documents as their underlying knowledge source
and combine various natural language processing techniques to search for the answers.
Current QA systems typically include a question classifier module that determines the
type of question and the type of answer. After the question is analyzed, the system
typically uses several modules that apply increasingly complex NLP techniques to a
gradually reduced amount of text. Thus, a document retrieval module uses search
engines to identify the documents or paragraphs in the document set that are likely to
contain the answer. Subsequently a filter preselects small text fragments that contain
strings of the same type as the expected answer. For example, if the question is "Who
invented Penicillin?" the filter returns text that contains names of people. Finally, an
answer extraction module looks for further clues in the text to determine if the answer
candidate can indeed answer the question.
1.3 Question answering methods
QA is very dependent on a good search corpus, for without documents containing the
answer there is little any QA system can do. It thus makes sense that larger collection
sizes generally lend themselves well to better QA performance, unless the question domain is
orthogonal to the collection. The notion of data redundancy in massive collections, such
as the web, means that nuggets of information are likely to be phrased in many different
ways in differing contexts and documents, leading to two benefits:
(1) By having the right information appear in many forms, the burden on the QA
system to perform complex NLP techniques to understand the text is lessened.
(2) Correct answers can be filtered from false positives by relying on the correct
answer to appear more times in the documents than instances of incorrect ones.
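Benefit (2) can be illustrated with a minimal frequency-voting sketch. The function name `vote` and the candidate strings are assumptions made for this illustration; they are not taken from the system described in this thesis.

```python
from collections import Counter

def vote(candidates):
    """Rank candidate answers by how often they were extracted."""
    # Normalise case and surrounding whitespace so that trivially
    # different spellings of the same answer are counted together.
    counts = Counter(c.strip().lower() for c in candidates if c.strip())
    return counts.most_common()

# Hypothetical candidates extracted from several retrieved documents.
candidates = ["Alan Shepard", "alan shepard", "John Glenn", "Alan Shepard "]
ranked = vote(candidates)
best_answer, support = ranked[0]
```

The answer that appears most often across documents is ranked first, so an isolated incorrect candidate is outvoted.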
1.3.1 Shallow
Some methods of QA use keyword-based techniques to locate interesting passages and
sentences from the retrieved documents and then filter based on the presence of the
desired answer type within that candidate text. Ranking is then done based on syntactic
features such as word order or location and similarity to query.
When using massive collections with good data redundancy, some systems use
templates to find the final answer in the hope that the answer is just a reformulation of
the question. If you posed the question "What is a dog?", the system would detect the
substring "What is a X" and look for documents which start with "X is a Y". This often
works well on simple "factoid" questions seeking factual tidbits of information such as
names, dates, locations, and quantities.
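The template idea above can be sketched with regular expressions. This is only an illustration under simple assumptions: the pattern names, the function `find_definitions` and the sample sentences are hypothetical, and matching is done against sentence beginnings rather than whole documents.

```python
import re

# Question template "What is a X?" (also accepts "an").
QUESTION_TEMPLATE = re.compile(r"^what is an? (?P<x>.+?)\??$", re.IGNORECASE)

def find_definitions(question, sentences):
    """Return the "Y" parts of sentences that start with "X is a Y"."""
    match = QUESTION_TEMPLATE.match(question.strip())
    if match is None:
        return []
    x = re.escape(match.group("x"))
    # Answer template "(A) X is a Y", anchored at the start of the sentence.
    answer_template = re.compile(
        rf"^(?:an? )?{x} is (?P<y>an? [^.]+)", re.IGNORECASE
    )
    answers = []
    for sentence in sentences:
        hit = answer_template.match(sentence.strip())
        if hit:
            answers.append(hit.group("y"))
    return answers

sentences = [
    "A dog is a domesticated mammal.",
    "The dog is barking.",
    "Cats are animals.",
]
definitions = find_definitions("What is a dog?", sentences)
```

Only the first sentence follows the "X is a Y" reformulation, so it is the only one that yields a candidate answer.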
1.3.2 Deep
However, in the cases where simple question reformulation or keyword techniques will
not suffice, more sophisticated syntactic, semantic and contextual processing must be
performed to extract or construct the answer. These techniques might include
named-entity recognition, relation detection, coreference resolution, syntactic alternations,
word sense disambiguation, logic form transformation, logical inferences (abduction)
and commonsense reasoning, temporal or spatial reasoning and so on. These systems
will also very often utilize world knowledge that can be found in ontologies such as
WordNet, or the Suggested Upper Merged Ontology (SUMO) to augment the available
reasoning resources through semantic connections and definitions.
More difficult queries such as Why or How questions, hypothetical postulations,
spatially or temporally constrained questions, dialog queries, badly-worded or
ambiguous questions will all need these types of deeper understanding of the question.
Complex or ambiguous document passages likewise need more NLP techniques applied
to understand the text.
Statistical QA, which introduces statistical question processing and answer extraction
modules, is also growing in popularity in the research community. Many of the lower-
level NLP tools used, such as part-of-speech tagging, parsing, named-entity detection,
sentence boundary detection, and document retrieval, are already available as
probabilistic applications.
The AQ (Answer Questioning) methodology introduces a working cycle to the QA methods.
This method may be used in conjunction with any of the known or newly founded
methods. The AQ method may be used upon perception of a posed question or answer. The
means by which it is utilized can be manipulated beyond its primary usage; however,
the primary usage is taking an answer and questioning it, turning that very answer into
a question. Example: A "I like sushi." Q "(Why do) I like sushi(?)" A "The flavor." Q "(What
about) the flavor of sushi (do) I like?" Inadvertently, this may unveil different methods
of thinking and perception as well. While most would agree that this seems to be the
end-all stratagem, it is only a starting point with endless possibilities. Any number of
question methods may be used to derive the number of WHY, as in A = ∞(Q): the answer
may yield any number of questions to be asked, thereby unveiling an ongoing process
constantly being reborn into the research being performed. The QA methodology
utilizes just the opposite, where 1(Q) = ((∞(A)-∞) = 1(A): supposedly there is only one
true answer in reality; everything else is perception or plausibility. Utilized alongside
other forms of communication, debate may be greatly improved. Even this methodology
should be questioned.
1.4 Issues
In 2002 a group of researchers wrote a roadmap of research in question answering. The
following issues were identified.
1.4.1 Question classes
Different types of questions require the use of different strategies to find the answer.
Question classes are arranged hierarchically in taxonomies.

1.4.2 Question processing
The same information request can be expressed in various ways - some interrogative,
some assertive. A semantic model of question understanding and processing is needed,
one that would recognize equivalent questions, regardless of the speech act or of the
words, syntactic inter-relations or idiomatic forms. This model would enable the
translation of a complex question into a series of simpler questions, would identify
ambiguities and treat them in context or by interactive clarification.
1.4.3 Context and QA
Questions are usually asked within a context and answers are provided within that
specific context. The context can be used to clarify a question, resolve ambiguities or
keep track of an investigation performed through a series of questions.
1.4.4 Data sources for QA
Before a question can be answered, it must be known what knowledge sources are
available. If the answer to a question is not present in the data sources, no matter how
well we perform question processing, retrieval and extraction of the answer, we shall
not obtain a correct result.
1.4.5 Answer extraction
Answer extraction depends on the complexity of the question, on the answer type
provided by question processing, on the actual data where the answer is searched, on
the search method and on the question focus and context. Given that answer processing
depends on such a large number of factors, research for answer processing should be
tackled with a lot of care and given special importance.
1.4.6 Answer formulation
The result of a QA system should be presented in a way as natural as possible. In some
cases, simple extraction is sufficient. For example, when the question classification
indicates that the answer type is a name (of a person, organization, shop or disease, etc),
a quantity (monetary value, length, size, distance, etc) or a date (e.g. the answer to the
question "On what day did Christmas fall in 1989?") the extraction of a single datum is
sufficient. For other cases, the presentation of the answer may require the use of fusion
techniques that combine the partial answers from multiple documents.

1.4.7 Real time question answering
There is a need for developing QA systems that are capable of extracting answers
from large data sets in several seconds, regardless of the complexity of the question, the
size and multitude of the data sources or the ambiguity of the question.
1.4.8 Multi-lingual (or cross-lingual) question answering
Multi-lingual QA is the ability to answer a question posed in one language using an
answer corpus in another language (or even several). This allows users to consult
information that they cannot use directly.
1.4.9 Interactive QA
It is often the case that the information need is not well captured by a QA system, as
the question processing part may fail to classify the question properly, or the
information needed for extracting and generating the answer is not easily retrieved. In
such cases, the questioner might want not only to reformulate the question, but also
to have a dialogue with the system.
1.4.10 Advanced reasoning for QA
More sophisticated questioners expect answers which are outside the scope of
written texts or structured databases. To upgrade a QA system with such capabilities,
we need to integrate reasoning components operating on a variety of knowledge bases,
encoding world knowledge and common-sense reasoning mechanisms as well as
knowledge specific to a variety of domains.
1.4.11 User profiling for QA
The user profile captures data about the questioner, comprising context data, domain
of interest, reasoning schemes frequently used by the questioner, common ground
established within different dialogues between the system and the user etc. The profile
may be represented as a predefined template, where each template slot represents a
different profile feature. Profile templates may be nested one within another.

1.5 A generic framework for QA
The majority of current question answering systems designed to answer factoid
questions consist of three distinct components:
1. question analysis,
2. document or passage retrieval and finally
3. answer extraction.
While these basic components can be further subdivided into smaller components like
query formation and document pre-processing, a three component architecture
describes the approach taken to building QA systems in the wider literature.

Fig.1.1: A generic framework for question answering.

It should be noted that while the three components address completely separate
aspects of question answering it is often difficult to know where to place the boundary
of each individual component. For example the question analysis component is usually
responsible for generating an IR query from the natural language question which can
then be used by the document retrieval component to select a subset of the available
documents. If, however, an approach to document retrieval requires some form of
iterative process to select good quality documents which involves modifying the IR
query, then it is difficult to decide if the modification should be classed as part of the
question analysis or document retrieval process.
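The three-component architecture described in this section can be sketched as three small functions chained together. Everything here is a toy assumption for illustration (the stop word list, the keyword-overlap retrieval, the capitalised-name heuristic and the sample corpus); it is not the implementation developed in this thesis.

```python
import re
from typing import List, Tuple

STOP_WORDS = {"who", "what", "when", "where", "which",
              "is", "was", "the", "a", "an", "of", "in"}

def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def analyse_question(question: str) -> Tuple[str, List[str]]:
    """Component 1: guess the answer type and build an IR query."""
    words = tokens(question)
    answer_type = "HUM" if words and words[0] == "who" else "OTHER"
    query = [w for w in words if w not in STOP_WORDS]
    return answer_type, query

def retrieve(query: List[str], corpus: List[str], n: int = 2) -> List[str]:
    """Component 2: keep the top n documents sharing words with the query."""
    def overlap(doc: str) -> int:
        return len(set(query) & set(tokens(doc)))
    ranked = sorted(corpus, key=overlap, reverse=True)
    return [doc for doc in ranked[:n] if overlap(doc) > 0]

def extract_answers(answer_type: str, query: List[str],
                    passages: List[str]) -> List[str]:
    """Component 3: return strings of the expected answer type."""
    if answer_type != "HUM":
        return []
    answers = []
    for passage in passages:
        # Toy heuristic: a person name is two or more capitalised words.
        for name in re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", passage):
            if not any(w in name.lower() for w in query):
                answers.append(name)
    return answers

corpus = [
    "Penicillin was discovered by Alexander Fleming in 1928.",
    "The Apollo missions returned lunar rocks.",
    "Kharagpur is a city in India.",
]
answer_type, query = analyse_question("Who invented penicillin?")
passages = retrieve(query, corpus)
answers = extract_answers(answer_type, query, passages)
```

The sketch also shows why the boundaries are blurry: the query is produced by question analysis but consumed, and possibly refined, by document retrieval.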
1.6 Evaluating QA Systems
Evaluation is a highly subjective matter when dealing with NLP problems. It is always
easier to evaluate when there is a clearly defined answer; unfortunately, for most
natural language tasks there is no single answer. A rather impractical and tedious
way of doing this could be to manually search an entire collection of text and mark the
relevant documents. Then the queries can be used to make an evaluation based on
precision and recall. But this is not feasible even for the smallest of document
collections, and with corpora the size of AQUAINT, with approximately 1,000,000
articles, it is next to impossible.
1.6.1 End-to-End Evaluation
Almost every QA system is concerned with the final answer. So a widely accepted
metric is required to evaluate the performance of our system and compare it with other
existing systems. Most of the recent large scale QA evaluations have taken place as part
of the TREC conferences, and hence the evaluation metrics used there have been
extensively studied and are used in this study. The following are definitions of several
metrics for evaluating factoid questions. Evaluating descriptive questions is much more
difficult than evaluating factoids.
1.6.2 Mean Reciprocal Rank
The original evaluation metric used in the QA tracks of TREC 8 and 9 was mean
reciprocal rank (MRR). MRR provides a method for scoring systems which return
multiple competing answers per question. Let $Q$ be the question collection and $r_i$ the
rank of the first correct answer to question $i$, or 0 if no correct answer is returned (in
which case the term $1/r_i$ is taken to be 0). MRR is then given by:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{r_i} \qquad (1.1)$$
As useful as MRR was as an evaluation metric for the early TREC QA evaluations, it does
have a number of drawbacks [8], the most important of which are that
- systems are given no credit for retrieving multiple (different) correct answers;
- as the task required each system to return at least one answer per question, no
credit was given to systems for determining that they did not know or could not
locate an appropriate answer to a question.
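As a minimal sketch of the MRR definition, the function below takes, for each question, the rank of the first correct answer (0 when none was returned). The function name and the sample ranks are assumptions made for this illustration.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank over all questions; rank 0 contributes nothing."""
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / r for r in first_correct_ranks if r > 0)
    return total / len(first_correct_ranks)

# Four questions: correct answer at rank 1, rank 2, not found, rank 4.
mrr = mean_reciprocal_rank([1, 2, 0, 4])
# (1 + 1/2 + 0 + 1/4) / 4 = 0.4375
```

A question with several correct answers in the list scores the same as one with a single correct answer at the same first rank, which is the first drawback listed above.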
1.6.3 Confidence Weighted Score
Following the shortcomings of MRR as an evaluation metric, the confidence weighted
score (CWS) was chosen as the new evaluation metric [9]. Under this evaluation metric a
system returns a single answer for each question. These answers are then sorted before
evaluation so that the answer which the system has most confidence in is placed first.
The last answer evaluated will therefore be the one the system has least confidence in.
Given this ordering CWS is formally defined in Equation 1.2:

$$\mathrm{CWS} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{\text{no. of correct in first } i \text{ answers}}{i} \qquad (1.2)$$
CWS therefore rewards systems which can not only provide correct exact answers to
questions but which can also recognise how likely an answer is to be correct and hence
place it early in the sorted list of answers. The main issue with CWS is that it is difficult
to get an intuitive understanding of the performance of a QA system given a CWS score,
as it does not relate directly to the number of questions the system was capable of
answering.
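A short sketch may help to make CWS more concrete. It assumes the answers have already been sorted from most to least confident and that each is represented only by a correct/incorrect judgement; the function name and sample data are hypothetical.

```python
def confidence_weighted_score(judgements):
    """judgements: list of booleans, most confident answer first."""
    if not judgements:
        return 0.0
    total = 0.0
    correct_so_far = 0
    for i, is_correct in enumerate(judgements, start=1):
        if is_correct:
            correct_so_far += 1
        # Fraction of correct answers among the first i answers.
        total += correct_so_far / i
    return total / len(judgements)

# Three correct answers out of four, in three different confidence orders.
mixed = confidence_weighted_score([True, False, True, True])
best = confidence_weighted_score([True, True, True, False])
worst = confidence_weighted_score([False, True, True, True])
```

All three runs answer the same number of questions correctly, yet their scores differ, which illustrates why a CWS value does not map directly onto the number of questions answered.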
1.6.4 Accuracy and coverage
Accuracy of a QA system is a simple evaluation metric with direct correspondence to the
number of correct answers. Let $C_{D,q}$ be the correct answers for question $q$ known to be
contained in the document collection $D$ and $F^{S}_{D,q,n}$ be the first $n$ answers found by
system $S$ for question $q$ from $D$; then accuracy is defined as:

$$\mathrm{accuracy}^{S}(Q, D, n) = \frac{|\{ q \in Q \mid F^{S}_{D,q,n} \cap C_{D,q} \neq \emptyset \}|}{|Q|} \qquad (1.3)$$

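Similarly, the coverage of a retrieval system $S$ for a question set $Q$ and document
collection $D$ at rank $n$ is the fraction of the questions for which at least one relevant
document is found within the top $n$ documents:

$$\mathrm{coverage}^{S}(Q, D, n) = \frac{|\{ q \in Q \mid R^{S}_{D,q,n} \cap A_{D,q} \neq \emptyset \}|}{|Q|} \qquad (1.4)$$

Here $A_{D,q}$ is the subset of $D$ containing the documents relevant to $q$ and $R^{S}_{D,q,n}$ is the
set of the $n$ top-ranked documents retrieved by $S$, as defined in Section 1.6.5.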
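Accuracy and coverage share the same shape: both count the questions for which the top n returned items overlap a gold set. The sketch below uses one hypothetical helper, `fraction_with_hit`, for both; the dictionaries and identifiers are made-up illustration data.

```python
def fraction_with_hit(returned, gold, n):
    """Fraction of questions whose top-n returned items hit the gold set.

    returned: question id -> ranked list of items (answers or documents)
    gold:     question id -> set of correct answers or relevant documents
    """
    if not gold:
        return 0.0
    hits = 0
    for question, gold_items in gold.items():
        top_n = returned.get(question, [])[:n]
        if set(top_n) & set(gold_items):
            hits += 1
    return hits / len(gold)

# With answers this is accuracy; with documents it is coverage.
gold = {"q1": {"a"}, "q2": {"b"}, "q3": {"c"}, "q4": {"d"}}
returned = {"q1": ["a", "x"], "q2": ["x", "b"], "q3": ["x", "y", "c"]}
at_1 = fraction_with_hit(returned, gold, 1)
at_2 = fraction_with_hit(returned, gold, 2)
at_3 = fraction_with_hit(returned, gold, 3)
```

Both measures can only stay the same or grow as n increases, since a larger n only adds items to the set being checked.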
1.6.5 Traditional Metrics – Recall and Precision
The standard evaluation measures for IR systems are precision and recall. Let $D$ be the
document (or passage) collection, $A_{D,q}$ the subset of $D$ which contains relevant
documents for a query $q$ and $R^{S}_{D,q,n}$ be the $n$ top-ranked documents (or passages) in $D$
retrieved by an IR system $S$ (figure 1.2).

The recall of an IR system $S$ at rank $n$ for a query $q$ is the fraction of the relevant
documents $A_{D,q}$ which have been retrieved:

$$\mathrm{recall}^{S}(D, q, n) = \frac{|R^{S}_{D,q,n} \cap A_{D,q}|}{|A_{D,q}|} \qquad (1.5)$$

The precision of an IR system $S$ at rank $n$ for a query $q$ is the fraction of the retrieved
documents $R^{S}_{D,q,n}$ that are relevant:

$$\mathrm{precision}^{S}(D, q, n) = \frac{|R^{S}_{D,q,n} \cap A_{D,q}|}{|R^{S}_{D,q,n}|} \qquad (1.6)$$
Clearly, given a set of queries $Q$, average recall and precision values can be calculated to
give a more representative evaluation of a specific IR system. Unfortunately these
evaluation metrics, although well founded and used throughout the IR community, suffer
from two problems when used in conjunction with the large document collections
utilized by QA systems. The first is determining the set of relevant documents within a
collection for a given query, $A_{D,q}$. The only accurate way to determine which documents
are relevant to a query is to read every single document in the collection and determine
its relevance. Clearly, given the size of the collections over which QA systems are being
operated, this is not a feasible proposition. The second is that finding a
relevant document does not automatically mean the QA system will be able to
identify and extract a correct answer. Therefore it is better to use recall and precision at
the document retrieval stage rather than for the complete system.
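Recall and precision at rank n can be computed directly from a ranked list of retrieved documents and a set of relevant ones. The sketch assumes the relevance judgements are already available, which, as noted above, is the hard part in practice; the function names and document identifiers are hypothetical.

```python
def recall_at_n(retrieved, relevant, n):
    """Fraction of the relevant documents found in the top n retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:n]) & set(relevant)) / len(relevant)

def precision_at_n(retrieved, relevant, n):
    """Fraction of the top n retrieved documents that are relevant."""
    top_n = retrieved[:n]
    if not top_n:
        return 0.0
    return len(set(top_n) & set(relevant)) / len(top_n)

retrieved = ["d1", "d2", "d3", "d4"]   # ranked output of the IR system
relevant = {"d1", "d3", "d7"}          # judged relevant for the query

recall_2 = recall_at_n(retrieved, relevant, 2)
precision_2 = precision_at_n(retrieved, relevant, 2)
recall_4 = recall_at_n(retrieved, relevant, 4)
precision_4 = precision_at_n(retrieved, relevant, 4)
```

Averaging these values over every query in a question set gives the per-system figures mentioned above.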

Figure 1.2: Sections of a document collection as used for IR evaluation.
[Figure: the document collection/corpus $D$, containing the relevant documents $A_{D,q}$
and the retrieved documents $R^{S}_{D,q,n}$.]

Chapter2. Question Analysis

As the first component in a QA system it could easily be argued that question analysis is
the most important part. Not only is the question analysis component responsible for
determining the expected answer type and for constructing an appropriate query for
use by an IR engine but any mistakes made at this point are likely to render useless any
further processing of a question. If the expected answer type is incorrectly determined
then it is highly unlikely that the system will be able to return a correct answer as most
systems constrain possible answers to only those of the expected answer type. In a
similar way a poorly formed IR query may result in no answer bearing documents being
retrieved and hence no amount of further processing by an answer extraction
component will lead to a correct answer being found.
2.1 Determining the Expected Answer Type
In most QA systems the first stage in processing a previously unseen question is to
determine the semantic type of the expected answer. Determining the expected answer
type for a question implies the existence of a fixed set of answer types which can be
assigned to each new question. The problem of question type classification can be
solved by constructing manual rules or, if we have access to a large set of annotated,
pre-classified questions, by using machine learning approaches. Our system uses a machine
learning model with a feature-weighting scheme that assigns
different weights to features instead of simple binary values. The main characteristic of
this model is that it assigns more reasonable weights to features: these weights can be used
to differentiate features from each other according to their contribution to question
classification. Further, we propose to use the features initially just as a bag of words and
later both as a bag of words and as what we call the partitioned feature model. Results show
that with this new feature-weighting model the SVM-based classifier outperforms the
one without it to a large extent.
2.1.1 Question Classes
We follow the two-layered question taxonomy, which contains 6 coarse grained
categories and 50 fine grained categories, as shown in Table 1.1. Each coarse grained
category contains a non-overlapping set of fine grained categories. Most question
answering systems use a coarse grained category definition. Usually the number of
question categories is less than 20. However, it is obvious that a fine grained category
definition is more beneficial in locating and verifying the plausible answers.

Table 1.1: Coarse and fine grained question categories.

Coarse  Fine
ABBR    abbreviation, expansion
DESC    definition, description, manner, reason
ENTY    animal, body, color, creation, currency, disease/medical, event, food, instrument, language, letter, other, plant, product, religion, sport, substance, symbol, technique, term, vehicle, word
HUM     description, group, individual, title
LOC     city, country, mountain, other, state
NUM     code, count, date, distance, money, order, other, percent, period, speed, temperature, size, weight

2.1.2 Manually Constructed rules for question classification
Often the easiest approach to question classification is a set of manually constructed
rules. This approach allows a simple low coverage classifier to be rapidly developed
without requiring a large amount of hand labelled training data. A number of systems
have taken this approach, many creating sets of regular expressions which match only
questions with the same answer type [10],[11]. While these approaches work well for
some questions (for instance, questions asking for a date of birth can be reliably
recognised using approximately six well constructed regular expressions), they often
require the examination of a vast number of questions and tend to rely purely on the
text of the question. One possible approach for manually constructing rules for such a
classifier would be to define a rule formalism that, whilst retaining the relative
simplicity of regular expressions, would give access to a richer set of features. As we
had access to a large set of pre-annotated question samples, we have not used this method.
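
To make the regular-expression approach concrete, the following is a minimal Python sketch of a rule-based classifier. The patterns, the rule list and the function name `classify_by_rules` are hypothetical illustrations and are not taken from the systems cited in [10],[11]; the class labels follow Table 1.1.

```python
import re

# Hypothetical hand-written rules: (pattern, coarse class, fine class).
# Rules are tried in order, so more specific patterns come first.
RULES = [
    (re.compile(r"^(when|what year|what date)\b.*\bborn\b", re.I), "NUM", "date"),
    (re.compile(r"^who\b", re.I), "HUM", "individual"),
    (re.compile(r"^how (many|much)\b", re.I), "NUM", "count"),
    (re.compile(r"^where\b", re.I), "LOC", "other"),
]

def classify_by_rules(question):
    """Return (coarse, fine) for the first matching rule, or None if no rule fires."""
    text = question.strip()
    for pattern, coarse, fine in RULES:
        if pattern.search(text):
            return coarse, fine
    return None

print(classify_by_rules("When was Mozart born?"))
print(classify_by_rules("Name a flying mammal."))
```

The second call returns None, which illustrates the low coverage of such rule sets: every question form that is not anticipated by a pattern is left unclassified.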
2.1.3 Fully Automatically Constructed Classifiers
As mentioned in the previous section, building a set of classification rules to perform
accurate question classification by hand is both a tedious and time-consuming task. An
alternative solution to this problem is to develop an automatic approach to constructing
a question classifier using (possibly hand labelled) training data. A number of different
automatic approaches to question classification have been reported which make use of
one or more machine learning algorithms [6][7][12], including nearest neighbour (NN)
[4], decision trees (DT) and support vector machines (SVM) [7][12], to induce a classifier.
In our system we employed an SVM classifier and a Naive Bayes classifier on different
feature sets extracted from the question.

2.1.4 Support Vector Machines
Support vector machines (SVMs) are a set of related supervised learning methods used
for classification and regression. Viewing input data as two sets of vectors in an n-
dimensional space, an SVM will construct a separating hyper-plane in that space, one
which maximizes the margin between the two data sets. To calculate the margin, two
parallel hyperplanes are constructed, one on each side of the separating hyper-plane,
which are "pushed up against" the two data sets. Intuitively, a good separation is
achieved by the hyper-plane that has the largest distance to the neighboring data points
of both classes, since in general the larger the margin the lower the generalization error
of the classifier.
We are given some training data, a set of points of the form

$$D = \{(x_i, c_i) \mid x_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\}\}_{i=1}^{n}$$

where $c_i$ is either 1 or −1, indicating the class to which the point $x_i$ belongs. Each $x_i$
is a p-dimensional real vector. We want to find the maximum-margin hyperplane which
divides the points having $c_i = 1$ from those having $c_i = -1$. Any hyperplane can be
written as the set of points $x$ satisfying

$$w \cdot x - b = 0 \tag{2.2}$$

where $\cdot$ denotes the dot product. The vector $w$ is a normal vector: it is perpendicular to
the hyperplane. The parameter $b/\|w\|$ determines the offset of the hyperplane from the
origin along the normal vector $w$. We want to choose $w$ and $b$ to maximize the
margin, or distance between the parallel hyperplanes that are as far apart as possible
while still separating the data. These hyperplanes can be described by the equations

$$w \cdot x - b = 1 \tag{2.3}$$

$$w \cdot x - b = -1 \tag{2.4}$$

Note that if the training data are linearly separable, we can select the two hyperplanes
of the margin in a way that there are no points between them and then try to maximize
their distance. By using geometry, we find the distance between these two hyperplanes
is $2/\|w\|$, so we want to minimize $\|w\|$. As we also have to prevent data points falling
into the margin, we add the following constraint: for each $i$ either

$$w \cdot x_i - b \geq 1 \tag{2.5}$$

or

$$w \cdot x_i - b \leq -1 \tag{2.6}$$

This can be rewritten as:

$$c_i (w \cdot x_i - b) \geq 1, \quad \text{for all } 1 \leq i \leq n \tag{2.7}$$

We can put this together to get the optimization problem: minimize $\|w\|$ in $(w, b)$
subject to (for any $i = 1, \ldots, n$)

$$c_i (w \cdot x_i - b) \geq 1 \tag{2.8}$$
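
The constraint and the margin width can be checked numerically on a toy example. The following Python sketch is only an illustration: the four points, the labels and the candidate pair (w, b) are made-up values, and the helper names are hypothetical. It does not solve the optimization problem; it only verifies that a given hyperplane satisfies the constraint c_i (w ⋅ x_i − b) ≥ 1 and reports the margin 2/‖w‖.

```python
import math

def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def margin_width(w):
    """Distance 2/||w|| between the hyperplanes w.x - b = 1 and w.x - b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

def satisfies_constraints(points, labels, w, b):
    """True if every training point lies on or outside its margin hyperplane."""
    return all(c * (dot(w, x) - b) >= 1 for x, c in zip(points, labels))

# Toy, linearly separable data (assumed values for illustration only).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
w, b = (0.5, 0.5), 1.0

print(satisfies_constraints(points, labels, w, b))
print(margin_width(w))
```

A candidate with a smaller ‖w‖, such as w = (0.1, 0.1) with b = 0, would give a wider nominal margin but violates the constraint, which is why the norm is minimized only subject to (2.8).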
2.1.5 Kernel Trick
If instead of the Euclidean inner product $w \cdot x_i$ one fed the QP solver with a function
$K(w, x_i)$, the boundary between the two classes would then be

$$K(x, w) + b = 0 \tag{2.9}$$

and the set of $x \in \mathbb{R}^p$ on that boundary becomes a curved surface embedded in $\mathbb{R}^p$ when
the function $K(x, w)$ is non-linear.
Consider $K(x, w)$ to be the inner product not of the coordinate vectors $x$ and $w$ in $\mathbb{R}^p$ but
of vectors $\phi(x)$ and $\phi(w)$ in higher dimensions. The map $\phi: X \to H$
is called a feature map from the data space $X$ into the feature space $H$. The feature
space is assumed to be a Hilbert space of real valued functions defined on $X$. The data
space is often $\mathbb{R}^p$, but most of the interesting results hold when $X$ is a compact
Riemannian manifold. The following picture illustrates a particularly simple example
where a feature map $\phi(x_1, x_2)$ maps data in $\mathbb{R}^2$ into a higher-dimensional feature space.

Figure 2.1: The kernel trick, after transformation the data is linearly separable.
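
The identity behind the kernel trick can be verified on a small example. The Python sketch below assumes a standard textbook feature map, φ(x1, x2) = (x1², √2·x1·x2, x2²), which may differ from the one drawn in Figure 2.1. For this map the inner product in the feature space equals the quadratic kernel K(x, w) = (x ⋅ w)², so the kernel can be evaluated without ever constructing φ(x).

```python
import math

def dot(u, v):
    """Euclidean dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Assumed quadratic feature map from R^2 into R^3 (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def quadratic_kernel(x, w):
    """K(x, w) = (x . w)^2, computed directly in the data space."""
    return dot(x, w) ** 2

x, w = (1.0, 2.0), (3.0, -1.0)
print(quadratic_kernel(x, w))
print(dot(phi(x), phi(w)))
```

Both printed values agree up to floating-point rounding. Under this map a circle x1² + x2² = r² in the data space becomes the plane z1 + z3 = r² in the feature space, which is one way data that are not linearly separable in the plane can become linearly separable after the transformation.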
2.1.6 Naïve Bayes Classifier

Along with the SVM, we also tried a Naïve Bayes classifier [6]. A naive Bayes classifier
is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive)
independence assumptions. A more descriptive term for the underlying probability
model would be "independent feature model". In simple terms, a naive Bayes classifier
assumes that the presence (or absence) of a particular feature of a class is unrelated to
the presence (or absence) of any other feature. For example, the words or features of a
given question are assumed to be independent in order to keep the mathematics simple.
Even if these features in fact depend on one another, a naive Bayes classifier considers
all of them to contribute independently to the probability that the question belongs to a
given class.
Depending on the precise nature of the probability model, naive Bayes classifiers can be
trained very efficiently in a supervised learning setting. In many practical applications,
parameter estimation for naive Bayes models uses the method of maximum likelihood;
in other words, one can work with the naive Bayes model without believing in Bayesian
probability or using any Bayesian methods.
Abstractly, the probability model for a classifier is a conditional model

p(C| F_1,…….,F_n),

over a dependent class variable C with a small number of outcomes or classes,
conditional on several feature variables F_1 through F_n. The problem is that if the
number of features n is large or when a feature can take on a large number of values,
then basing such a model on probability tables is infeasible. We therefore reformulate
the model to make it more tractable.
Using Bayes' theorem, we write

$$p(C \mid F_1, \ldots, F_n) = \frac{p(C)\, p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)} \tag{2.10}$$

In plain English the above equation can be written as

posterior = (prior*likelihood)/evidence (2.11)

In practice we are only interested in the numerator of that fraction, since the
denominator does not depend on C and the values of the features F_i are given, so that
the denominator is effectively constant. The numerator is equivalent to the joint
probability model

p(C, F_1, ………, F_n),

which can be rewritten as follows, using repeated applications of the definition of
conditional probability:

p(C, F_1, ………., F_n)
= p(C) p(F_1,……….,F_n| C)
= p(C) p(F_1| C) p(F_2,.......,F_n| C, F_1)

= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3,.......,F_n| C, F_1, F_2)
= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3| C, F_1, F_2) p(F_4,.......,F_n| C, F_1, F_2, F_3)
= p(C) p(F_1| C) p(F_2| C, F_1) p(F_3| C, F_1, F_2) ...
.... p(F_n| C, F_1, F_2, F_3,.......,F_{n-1}) (2.12)
and so forth. Now the "naive" conditional independence assumptions come into play:
assume that each feature F_i is conditionally independent of every other feature F_j for
j ≠ i. This means that
p(F_i | C, F_j) = p(F_i | C), (2.13)
and so the joint model can be expressed as
p(C, F_1, ......., F_n) = p(C) p(F_1| C) p(F_2| C) p(F_3| C) …….. p(F_n| C)
= p(C) ∏_{i=1}^{n} p(F_i | C) (2.14)
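
The factorised model translates directly into a classifier that picks the class maximising log p(C) + Σ log p(F_i | C). The Python sketch below is a minimal illustration on made-up toy questions, not the thesis implementation (which is written in Java); the function names are hypothetical. It uses add-one smoothing for p(F_i | C), so that a word unseen in a class does not drive the product to zero.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (list_of_words, class_label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(model, words):
    """Return the class maximising log p(C) + sum_i log p(F_i | C)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)                       # log prior p(C)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for word in words:
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training questions (assumed data, for illustration only).
toy = [
    ("who wrote hamlet".split(), "HUM"),
    ("who discovered penicillin".split(), "HUM"),
    ("where is paris".split(), "LOC"),
    ("where is the nile".split(), "LOC"),
]
model = train(toy)
print(predict(model, "who invented radio".split()))
print(predict(model, "where is rome".split()))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the evidence term of Bayes' theorem is dropped because it is the same for every class.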
2.1.7 Datasets
We used the publicly available training and testing datasets provided by Tagged
Question Corpus, Cognitive Computation Group at the Department of Computer Science,
University of Illinois at Urbana-Champaign (UIUC) [5]. All these datasets have been
manually labelled by UIUC [5] according to the coarse and fine grained categories in
Table 1.1. There are about 5,500 labelled questions randomly divided into 5 training
datasets of sizes 1,000, 2,000, 3,000, 4,000 and 5,500 respectively. The testing dataset
contains 2000 labelled questions from the TREC QA track. The TREC QA data is hand
labelled by us.
2.1.8 Features
For each question, we extract two kinds of features: bag-of-words, or a mix of POS tags
and words. Every question is represented as a feature vector; the weight associated with
each word varies between 0 and 1. The following example demonstrates the different
feature sets considered for a given question and its POS parse.

Figure 2.2: Various feature sets extracted from the given question and its corresponding
part of speech tags.

2.1.9 Entropy and Weighted Feature Vector
In information theory the concept of entropy is used as a measure of the uncertainty of
a random variable. Let X be a discrete random variable over an alphabet A, and let
p(x) = Pr(X = x), x ∈ A, be its probability function. The entropy H(X) of the discrete
random variable X is then defined as:

$$H(X) = -\sum_{x \in A} p(x) \log p(x) \tag{2.15}$$

The larger the entropy H(X) is, the more uncertain the random variable X is. In
information retrieval many methods have been applied to evaluate a term's relevance to
documents, among which entropy weighting, based on information theoretic ideas, has
been reported to be the most effective and sophisticated. Let $f_{it}$ be the frequency of
word i in document t, $n_i$ the total number of occurrences of word i in the document
collection, and N the total number of documents in the collection. The confusion (or
entropy) of word i can then be measured as follows:

$$H(i) = \sum_{t=1}^{N} \frac{f_{it}}{n_i} \log\frac{n_i}{f_{it}} \tag{2.16}$$

The larger the confusion of a word is, the less important it is. The confusion achieves
maximum value log(N) if the word is evenly distributed over all documents, and
minimum value 0 if the word occurs in only one document.
With this in mind, some preprocessing is needed to calculate the entropy of a word.
Let C be the set of question types. Without loss of generality, it is denoted by
C = {1, ..., N}. $C_i$ is the set of words extracted from questions of type i; that is to
say, $C_i$ represents a word collection similar to a document. From the viewpoint of
representation, each $C_i$ is the same as a document because both are just collections of
words. Therefore we can also use the idea of entropy to evaluate a word's importance.
Let $a_i$ be the weight of word i, $f_{it}$ the frequency of word i in $C_t$, and $n_i$ the total
number of occurrences of word i in all questions. Then $a_i$ is defined as:

$$a_i = 1 + \frac{1}{\log N} \sum_{t=1}^{N} \frac{f_{it}}{n_i} \log\frac{f_{it}}{n_i} \tag{2.17}$$

The weight of word i is opposite to its entropy: the larger the entropy of word i is, the
less important it is to question classification, and the smaller the weight associated
with it. Consequently, $a_i$ takes the maximum value of 1 if word i occurs in the word set
of only one question type, and the minimum value of 0 if the word is evenly distributed
over all sets. Note that if a word occurs in only one set, $f_{it}$ is 0 for all other sets.
We use the convention that 0 log 0 = 0, which is easily justified since x log x → 0 as x → 0.
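
The weighting scheme can be illustrated with a short Python sketch. It follows the definition of the weight as 1 + (1/log N) Σ_t (f_it/n_i) log(f_it/n_i), with N the number of question types; the function name `entropy_weights` and the toy word sets are hypothetical, and at least two question types are assumed so that log N is non-zero.

```python
import math
from collections import Counter

def entropy_weights(words_by_class):
    """words_by_class maps each question type t to the list of words in C_t.

    Returns a dict word -> weight in [0, 1].
    """
    n_classes = len(words_by_class)            # N, must be at least 2
    per_class = {label: Counter(words) for label, words in words_by_class.items()}
    totals = Counter()                         # n_i: occurrences over all types
    for counts in per_class.values():
        totals.update(counts)
    weights = {}
    for word, n_i in totals.items():
        acc = 0.0
        for counts in per_class.values():
            f_it = counts[word]
            if f_it > 0:                       # convention: 0 log 0 = 0
                share = f_it / n_i
                acc += share * math.log(share)
        weights[word] = 1.0 + acc / math.log(n_classes)
    return weights

# Toy word sets for two question types (assumed data).
toy = {"HUM": ["who", "the"], "LOC": ["where", "the"]}
for word, weight in sorted(entropy_weights(toy).items()):
    print(word, round(weight, 3))
```

In this toy example "who" and "where" each occur in a single type and receive weight 1, while "the" is spread evenly over both types and receives weight 0, matching the two extreme cases described above.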

2.1.10 Experiment Results
We tested the following algorithms for question classification:
- Naïve Bayes classifier using the bag-of-words feature set (64% accurate on TREC data)
- Naïve Bayes classifier using the partitioned feature set (69% accurate on TREC data)
- Support Vector Machine classifier using the bag-of-words feature set (78% accurate on TREC data)
- Support Vector Machine classifier using the weighted feature set (85% accurate on TREC data)
It must be noted that the classifiers were NOT trained on TREC data. The classifiers
assigned questions to six coarse classes and fifty fine classes. A baseline (random)
classifier over the fifty fine classes is therefore (1/50) = 2% accurate. We applied
various smoothing techniques to the Naive Bayes classifier. The performance without
smoothing was too low to be worth reporting. While Witten-Bell smoothing worked well,
simple add-one smoothing outperformed it. The accuracies reported here are for the
Naive Bayes classifier employing add-one smoothing.
We implemented the weighted feature set SVM classifier as a cross-platform standalone
desktop application (shown below). The application will be made available to the public
for evaluation. Training was done on a set of 12788 questions provided by the Cognitive
Computation Group at the Department of Computer Science, University of Illinois at
Urbana-Champaign.

Figure 2.3: Chart showing the accuracy of each classifier (baseline classifier, Naïve
Bayes classifier using bag-of-words features, Naïve Bayes classifier using partitioned
features, SVM classifier using bag-of-words features, SVM classifier using weighted
feature set). Classifiers were tested on a set of 2000 TREC questions.

Some sample test runs:
Q: What was the name of the first Russian astronaut to do a spacewalk?
Response: HUM -> IND(an Individual)
Q: How much folic acid should an expectant mother get daily?
Response: NUM -> COUNT
Q: What is Francis Scott Key best known for?
Response: DESC -> DESC
Q: What state has the most Indians?
Response: LOC -> STATE
Q: Name a flying mammal.
Response: ENTITY -> ANIMAL

Figure 2.4: JAVA Question Classifier, can be downloaded for evaluation from
2.2 Query Formulation

The question analysis component of a QA system is usually responsible for formulating
a query from a natural language question so as to maximise the performance of the IR
engine used by the document retrieval component of the QA system. Most QA systems
start constructing an IR query simply by assuming that the question itself is a valid IR
query, while other systems apply query expansion. The query expansion module should
be designed to maintain the right balance between recall and precision. For a large
corpus, query expansion may not be necessary: even with a poorly formed query, recall
is sufficient to extract the right answer, and query expansion may in fact reduce
precision. For a small local corpus, however, query expansion may be necessary. In our
system, when using the web as the document collection, we pass on the question as the
IR query after masking the stop words. When a web corpus is not available we employ
the Rocchio query expansion method [1], which is implemented in the Lucene query
expansion module. The table below shows the performance of various query expansion
modules implemented on Lucene. The test was carried out on data from the NIST TREC
Robust Retrieval Track 2004.

Combined Topic Set
Tag            MAP      P10      %no
Lucene QE      0.2433   0.3936   18.10%
Lucene gQE     0.2332   0.3984   14%
KB-R-FIS gQE   0.2322   0.4076   14%
Lucene         0.2      0.37     15%

MAP - mean average precision
P10 - average of precision at 10 documents retrieved
%no - percentage of topics with no relevant documents in the top 10 retrieved

Lucene QE - Lucene with local query expansion
Lucene gQE - Lucene system that utilized Rocchio's query expansion along with Google
KB-R-FIS gQE - My Fuzzy Inference System that utilized Rocchio's query expansion along with Google

Table 2.1: Performance of various query expansion modules implemented on Lucene.
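
The three measures reported in Table 2.1 can be computed as in the following Python sketch. It is an illustration only: the function names and the toy runs are hypothetical, and the official TREC figures are produced by NIST's own evaluation tools rather than by this code.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document occurs."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def percent_no_relevant(runs, k=10):
    """Percentage of topics with no relevant document in the top k."""
    misses = sum(1 for r, rel in runs if not any(d in rel for d in r[:k]))
    return 100.0 * misses / len(runs)

# Two toy topics (assumed data).
runs = [(["a", "b", "c", "d"], {"a", "c"}), (["x", "y"], {"z"})]
print(mean_average_precision(runs))
print(percent_no_relevant(runs))
```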

It must be noted that query expansion is also carried out internally by the APIs used to
retrieve documents from the web; because these APIs are proprietary, its exact
behaviour is unknown and unpredictable.
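
As a rough illustration of Rocchio-style expansion, the Python sketch below adds to the query the terms that are most prominent in a set of feedback documents. It is a simplified positive-feedback variant written for this example and is not the Lucene query expansion module: the function name, the values of alpha and beta, the use of normalised term frequency as the document weight, and the toy documents are all assumptions.

```python
from collections import Counter

def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, top_k=3):
    """Return the query terms followed by the top_k new terms.

    Term weight = alpha * (term in query) + beta * mean normalised term
    frequency over the feedback documents. The negative-feedback term of
    the full Rocchio formula is omitted in this sketch.
    """
    weights = Counter()
    for term in query_terms:
        weights[term] += alpha
    docs = [doc for doc in feedback_docs if doc]     # ignore empty documents
    for doc in docs:
        for term, count in Counter(doc).items():
            weights[term] += beta * (count / len(doc)) / len(docs)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    new_terms = [t for t, _ in ranked if t not in query_terms][:top_k]
    return list(query_terms) + new_terms

feedback = [
    ["barnard", "heart", "transplant", "barnard"],
    ["barnard", "surgeon"],
]
print(rocchio_expand(["heart", "transplant"], feedback, top_k=1))
```

Limiting the number of added terms with top_k is one simple way to control the trade-off described above: more expansion terms tend to raise recall at the cost of precision.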
2.2.1 Stop word for IR query formulation
Stop words or noise words are words which appear with a high frequency and are
considered to be insignificant for normal IR processes. Unfortunately, when it comes to
QA systems, a high frequency of a word in a collection does not always mean that it is
insignificant for retrieving the answer. For example, the word "first" is widely
considered to be a stop word but is very important when it appears in the question
"Who was the first President of India?". Therefore we manually analyzed 100 TREC QA
track questions and prepared a list of stop words. A partial list of the stop words is
shown below.

I a about an are
As at be by com
De en for from how
In is it la of
On or that the this
To was what when where
Who will with und the

The list of stop words we obtained is much smaller than standard stop word lists
(although there is no definite list of stop words which all natural language processing
tools incorporate, most of these lists are very similar).
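
Masking the stop words of a question before it is passed on as an IR query can be sketched in Python as follows. The stop word set is the partial list shown above; the tokenisation rule and the function name `mask_stop_words` are assumptions made for this example.

```python
import re

# Partial stop word list from this section (lower-cased, duplicates removed).
QA_STOP_WORDS = {
    "i", "a", "about", "an", "are", "as", "at", "be", "by", "com",
    "de", "en", "for", "from", "how", "in", "is", "it", "la", "of",
    "on", "or", "that", "the", "this", "to", "was", "what", "when",
    "where", "who", "will", "with", "und",
}

def mask_stop_words(question):
    """Lower-case the question, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", question.lower())
    return [token for token in tokens if token not in QA_STOP_WORDS]

print(mask_stop_words("Who was the first President of India?"))
```

Because "first" is not on the list, it survives in the query, which is the behaviour motivated by the example question above.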

Chapter3. Document Retrieval

The text collection over which a QA system works tends to be so large that it is
impossible to process the whole of it to retrieve the answer. The task of the document
retrieval module is to select a small subset of the collection which can be practically
handled in the later stages. A good retrieval unit will increase precision while
maintaining good enough recall.
3.1 Retrieval from local corpus
All the work presented in this thesis relies upon the Lucene IR engine [13] for local
corpus searches. Lucene is an open source Boolean search engine with support for
ranked retrieval results using a TF.IDF based vector space model. One of the main
advantages of using Lucene over many other IR engines is that it is relatively easy to
extend to meet the demands of a given research project (as an open source project the
full source code of Lucene is available, making modification and extension relatively
straightforward), allowing experiments with different retrieval models or ranking
algorithms to use the same document index.
3.1.1 Ranking function
We employ the highly popular Okapi BM25 [3] ranking function for our document
retrieval module. It is based on the probabilistic retrieval framework developed in the
1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others [14].
The name of the actual ranking function is BM25. To set the right context, however, it
is usually referred to as "Okapi BM25", since the Okapi information retrieval system,
implemented at London's City University in the 1980s and 1990s, was the first system
to implement this function. BM25 and its newer variants, e.g. BM25F [2] (a version of
BM25 that can take document structure and anchor text into account), represent
state-of-the-art retrieval functions used in document retrieval, such as Web search.
3.1.2 Okapi BM25
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the
query terms appearing in each document, regardless of the inter-relationship between
the query terms within a document (e.g., their relative proximity). It is not a single
function, but actually a whole family of scoring functions, with slightly different
components and parameters. One of the most prominent instantiations of the function
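is as follows. Given a query Q, containing keywords $q_1, \ldots, q_n$, the BM25 score of a
document D is:

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(q_i, D)$ is $q_i$'s term frequency in the document D, |D| is the length of the
document D in words, and avgdl is the average document length in the text collection
from which documents are drawn. $k_1$ and $b$ are free parameters, usually chosen as
$k_1 = 2.0$ and $b = 0.75$. $\mathrm{IDF}(q_i)$ is the IDF (inverse document frequency) weight of the
query term $q_i$. It is usually computed as:

$$\mathrm{IDF}(q_i) = \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$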

where N is the total number of documents in the collection, and n(qi) is the number of
documents containing qi. There are several interpretations for IDF and slight variations
on its formula. In the original BM25 derivation, the IDF component is derived from the
Binary Independence Model.
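
A minimal Python sketch of this scoring function is given below. It is an illustration under stated assumptions and not the Lucene-based implementation used in the system: documents are plain token lists, the collection is non-empty, and the function name `bm25_scores` and the toy documents are hypothetical. The defaults k1 = 2.0 and b = 0.75 are the values quoted above.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=2.0, b=0.75):
    """Return one BM25 score per document; documents are lists of tokens."""
    n_docs = len(documents)
    avgdl = sum(len(doc) for doc in documents) / n_docs
    doc_freq = Counter()                      # n(q): documents containing q
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for q in query_terms:
            n_q = doc_freq[q]
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
            f = tf[q]
            norm = f + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ["heart", "transplant", "barnard"],
    ["football", "match", "result"],
    ["weather", "report", "today"],
]
print(bm25_scores(["heart", "transplant"], docs))
```

With this IDF formula a term that occurs in more than half of the documents receives a negative weight, which is one reason several variants of the formula exist.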
3.1.3 IDF Information Theoretic Interpretation
Here is an interpretation from information theory. Suppose a query term q appears in
n(q) documents. Then a randomly picked document D will contain the term with
probability n(q)/N (where N is again the cardinality of the set of documents in the
collection). Therefore, the information content of the message "D contains q" is:

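$$-\log\frac{n(q)}{N} = \log\frac{N}{n(q)}$$

Now suppose we have two query terms $q_1$ and $q_2$. If the two terms occur in documents
entirely independently of each other, then the probability of seeing both $q_1$ and $q_2$ in a
randomly picked document D is:

$$\frac{n(q_1)}{N} \cdot \frac{n(q_2)}{N}$$

and the information content of such an event is:

$$\sum_{i=1}^{2} \log\frac{N}{n(q_i)}$$
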
With a small variation, this is exactly what is expressed by the IDF component of BM25.
3.2 Information retrieval from the web
Indexing the whole web is a gigantic task which is not feasible on a small scale.
Therefore we use the public APIs of search engines, namely the Google AJAX Search
API and Yahoo BOSS. Both APIs have relaxed terms and conditions and allow
programmatic access. Moreover, there are no limits on the number of queries per day
when they are used for educational purposes. The search APIs can return the top n
documents for a given query. We read the top n uniform resource locators (URLs) and
build the collection of documents to be used for answer retrieval. As reading URLs over
the internet is an inherently slow process, this stage is the most taxing one in terms of
runtime. To accelerate the process we employ multi-threaded URL readers so that
multiple URLs can be read simultaneously. Figure 3.1 shows the document retrieval
framework.
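
The idea of reading several URLs at once can be sketched with a thread pool. The Python code below is an illustration only and not the Java reader module of the system; the function names, the number of workers and the timeout are assumptions. The download function is passed in as a parameter so that the concurrency logic can be exercised without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_url(url, timeout=10):
    """Download one page and return its text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

def read_all(urls, fetch=fetch_url, max_workers=4):
    """Fetch all URLs concurrently; a URL whose read fails maps to None."""
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(safe_fetch, urls))   # results keep input order
    return dict(zip(urls, pages))

# Offline demonstration with a stand-in fetch function.
def fake_fetch(url):
    if "bad" in url:
        raise IOError("unreachable")
    return "page:" + url

print(read_all(["u1", "bad", "u2"], fetch=fake_fetch))
```

Since the work is dominated by waiting on the network, threads overlap the waiting time of the individual reads, and a failed or slow URL does not block the others beyond its own timeout.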

Figure 3.1: Document retrieval framework (components shown: IR query, Lucene IR with
Okapi BM25 ranking function, search APIs, top n results, multi-threaded reader module
with four URL readers).
3.2.1 How many documents to retrieve?

One of the main considerations when doing document retrieval for QA is the amount of
text to retrieve and process for each question. Ideally a system would retrieve a single
text unit that was just large enough to contain a single instance of the exact answer for
every question. Whilst this ideal is not attainable, the document retrieval stage can act
as a filter between the document collection or web and the answer extraction
components by retrieving a relatively small set of text. Our target is therefore to
increase coverage with the least number of retrieved documents forming the text
collection. Lowered precision is penalized by a higher average processing time in later
stages. The criterion for selecting the right collection size therefore depends on
coverage and average processing time. The table below shows the percentage coverage
and average processing time at different ranks for the Google and Yahoo search APIs.
The results were obtained on a set of 30 questions (equally distributed over all question
classes) from the TREC 04 QA track [5].

Avg. processing time (sec)*    %Coverage           @rank
Yahoo      Google              Yahoo     Google
0.02       0.021               23        28        1
0.102      0.09                31        48        2
0.27       0.23                37        56        3
0.34       0.37                42        58        4
0.49       0.51                48        64        5
0.82       0.803               49        64        6
1.23       1.1                 49        64        7
1.44       1.39                51        66        8
2.01       1.9                 51        70        9
2.31       2.2                 52        72        10
2.8        2.6                 53        72        11
3.22       3.1                 53        73        12
3.7        3.4                 54        73        13
4.2        4.6                 54        73        14
4.77       5.1                 55        74        15

*Average time spent by answer retrieval node.

Table 3.1: %coverage and average processing time at different ranks

Figure 3.2: %coverage vs. rank for the Yahoo BOSS API and Google AJAX Search (values as
in Table 3.1).

Figure 3.3: %coverage vs. average processing time (sec) for Google AJAX Search and
Yahoo BOSS.

From the results it is clear that going up to rank 5 ensures good coverage while
maintaining a low processing time. Google outperforms Yahoo in coverage at all ranks.
Chapter4. Answer Extraction

The final stage in a QA system, and arguably the most important, is to extract and
present the answers to questions. We employ a named entity (NE) recognizer to select
those sentences which could potentially contain the answer to the given question. In
our system we have used GATE – A General Architecture for Text Engineering, provided
by the Sheffield NLP group [15], as a tool to handle most of the NLP tasks, including NE
recognition.
4.1 Sentence Ranking
The sentence ranking module is responsible for ranking the sentences and giving a
relative probability estimate to each one. It also registers the frequency of each
individual phrase chunk marked by the NE recognizer for a given question class. The
final answer is the phrase chunk with maximum frequency belonging to the sentence
with the highest rank. The probability estimate and the retrieved answer's frequency
are used to compute the confidence of the answer.
4.1.1 WordNet
WordNet [16] is the product of a research project at Princeton University which has
attempted to model the lexical knowledge of a native speaker of English. In WordNet
each unique meaning of a word is represented by a synonym set or synset. Each synset
has a gloss that defines the concept of the word. For example, the words car, auto,
automobile, and motorcar form a synset that represents the concept defined by the
gloss: four wheel motor vehicle, usually propelled by an internal combustion engine.
Many glosses have examples of usage associated with them, such as "he needs a car to
get to work." In addition to providing these groups of synonyms to represent a concept,
WordNet connects concepts via a variety of semantic relations. These semantic
relations for nouns include:
- Hyponym/Hypernym (IS-A/ HAS A)
- Meronym/Holonym (Part-of / Has-Part)
- Meronym/Holonym (Member-of / Has-Member)
- Meronym/Holonym (Substance-of / Has-Substance)

Figure 4.1 shows a fragment of WordNet taxonomy.

4.1.2 Sense/Semantic Similarity between words
We use corpus statistics to compute the information content (IC) value. We assign a
probability to a concept in the taxonomy based on the occurrence of the target concept
in a given corpus. The IC value is then calculated by the negative log likelihood formula
as follows:

$$IC(c) = -\log(P(c))$$

where c is a concept and P(c) is the probability of encountering c in a given corpus. The
basic idea behind the negative log likelihood formula is that the more probable a
concept is, the less information it conveys; in other words, infrequent words are more
informative than frequent ones. Using this basic idea we compute the sense/semantic
similarity $\xi$ between two given words based on a similarity metric proposed by Philip
Resnik [17].
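
Resnik's metric takes the similarity of two concepts to be the information content of their most informative common ancestor in the taxonomy. The Python sketch below illustrates this on a made-up toy taxonomy with made-up corpus counts; it does not use WordNet data, and all names and numbers in it are assumptions chosen for the example.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and corpus counts; not WordNet data.
PARENT = {"car": "vehicle", "bicycle": "vehicle", "vehicle": "entity",
          "apple": "fruit", "fruit": "entity", "entity": None}
COUNTS = {"car": 40, "bicycle": 10, "vehicle": 5,
          "apple": 30, "fruit": 15, "entity": 0}

def ancestors(concept):
    """The concept itself followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def probability(concept):
    """P(c): share of corpus occurrences of c or of any concept below c."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return subsumed / total

def information_content(concept):
    """IC(c) = -log P(c)."""
    return -math.log(probability(concept))

def resnik_similarity(a, b):
    """IC of the most informative concept that subsumes both a and b."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(information_content(c) for c in common)

print(resnik_similarity("car", "bicycle"))
print(resnik_similarity("car", "apple"))
```

In this toy taxonomy "car" and "bicycle" share the ancestor "vehicle", so their similarity is IC(vehicle), whereas "car" and "apple" only share the root, whose probability is 1 and whose information content is therefore 0.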

Figure 4.1: Fragment of WordNet taxonomy

4.1.3 Sense Net ranking algorithm
We consider the sentence under consideration and the given query to be sets of words,
similar to a bag-of-words model. Unlike a bag-of-words model, however, we give
importance to the order of the words. Stop words are rejected from the set and only the
root forms of the words are taken into account. If W is the ordered set of n words in the
given sentence and Q is the ordered set of m words in the query, then we compute a
network of sense weights between all pairs of words in W and Q. The sense network is
therefore defined by the weights

$$\xi_{i,j}, \quad w_i \in W,\ q_j \in Q$$

where $\xi_{i,j}$ is the value of sense/semantic similarity between $w_i \in W$ and $q_j \in Q$.

Figure 4.2: A sense network formed between a sentence and a query.

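Given a sense network, in which $\xi_{j,i}$ denotes the sense similarity between sentence
word $w_j$ and query word $q_i$, we define the distance of a word as its position in its
ordered set:

$$d(w_i) = i \tag{4.3}$$

$$d(q_j) = j \tag{4.4}$$

The word with maximum sense similarity to query word $q_i$ is

$$M(q_i) = w_{j^*}, \quad j^* = \arg\max_{j} \xi_{j,i} \tag{4.5}$$

and the corresponding value of the similarity is

$$V(q_i) = \xi_{j^*,i} \tag{4.6}$$

The exact match score $E$ is computed from the sum $\sum_{i} V(q_i)$ over the query words.

The average sense similarity of query word $q_i$ with the sentence W is

$$S(q_i) = \frac{1}{n} \sum_{j=1}^{n} \xi_{j,i}$$

Therefore the total average sense per word is

$$S = \frac{1}{m} \sum_{i=1}^{m} S(q_i) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \xi_{j,i}$$

Let T be the ordered set of $M(q_i)$ for all $i \in [1, m]$, in increasing order of $d(q)$, and let
$\theta_i$ be the distance of the i-th element in T. The alignment score $k$ is then computed as
a sum of $\mathrm{sgn}(\cdot)$ terms over the $\theta_i$.

The total average noise $\delta$ is defined in terms of n, m, the exact match score $E$ and the
noise decay factor.

With $\mu$ the noise penalty coefficient, $\psi$ the exact match coefficient, $\lambda$ the sense
similarity coefficient and $\nu$ the order coefficient, the total score is

$$\eta = \psi \times E + \lambda \times S + \mu \times \delta + \nu \times k$$
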

The coefficients are fine tuned depending on the type of corpus. Unlike newswire data,
most of the information found on the internet is badly formatted, grammatically
incorrect and often not well formed. So when the web is used as the knowledge base we
use the following values of the coefficients: $\mu = 1.0$, $\psi = 1.0$, $\lambda = 0.25$, $\nu = 0.125$
and a noise decay factor of 0.25, but when using a local corpus we reduce $\mu$ to 0.5 and
the noise decay factor to 0.1. Once we obtain the total score for each sentence, we sort
the sentences according to these scores. We take the top t sentences and consider the
plausible answers within them. If an answer appears with frequency f in the sentence
ranked r, then that answer gets a confidence score $C(\mathrm{ans})$ built on the term
$(1 + \ln f)$. Again, all answers are sorted according to confidence score and the top
answers (5 in our case) are returned along with the corresponding sentence and URL
(figure 4.3).

Figure 4.3: A sample run for the question “Who performed the first human heart

Chapter5. Implementation and Results

Our question answering module is written in Java, which makes the software cross
platform and highly portable. It uses various third party APIs for NLP and text
engineering: GATE, the Stanford parser, JSON and the Lucene API, to name a few. Each
module is designed keeping in mind space and time constraints. The URL reader
module is multi-threaded to keep download time to a minimum. Most of the
pre-processing is done via the GATE processing pipeline. More information is provided
in appendix B.

Figure 5.1: Various modules of the QnA system along with each one's basic task. The
tasks listed in the figure are:
- Multi threaded URL reader implementation
- Multi threaded URL reader interface
- Stopwords filter class
- Computes Sense/Semantic similarity between words
- Stores a generic URL along with number of attempts to read it
- Trains the weighted feature vector SVM classifier
- Load GATE processing resources
- Implements a standard porter stemmer
- Main class that handles user queries
- Uses Google and Yahoo search engine queries to build the corpus
- Sense Net implementation
- ArrayList of Ranked Sentences with helper methods
- Weighted feature vector SVM classifier
5.1 Results
The idea of building an easily accessible question answering system which uses the web
as a document collection is not new. Most of these systems are accessed via a web
browser. In the later part of the section we compare our system with other web QA
systems. The tests were performed on a small set of fifty web based questions. The
reason we did not use questions from TREC QA is that the TREC questions now appear
quite frequently (sometimes with correct answers) in the results of web search engines,
which could have affected the results of any web based study. For this reason a new
collection of fifty questions was assembled to serve as the test set. Also, we do not have
access to the AQUAINT corpus, which is the knowledge base for TREC QA systems.
The questions within the new test set were chosen to meet the following criteria:


1. Each question should be an unambiguous factoid question with only one known
answer. Some of the chosen questions do have multiple answers, although this is
mainly due to incorrect answers appearing in some web documents.
2. The answer to a question should not depend on the time at which the question
is asked. This explicitly excludes questions such as “Who is the
President of the US?”
The questions are provided in Appendix A.
For each question in the set, the table below shows the (minimum) rank at which the
correct answer was obtained. Where the system fails to answer a question, we give the
reason for the failure. The time spent on the various tasks is also shown, which helps
in determining the feasibility of using the system in a real-time environment. We used
the top 5 documents to construct our corpus, which restricts our coverage to 64%. In
effect, 64% is the upper bound on the accuracy of our system.

Q.   @ rank Remarks                                              Time in seconds

1    5                                                           8.5     13      0.82
2    NA     NE recognizer not designed to handle this question.  0       0       0
3    1                                                           11      9.77    0.38
4    4                                                           8.6     10.23   0.41
5    1                                                           6.4     13.33   0.55
6    1                                                           7.8     15      0.51
7    NA     NE recognizer not designed to handle this question.  0       0       0
8    1                                                           4.1     16.3    1.1
9    1                                                           5.2     11.8    0.43
10   1                                                           6.4     12.23   0.61
11   NA     Question Classifier                                  0       0       0
12   3                                                           8.0     14.5    0.2
13   1                                                           7.37    11.2    0.71
14   1                                                           8.1     15.7    0.88
15   NA     Incorrect Answer                                     6.54    13.5    0.47
16   1                                                           6.9     11.78   0.53
17   5                                                           6.2     17.2    0.91
18   1                                                           7.1     14.63   0.42
19   2                                                           6.99    16.1    0.54
20   1                                                           8.2     12.31   0.45
21   NA     NE recognizer not designed to handle this question.  0       0       0
22   1                                                           7.66    11.9    0.61
23   NA     NE recognizer not designed to handle this question.  0       0       0
24   NA     NE recognizer not designed to handle this question.  0       0       0
25   1      Answer changed                                       11.2    14.7    0.62
26   NA     Incorrect Answer                                     5.5     8       0.23
27   NA     NE recognizer not designed to handle this question.  0       0       0
28   1                                                           11.7    15.1    0.58
29   1                                                           6.9     10.67   0.43
30   1                                                           7.9     13.83   0.67
31   1      Incorrect Answer                                     6.67    11.5    0.47
32   4                                                           7.23    14.67   0.65
33   1                                                           7.21    16.23   0.61
34   1      Incorrect Answer                                     6.8     11.21   0.34
35   1                                                           7.4     12.0    0.36
36   1                                                           8.01    14.8    0.59
37   NA     Incorrect Answer                                     8.11    14.99   0.64
38   NA     Incorrect Answer                                     8.23    11.01   0.34
39   1                                                           6.77    10.2    0.41
40   NA     Incorrect Answer                                     8.4     16.3    0.79
41   1                                                           9.1     11.4    0.53
42   NA     Incorrect Answer                                     6.7     8.22    0.23
43   1                                                           7.8     14.3    0.43
44   NA     Incorrect Answer                                     9.2     16.1    0.62
45   1                                                           7.2     13.8    0.48
46   1                                                           11.2    15.3    0.54
47   NA     Incorrect Answer                                     7.1     12.67   0.38
48   NA     Incorrect Answer mainly because req. answer type     6.99    11.11   0.29
            was present in the query
49   2                                                           8.01    12.51   0.46
50   NA     Incorrect Answer                                     7.67    11.02   0.33

Average time spent:                                              6.6     11.24   0.45

Total number of questions: 50. Number of questions answered at:
- Rank 1: 26 (accuracy 52%)
- Rank 2: 28 (accuracy 56%)
- Rank 3: 29 (accuracy 58%)
- Rank 4: 31 (accuracy 62%)
- Rank 5: 32 (accuracy 64%)
Average time spent per question: 18.3 seconds
Note: time is dependent on network speed.

Table 5.1: Performance of the system on the web question set.

As seen, most of the failures were due to the limited NE recognizer. The
question classifier failed in only one instance. At rank 5 the system reached its
accuracy upper bound of 64%.
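
The accuracy figures listed under Table 5.1 are cumulative: the accuracy at rank k counts every question whose correct answer appears at rank k or better. The short sketch below reproduces that arithmetic. The per-rank counts (26, 2, 1, 2, 1) are obtained by differencing the cumulative totals reported under the table; the class and method names are illustrative only.

```java
// Illustrative sketch of the accuracy-at-rank arithmetic; names are hypothetical.
final class AccuracyAtRank {

    // newAtRank[i] is the number of questions first answered correctly at rank i + 1.
    // Returns the cumulative accuracy (in percent) at ranks 1..newAtRank.length.
    static double[] cumulativeAccuracy(int[] newAtRank, int totalQuestions) {
        double[] accuracy = new double[newAtRank.length];
        int answered = 0;
        for (int i = 0; i < newAtRank.length; i++) {
            answered += newAtRank[i];
            accuracy[i] = 100.0 * answered / totalQuestions;
        }
        return accuracy;
    }

    public static void main(String[] args) {
        // Differences of the cumulative totals 26, 28, 29, 31, 32 reported under Table 5.1.
        int[] newAtRank = {26, 2, 1, 2, 1};
        double[] accuracy = cumulativeAccuracy(newAtRank, 50);
        for (int k = 0; k < accuracy.length; k++) {
            System.out.printf("accuracy at rank %d: %.0f%%%n", k + 1, accuracy[k]);
        }
    }
}
```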
5.2 Comparisons with other Web Based QA Systems
We compare our system with four web-based QA systems: AnswerBus [18],
AnswerFinder, IONAUT [19] and PowerAnswer [20]. The consistently best performing
system at TREC forms the backbone of the PowerAnswer system from Language
Computer Corporation (footnote 1). Unlike in our system, each answer it returns is a
full sentence, and no attempt is made to cluster (or remove) sentences which contain
the same answer. This gives that system an undue advantage, as it performs only the
easier task of finding relevant sentences. AnswerBus (footnote 2) [18] behaves in much
the same way as PowerAnswer, returning full sentences containing duplicated answers.
It is claimed that AnswerBus can correctly answer 70.5% of the TREC 8 question set,
although we believe the performance would decrease if exact answers were being
evaluated, as experience of the TREC evaluations has shown this to be a harder task
than locating answer-bearing sentences. IONAUT (footnote 3) [19] uses its own crawler
to index the web, with a specific focus on entities and the relationships between
them, in order to provide a richer base for answering questions than the unstructured
documents returned by standard search engines. The system returns both exact answers
and snippets. AnswerFinder is a client-side application that supports natural language
questions and queries the Internet via Google. It returns both exact answers and
snippets. This system is the closest to ours.

1. http://www.languagecomputer.com/demos/
2. http://misshoover.si.umich.edu/~zzheng/qa-new/
3. http://www.ionaut.com:8400
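
To illustrate what removing duplicated answers involves, the sketch below collapses a ranked list of answer strings so that each distinct answer is kept only once, at its best rank. It is a simple normalisation-based approach chosen for illustration and is not claimed to be the clustering used by our system or by any of the systems above; the class name, method names and the normalisation rule (lower-casing and stripping punctuation, ASCII only) are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of duplicate-answer removal; not the system's actual method.
final class AnswerDeduplicator {

    // Lower-cases the answer, replaces non-alphanumeric characters with spaces
    // and collapses runs of whitespace. Assumes plain ASCII answers.
    static String normalise(String answer) {
        return answer.toLowerCase(Locale.ROOT)
                     .replaceAll("[^a-z0-9 ]", " ")
                     .trim()
                     .replaceAll("\\s+", " ");
    }

    // Keeps the first (best ranked) surface form of each distinct normalised answer.
    static List<String> deduplicate(List<String> rankedAnswers) {
        Map<String, String> firstSeen = new LinkedHashMap<>();
        for (String answer : rankedAnswers) {
            firstSeen.putIfAbsent(normalise(answer), answer);
        }
        return new ArrayList<>(firstSeen.values());
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList(
                "Albert Einstein", "albert  einstein.", "Isaac Newton", "ALBERT EINSTEIN");
        System.out.println(deduplicate(ranked));
    }
}
```

A system that skips this step can fill several of its top five positions with sentences carrying the same answer, which is why sentence-level results are not directly comparable with exact-answer results.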

The questions from the web question set were presented to the five systems on the
same day, within as short a period of time as possible, so that the underlying
document collection (in this case the web) would be relatively static and no system
would benefit from subtle changes in the content of the collection.

Figure 9.2: Comparison of AnswerBus, AnswerFinder, IONAUT, PowerAnswer
and our system

It is clear from the graph that our system outperforms all but AnswerFinder at rank 1.
This is quite important, as the answer returned at rank 1 can be considered the
final answer provided by the system. At ranks beyond 1 it performs considerably better
than AnswerBus and IONAUT, while performing marginally worse than AnswerFinder and
PowerAnswer. The results are encouraging, but it should be noted that due to the small
number of test questions it is difficult to draw firm conclusions from these experiments.
5.3 Feasibility of the system to be used in real time environment

From Table 5.1 it is clear that the system cannot currently be used for real-time
purposes: an average response time of 18.3 seconds is too high. It must be noted,
however, that document retrieval time will be significantly lower for an offline, local
corpus. Moreover, the processing of the documents can be done offline on the corpus, as
it is independent of the query. Once the corpus has been pre-processed offline, the
actual task of retrieving an answer is quick, at 0.45 seconds on average. We believe
that if we use our own crawler and pre-process the documents beforehand, our system can
retrieve answers fast enough to be used in real-time systems. The graph below shows the
percentage of time spent in the different tasks.

Figure 9.3: Time distribution of each module involved in QA
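
The percentages in such a time distribution follow directly from the average times in the last row of Table 5.1. The sketch below shows the calculation as a rough check; the class and method names are illustrative, and only the third value (0.45 seconds) is identified in the text as the answer extraction time, so the first two tasks are left unnamed here.

```java
// Illustrative sketch of the time-share arithmetic; names are hypothetical.
final class TimeShare {

    // Converts per-task times into percentages of the total time.
    static double[] percentages(double[] seconds) {
        double total = 0.0;
        for (double s : seconds) {
            total += s;
        }
        double[] share = new double[seconds.length];
        for (int i = 0; i < seconds.length; i++) {
            share[i] = 100.0 * seconds[i] / total;
        }
        return share;
    }

    public static void main(String[] args) {
        // Average task times in seconds from the last row of Table 5.1;
        // the third entry is the answer extraction time.
        double[] averages = {6.6, 11.24, 0.45};
        double[] share = percentages(averages);
        for (int i = 0; i < share.length; i++) {
            System.out.printf("task %d: %.1f%% of total%n", i + 1, share[i]);
        }
    }
}
```

With these averages the three tasks account for roughly 36%, 61% and 2.5% of the total, so answer extraction is a very small part of the response time.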
5.4 Conclusion
The main motivation behind the work in this thesis was to consider, where possible,
simple approaches to question answering which are both easily understood and quick to
operate. We observed that the performance of the system is limited by its worst
performing module: if even a single module fails, the whole system is unable to
answer. In our case the NE recognizer is the weakest link. It recognizes only a limited
set of answer types, which is not enough to obtain a good overall accuracy. We employed
machine learning techniques for question classification; their performance is good
enough that further improvements to the classifier alone would yield little benefit. We
also proposed the Sense Net algorithm as a new way of ranking sentences and answers.
Even with the limited capability of the NE recognizer, the system is on par with
state-of-the-art web QA systems, which supports the efficacy of the ranking algorithm.
The time distribution of the various modules shows that the system is quite fast at the
answer extraction stage; if used along with a local corpus which is pre-processed
offline, it can be adapted for real-time applications. Finally, our current results are
encouraging, but we acknowledge that due to the small number of test questions it is
difficult to draw firm conclusions from these experiments.

Appendix A
Small Web Based Question Set
Q001: The chihuahua dog derives its name from a town in which country? Ans: Mexico
Q002: What is the largest planet in our Solar System? Ans: Jupiter
Q003: In which country does the wild dog, the dingo, live? Ans: Australia or America
Q004: Where would you find budgerigars in their natural habitat? Ans: Australia
Q005: How many stomachs does a cow have? Ans: Four or one with four parts
Q006: How many legs does a lobster have? Ans: Ten
Q007: Charon is the only satellite of which planet in the solar system? Ans: Pluto
Q008: Which scientist was born in Germany in 1879, became a Swiss citizen in 1901 and
later became a US citizen in 1940? Ans: Albert Einstein
Q009: Who shared a Nobel prize in 1945 for his discovery of the antibiotic penicillin?
Ans: Alexander Fleming, Howard Florey or Ernst Chain
Q010: Who invented penicillin in 1928? Ans: Sir Alexander Fleming
Q011: How often does Halley’s comet appear? Ans: Every 76 years or every 75 years
Q012: How many teeth make up a full adult set? Ans: 32
Q013: In degrees centigrade, what is the average human body temperature? Ans: 37, 38
or 37.98
Q014: Who discovered gravitation and invented calculus? Ans: Isaac Newton
Q015: Approximately what percentage of the human body is water? Ans: 80%, 66%,
60% or 70%
Q016: What is the sixth planet from the Sun in the Solar System? Ans: Saturn
Q017: How many carats are there in pure gold? Ans: 24
Q018: How many canine teeth does a human have? Ans: Four
Q019: In which year was the US space station Skylab launched? Ans: 1973
Q020: How many noble gases are there? Ans: 6
Q021: What is the normal colour of sulphur? Ans: Yellow
Q022: Who performed the first human heart transplant? Ans: Dr Christiaan Barnard
Q023: Callisto, Europa, Ganymede and Io are 4 of the 16 moons of which planet? Ans:
Jupiter
Q024: Which planet was discovered in 1930 and has only one known satellite called
Charon? Ans: Pluto
Q025: How many satellites does the planet Uranus have? Ans: 15, 17, 18 or 21
Q026: In computing, if a byte is 8 bits, how many bits is a nibble? Ans: 4
Q027: What colour is cobalt? Ans: blue
Q028: Who became the first American to orbit the Earth in 1962 and returned to Space
in 1997? Ans: John Glenn
Q029: Who invented the light bulb? Ans: Thomas Edison

Q030: How many species of elephant are there in the world? Ans: 2
Q031: In 1980 which electronics company demonstrated its latest invention, the
compact disc? Ans: Philips
Q032: Who invented the television? Ans: John Logie Baird
Q033: Which famous British author wrote “Chitty Chitty Bang Bang”? Ans: Ian Fleming
Q034: Who was the first President of America? Ans: George Washington
Q035: When was Adolf Hitler born? Ans: 1889
Q036: In what year did Adolf Hitler commit suicide? Ans: 1945
Q037: Who did Jimmy Carter succeed as President of the United States? Ans: Gerald
Ford
Q038: For how many years did the Jurassic period last? Ans: 180 million, 195 – 140
million years ago, 208 to 146 million years ago, 205 to 140 million years ago, 205 to 141
million years ago or 205 million years ago to 145 million years ago
Q039: Who was President of the USA from 1963 to 1969? Ans: Lyndon B Johnson
Q040: Who was British Prime Minister from 1974-1976? Ans: Harold Wilson
Q041: Who was British Prime Minister from 1955 to 1957? Ans: Anthony Eden
Q042: What year saw the first flying bombs drop on London? Ans: 1944
Q043: In what year was Nelson Mandela imprisoned for life? Ans: 1964
Q044: In what year was London due to host the Olympic Games, but couldn’t because of
the Second World War? Ans: 1944
Q045: In which year did colour TV transmissions begin in Britain? Ans: 1969
Q046: For how many days were US TV commercials dropped after President Kennedy’s
death as a mark of respect? Ans: 4
Q047: What nationality was the architect Robert Adam? Ans: Scottish
Q048: What nationality was the inventor Thomas Edison? Ans: American
Q049: In which country did the dance the fandango originate? Ans: Spain
Q050: By what nickname was criminal Albert De Salvo better known? Ans: The Boston
Strangler

Appendix B
Implementation Details
We used JCreator (http://www.jcreator.com/) as the IDE. The code uses newer language
features such as generics, which are not compatible with any version of Java prior to
1.5. The following third-party APIs are used:

- GATE 4.0 (A General Architecture for Text Engineering), a software toolkit
developed at the University of Sheffield since 1995 - http://gate.ac.uk/
- Apache Lucene API is a free/open source information retrieval library, originally
created in Java by Doug Cutting - http://lucene.apache.org/
- JSON API, JSON or JavaScript Object Notation, is a lightweight computer data
interchange format. The API brings support to read JSON data. -
- LibSVM A Library for Support Vector Machines by Chih-Chung Chang and Chih-
Jen Lin - http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- JWNL is an API for accessing WordNet in multiple formats, as well as relationship
discovery and morphological processing -
- Stanford Log-linear Part-Of-Speech Tagger -
- WordNet 2.1, a lexical database for the English language, used to measure
sense/semantic similarity - http://wordnet.princeton.edu/

All experiments were performed on a Core 2 Duo 1.86 GHz system with 2 GB RAM. The
default heap size may not be sufficient to run the application, so the maximum heap
size should be increased to at least 512 MB using the -Xmx512m command-line option.
Some classes present in the JWNL API conflict with GATE. To resolve the issue, the
conflicting libraries belonging to GATE must not be included in the classpath.

References
[1] Miles Efron. 2008. Query expansion and dimensionality reduction: Notions of
optimality in Rocchio relevance feedback and latent semantic indexing.
Information Processing & Management, 44(1):163–180, January.
[2] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu,
and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of the 3rd Text
REtrieval Conference.
[3] Stephen E. Robertson and Steve Walker. 1999. Okapi/Keenbow at TREC-8. In
Proceedings of the 8th Text REtrieval Conference.
[4] Tom M. Mitchell. 1997. Machine Learning. Computer Science Series. McGraw-Hill.
[5] Corpora for Question Answering Task, Cognitive Computation Group at the
Department of Computer Science, University of Illinois at Urbana-Champaign.
[6] Xin Li and Dan Roth. 2002. Learning Question Classifiers. In Proceedings of the
International Conference on Computational Linguistics (COLING’02), Taipei,
Taiwan.
[7] Kadri Hacioglu and Wayne Ward. 2003. Question Classification with Support
Vector Machines and Error Correcting Codes. In Proceedings of the 2003
Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology (NAACL ’03), pages 28–30,
Morristown, NJ, USA.
[8] Ellen M. Voorhees. 1999. The TREC 8 Question Answering Track Report. In
Proceedings of the 8th Text REtrieval Conference.
[9] Ellen M. Voorhees. 2002. Overview of the TREC 2002 Question Answering Track.
In Proceedings of the 11th Text REtrieval Conference.
[10] Eric Breck, John D. Burger, Lisa Ferro, David House, Marc Light, and Inderjeet
Mani. 1999. A Sys Called Qanda. In Proceedings of the 8th Text REtrieval
Conference.
[11] Richard J. Cooper and Stefan M. Rüger. 2000. A Simple Question Answering
System. In Proceedings of the 9th Text REtrieval Conference.
[12] Dell Zhang and Wee Sun Lee. 2003. Question Classification using Support Vector
Machines. In Proceedings of the 26th ACM International Conference on Research
and Development in Information Retrieval (SIGIR’03), pages 26–32, Toronto,
Canada.
[13] Hao Wu, Hai Jin, and Xiaomin Ning. 2007. An approach for indexing, storing and
retrieving domain knowledge. In SAC ’07: Proceedings of the 2007 ACM
Symposium on Applied Computing, pages 1381–1382, New York, NY, USA.
ACM Press.
[14] Karen S. Jones, Steve Walker, and Stephen E. Robertson. 2000. A probabilistic
model of information retrieval: development and comparative experiments - part 2.
Information Processing and Management, 36(6):809–840.
[15] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. 1997. Software
infrastructure for natural language processing.
[16] George A. Miller. 1995. WordNet: A Lexical Database. Communications of the
ACM, 38(11):39–41, November.
[17] Philip Resnik. 1999. Semantic similarity in a taxonomy: An information-based
measure and its application to problems of ambiguity in natural language.
Journal of Artificial Intelligence Research, 11:95–130.
[18] Zhiping Zheng. 2002. AnswerBus Question Answering System. In Proceedings
of the Human Language Technology Conference (HLT 2002), San Diego, CA,
March 24-27.
[19] Steven Abney, Michael Collins, and Amit Singhal. 2000. Answer Extraction. In
Proceedings of the 6th Applied Natural Language Processing Conference (ANLP
2000), pages 296–301, Seattle, Washington, USA.
[20] Dan Moldovan, Sanda Harabagiu, Roxana Girju, Paul Morărescu, Finley
Lăcătuşu, Adrian Novischi, Adriana Badulescu, and Orest Bolohan. 2002. LCC
Tools for Question Answering. In Proceedings of the 11th Text REtrieval
Conference.
