WEB BASED INFORMATION
RETRIEVAL
by
Theodoros Pitikaris
A thesis submitted in partial fulfillment of the requirements for the degree of:
BSc in Computing and
Information Technology
Department of Computing
University of Surrey
UNIVERSITY OF SURREY
ABSTRACT
WEB BASED INFORMATION RETRIEVAL
by
Theodoros Pitikaris
Supervisory Committee: Dr. Bogdan Vrusias
Department of Computing
Dr. Nick Antonopoulos
Department of Computing
The World Wide Web contains vast amounts of information. This characteristic of the Web, however, can become a real obstacle for users who seek sources that are both of high quality and relevant to their information needs. In this Final Year Project we examine some information retrieval methods over web-stored information. The main focus is on whether and how software agents could potentially enhance the information retrieval process.
Another topic that we examine in this Final Year Project is the requirements, phases and evaluation process that are necessary in the software design and production process.
Table of Contents
Introduction
Final year project objectives
Final year Project Structure
Chapter 1. ANALYSIS I - LITERATURE REVIEW
GOOGLE search engine
Text Retrieval Methods
Natural Language processing
Neural Network as infrastructure in retrieval
Latent Semantic Indexing
Latent Semantic Algorithm
Advantages of Neural Network Models over Traditional IR Models
Special issues on web Information Retrieval
The Agent's Technology
Introduction
Categories of agents in more details
Chapter 2. SYSTEM Development Process
Definition of software development process
System Development Life Cycle (SDLC)
Agile Software Development in details
General Characteristics of SDLC
Requirement Gathering and Prioritization
Software requirements analysis
Requirements Gathering
Problems & Difficulties
Main techniques of Information Gathering
Chapter 3. Software Requirements Specification
Introduction
Identification
System overview
Definitions, Acronyms, and Abbreviations
Reference
General Description
User Personas and Characteristics
Product Perspective
Overview of Functional Requirements
Overview of Data Requirements
General Constraints, Assumptions, Dependencies, Guidelines
External Interface Requirements
Detailed Description of Functional Requirements
Performance Requirements
Quality Attributes
Other Requirements
Chapter 4. System Design
Methodology Chosen
System Overview
System Core and front-ends
Project development process
Chapter 5. Software Development PHASES in Details
Design Overview
Facilities
The core system
Software development platform
Integrated Development Environment
System Design
Unit Testing
Integration Testing
Chapter 6. DISCUSSION
Interesting parts during development process
Prototype evaluation
Comments on the evaluation results and related work
Overall project Evaluation
Chapter 7. Conclusions
Future work
INDEX
LIST OF TABLES
Table 1 Agile vs Waterfall methodology (available from http://en.wikipedia.org/wiki/Agile_software_development)
Table 2 Development Phases
Table 3 Sample of a Matrix candidate for SVD
List of figures
Figure 1 Google database development
Figure 2 The Waterfall Model
Figure 4 Waterfall vs. Agile
Figure 5 System Use Case
Figure 6 System State Diagram
Figure 7 Users' opinion about the system
Acknowledgments
The author wishes to express sincere appreciation to Mr Staurakakis
Emanuel and Mr Tsagatsakis John for their assistance in the preparation of
this Final year Project report.
INTRODUCTION
In 1996 the Bank of Sweden Prize in Economic Sciences in Memory of Alfred Nobel was awarded to James Mirrlees and William Vickrey for their fundamental contributions to the theory of incentives under asymmetric information.
With their work (http://www.nobel.se/economics/laureates/2001/ecoadv.pdf) they validated not only the importance of information but also the importance of accessibility to this information.
Nowadays everyone in the West, especially after the development of the internet, has access to large amounts of data, in electronic or paper form. The main problem that we usually face is that the volume of this information is so large that we cannot easily handle it, or worse, it is of no use to us.
In order to take advantage of this information we need to categorize it into thematically cohesive, and thus manageable, data. A few decades ago this was the librarians' job, but as already mentioned the volume of data has increased so dramatically (Society, 2004) that the traditional methods of indexing are not in a position to face this new challenge.
The problem gets bigger when we need to categorize new documents based on their content. Of course, many documents carry an abstract at the top; but in fact only scientific papers with a special purpose have this form. For example, an abstract is essential for a paper but not for a newspaper or a magazine article.
Some people believe that when we talk about retrieving data through the internet things are very easy, because there we have the assistance of search engines.
The Internet search engines are among the largest and most commonly used. Huge databases of millions of Web pages typically index every word on each one of the pages.
By using them, searchers expect to find every page that contains an occurrence of their search term, while the public in general hopes to find pages on the subject of the terms they enter.
The Web search engines and their databases can indeed find some pages that contain the search terms, and an occasional page that is actually about the concept represented by the search terms. But the majority of engines do not understand the content of the page; they only play with statistics and probability.
To make things worse, a lot of web designers, in order to attract more visitors to their pages, use common words in meta-keys or in the body of their web pages that have no relation to the content of the page [http://webreference.com/content/search/how.html].
In addition, knowledge is not always well-structured, and that generates extra difficulties in our effort to use and exploit the appropriate data. Furthermore, knowledge is a dynamic entity generated as a result of social interaction between actors.
In order to overcome all the aforementioned deficiencies, a number of information retrieval researchers suggest the use of intelligent agents or multi-agent environments. In that case autonomous units will guide, pro-
Information Retrieval techniques. Also in this chapter there is a gentle introduction to Agent technology.
The third chapter makes an introduction to the System Development process. In this chapter the most common System Development techniques are presented, along with a brief comparison between them.
The fourth chapter has the form of an official System Requirements Specification report.
The fifth chapter presents the methodology that was followed for the development of the final year project. In addition, a discussion about the system overview and the project development takes place.
The sixth chapter gives details about each of the development phases: the objectives, the challenges and the difficulties that were met in each phase.
The final chapter contains the conclusions and points for future work.
CHAPTER 1. ANALYSIS I - LITERATURE REVIEW

GOOGLE search engine

Google is a key player in the search engine market and is owned by Google Inc. The mission statement of the company is to: "organize the world's information and make it universally accessible and useful."
Among the largest search engines on the web, Google receives over 200 million queries each day through its various services (Economist 2006). In 2006, Google had indexed over 25 billion web pages, 1.3 billion images, and over one billion Usenet messages. It also caches much of the content that it indexes. Google operates other tools and services including Google News, Google Suggest, Froogle, and Google Desktop Search.
By checking the Web Archive website we can see that the size of the Google database is growing at a high rate.
[Figure 1: chart of Google database development over time (x-axis: Month; y-axis scale from 5.00E+09 to 2.50E+10), showing rapid growth.]
Figure 1 Google database development
To perform the above task Google uses a special algorithm called "PageRank". PageRank is a patented method (an algorithm) to assign a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E) (Ther, 2003).
"PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page". PageRank can be calculated for collections of documents of any size.
It is assumed in several research papers that the distribution is evenly divided between all documents in the collection at the beginning of the computational process. The PageRank computations require several "iterations" through the collection to adjust approximate PageRank values to reflect the theoretical true value.
The probability is expressed as a numeric value between 0 and 1 (0 and 100%). A PageRank of 0.1 thus means that the probability that a person clicking on a random link will be directed to the document is 10%.
Suppose we have the web pages A, B, C and D. The initial approximation of PageRank would be equally apportioned between these 4 documents (PageRank = 1/4 = 0.25). If pages B, C, and D each only link to A, they would each add 0.25 of PageRank to A, and so the PageRank of A would be:

$$PR(A) = PR(B) + PR(C) + PR(D)$$
Suppose now that page B also points to page C, while page D has links to all three pages. The value of the link votes is divided among all the outbound links on each page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C, and D contributes to A's PageRank the amount:

$$\frac{PR(D)}{\text{number of pages that } D \text{ points to}}$$
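To make the arithmetic concrete, here is the resulting first update for page A under the example just described (our worked example; C is assumed to still link only to A, as the text implies, and no damping is applied yet):

$$PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3} = \frac{0.25}{2} + \frac{0.25}{1} + \frac{0.25}{3} \approx 0.458$$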
One iteration of equation (1), given later in this section, is equivalent to computing x_{t+1} = Z x_t, where x_j^t = P(j) at iteration t. After convergence we have x_{T+1} = x_T, or x_T = Z x_T, which means x_T is an eigenvector of Z; furthermore, the corresponding eigenvalue is 1. In other words, a page's PageRank is equal to the damped, normalized sum of the PageRank flowing in from the pages that link to it.
The PageRank model assumes a surfer who keeps randomly clicking on links, moving between the nodes of the web graph. The probability, at any step, that the person will continue clicking is the damping factor d; it is also commonly assumed that d is set around 0.85. The damping factor, subtracted from 1, is added to the product of d and the sum of the incoming PageRank scores.
So any page's PageRank is derived in large part from the PageRanks of other pages; the damping factor adjusts the derived value downward. Google recalculates PageRank values every time it crawls the Web and rebuilds its index. Every time Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents.
If a page contains no links to other pages, it becomes a "sink" and thus terminates the random surfing process; in that case the random surfer picks another URL at random and continues surfing again. When calculating PageRank, pages with no outbound links are therefore treated as if they link out to all other pages in the collection. Their PageRank scores are thus evenly distributed among all other pages:
$$PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

where $p_1, p_2, \ldots, p_N$ are the pages under consideration, $M(p_i)$ is the set of pages that link to page $p_i$, $L(p_j)$ is the number of links coming from page $p_j$, and $N$ is the total number of pages.
The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes the PageRank eigenvector

$$\mathbf{R} = \begin{bmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{bmatrix}$$

the solution of the equation

$$\mathbf{R} = \begin{bmatrix} (1-d)/N \\ (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix} + d \begin{bmatrix} \ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \\ \ell(p_2,p_1) & \ddots & & \vdots \\ \vdots & & & \\ \ell(p_N,p_1) & \cdots & & \ell(p_N,p_N) \end{bmatrix} \mathbf{R}$$

where the adjacency function $\ell(p_i, p_j)$ is 0 if page $p_j$ does not link to $p_i$, and is normalized such that, for each $j$:

$$\sum_{i=1}^{N} \ell(p_i, p_j) = 1$$
The above gives a variant of the eigenvector centrality measure. An eigenvector indicates a direction that is preserved when the matrix is applied to a vector, while the corresponding eigenvalue indicates the factor by which the vector is scaled along that direction. The values of the PageRank eigenvector are fast to approximate and quite efficient to compute.
In general, if the probability of the random surfer being on page j is

$$P(j) = \frac{1-\beta}{N} + \beta \sum_{i \in B_j} \frac{P(i)}{|F_i|} \qquad (1)$$

where β is the probability that the random surfer follows a link (so 1 - β is the probability of jumping to a random page), N = |W| where W is the set of all nodes, F_i is the set of pages that page i links to, and B_j is the set of pages that link to page j.
Then the PageRank for page j is defined as this probability: PR(j) = P(j). Because (1) is recursive, it must be iteratively evaluated until P(j) converges. Typically, the initial distribution for P(j) is uniform. PageRank is equivalent to the primary eigenvector of the transition matrix Z:

$$Z = (1-\beta)\left[\frac{1}{N}\right]_{N \times N} + \beta M, \qquad M_{ij} = \begin{cases} \frac{1}{|F_i|} & \text{if there is an edge from } i \text{ to } j \\ 0 & \text{otherwise} \end{cases}$$

(Richardson, 2002).
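Equation (1) and the surrounding discussion translate almost directly into code. The following is a minimal power-iteration sketch in Python (ours, not code from the thesis); the function name `pagerank` and the `links` structure are our own illustrative choices, and sink pages are handled by spreading their score evenly, as described above.

```python
# Minimal power-iteration sketch for equation (1); illustrative only.
def pagerank(links, beta=0.85, iterations=50):
    """links maps each page to the list of pages it links to (F_i)."""
    pages = list(links)
    n = len(pages)
    p = {page: 1.0 / n for page in pages}          # uniform initial distribution
    for _ in range(iterations):
        new_p = {page: (1.0 - beta) / n for page in pages}  # random-jump term
        for i, out in links.items():
            if not out:                             # "sink" page: spread evenly
                for page in pages:
                    new_p[page] += beta * p[i] / n
            else:
                for j in out:                       # follow-link term: beta*P(i)/|F_i|
                    new_p[j] += beta * p[i] / len(out)
        p = new_p
    return p

# The A, B, C, D example: B links to A and C, C links to A,
# D links to all three; A has no outbound links (a sink).
print(pagerank({"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}))
```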
Text Retrieval Methods
In a conventional information retrieval system the stored text is normally identified by sets of keywords known as index terms. Requests for information are typically expressed by Boolean combinations of index terms, consisting of search terms and the Boolean operators and, or, and not. The terms characterizing the stored text may be assigned manually by trained personnel or, alternatively, automatic indexing methods can be used. In some systems one can avoid the content analysis, or indexing operation, by using words contained in the text of the documents for content identification. When all text words are used for document identification (except for common words), we consider such a system to be a full text retrieval system (Zaphiris and Zacharia, 2001).
All existing approaches to text retrieval are based on relevant terms found in the text. A typical approach would be to identify
the individual words occurring in the documents. A stop list of common function words (an, of, the, and, but, etc.) is used to delete the high-frequency words that are insufficiently specific to represent the document content. A suffix stripping routine would be applied to reduce the remaining relevant words to word stem form. At that point the vector system would assign a weighting factor to each term in the document to indicate term importance, and it would represent each document by a set, or vector, of weighted word stems.
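To make the pipeline above concrete, here is a small illustrative sketch (ours, not the thesis author's); the tiny stop list and the crude suffix-stripping rule stand in for a real stoplist and stemmer, and raw term frequency is used as the weighting factor.

```python
# Illustrative document-to-vector pipeline: stoplist, suffix stripping, weighting.
from collections import Counter

STOP_WORDS = {"an", "of", "the", "and", "but", "a", "is", "in", "to"}  # toy stoplist

def stem(word):
    # Crude stand-in for a real suffix-stripping routine such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_vector(text):
    """Return a weighted vector of word stems (weights = term frequencies)."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    stems = [stem(w) for w in words if w and w not in STOP_WORDS]
    return Counter(stems)

print(document_vector("The engines index the pages and the pages link to engines."))
# Counter({'engin': 2, 'pag': 2, 'index': 1, 'link': 1})
```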
Some other systems, like the signature files, would generate the signatures and store them in some sort of access structure like a sequential file or an S-tree.
The most straightforward way of locating the documents that contain a certain search string (term) is to search all documents for the specified string (substring test). A "string" here is a sequence of characters without "don't care" characters (Mock, 1996). If the query is a complicated Boolean expression that involves many search strings, then we need an additional step, namely to determine whether the term matches found by the substring tests satisfy the Boolean expression (query resolution). A query can also be compiled into a finite automaton: the search time for this automaton is linear in the document size, but the number of states of the automaton may be exponential in the size of the regular expression. The obvious algorithm for the substring test is as follows:
• Compare the characters of the search string against the corresponding characters of the document.
• If a mismatch occurs, shift the search string by one position to the right and continue until either the string is found or the end of the document is reached.
Although simple to implement, this algorithm is too slow. If m is the length of the search string and n is the length of the document (in characters), then it needs up to O(m·n) comparisons [2].
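A direct transcription of the two steps above into Python (a sketch of ours, not code from the thesis) makes the O(m·n) behaviour visible: the inner comparison loop can run up to m times for each of roughly n shift positions.

```python
# Naive substring test: shift-by-one matching, worst case O(m * n) comparisons.
def substring_test(document, pattern):
    n, m = len(document), len(pattern)
    for shift in range(n - m + 1):          # each possible alignment
        for k in range(m):                  # compare character by character
            if document[shift + k] != pattern[k]:
                break                       # mismatch: shift right by one
        else:
            return shift                    # full match found at this shift
    return -1                               # end of document reached

print(substring_test("web based information retrieval", "information"))  # 10
```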
Another approach to text retrieval has to do with "signature files". In this method, each document yields a bit string (a 'signature'), using hashing on its words and superimposed coding. The signatures are stored sequentially in a separate file (the signature file), which is much smaller than the original file and can be searched much faster.
The main advantage of signature files is their low storage overhead. However, the signature file size is proportional to the text database size, and that becomes a problem for massive text databases like digital libraries.
The signature files contain hashed relevant terms from documents. Such hashed terms are called signatures, and the files containing them are signature files. There are several ways to extract the signatures from the documents, four basic methods being WS (Word Signatures), SC (Superimposed Coding), BC (BitBlock Compression) and RL (Runlength Compression). For example, in Superimposed Coding (SC) the text database is divided into a number of blocks. A block Bi is associated with a signature Si, which is a fixed-length bit vector. Si is obtained by hashing each nontrivial word in the text block into a word signature and OR-ing these into the block signature. The query is hashed, using the same signature extraction method used for the documents, into the query signature Sq. The document search is then done by searching the signature file and retrieving the set of qualified signatures [Si such that Si AND Sq = Sq] (see the code sketch at the end of this subsection).

There are designs for signature file storage structures or organizations, like sequential organization, transposed file organization or bit-slice organization, and single and multi-level organizations (Goncalves et al.) like S-trees (Deerwester et al., 1990). The most recent signature file organization is called the partitioning approach, whereby the signatures are divided into partitions and then the search is limited to the relevant partitions. The motivation for the partitioning approach is a reduction in the search space, as measured either by the signature reduction ratio (the ratio of the number of signatures searched to the maximum number of signatures) or by the partition reduction ratio (the ratio of the number of partitions searched to the maximum number of partitions). Two approaches to partitioned signature files have been published (Lee et al., 1995). One uses linear hashing to hash a sequential signature file into partitions, or data buckets, containing similar signatures. The second approach uses the notion of a key, which is a substring selected from the signature by specifying two parameters: the key starting position and the key length. The signature file is then partitioned so that the signatures containing one key are in one partition. The published performance results on partitioned signature files are based mostly on simulations of small text databases and are not conclusive. There has been no attempt to address the scalability of partitioned signature files to massive text databases. Partitioned signature files grow linearly with the text database size and thus exhibit the same scalability problem as other text access structures.
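The following is a small illustrative sketch of Superimposed Coding (ours, not from the thesis): each word is hashed to a few bit positions, word signatures are OR-ed into a block signature, and a block qualifies for a query Sq when Si AND Sq = Sq. Note that such a filter can return false drops, which is why qualifying blocks are normally verified against the actual text.

```python
# Superimposed Coding sketch: fixed-length block signatures built by OR-ing
# word signatures; a block qualifies when (Si & Sq) == Sq.
import hashlib

SIGNATURE_BITS = 64   # fixed signature length
BITS_PER_WORD = 3     # bit positions set per word signature

def word_signature(word):
    sig = 0
    for seed in range(BITS_PER_WORD):
        digest = hashlib.md5(f"{seed}:{word}".encode()).hexdigest()
        sig |= 1 << (int(digest, 16) % SIGNATURE_BITS)
    return sig

def block_signature(words):
    sig = 0
    for w in words:                 # OR word signatures into the block signature
        sig |= word_signature(w)
    return sig

blocks = [["latent", "semantic", "indexing"], ["signature", "file", "methods"]]
signatures = [block_signature(b) for b in blocks]

sq = block_signature(["semantic"])  # query signature, same extraction method
qualified = [i for i, si in enumerate(signatures) if si & sq == sq]
print(qualified)                    # block 0 qualifies (false drops are possible)
```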
On the other hand, some scientists believe that we can use the inversion method in order to achieve very good retrieval results. Each document can be represented by a list of (key)words, which describe the contents of the document for retrieval purposes. Fast retrieval can be achieved if we invert on those keywords. The keywords are stored, e.g. alphabetically, in the 'index file'; for each keyword we maintain a list of pointers to the qualifying documents in the 'postings file'. This method is followed by almost all the commercial systems (Faloutsos and Oard, 1995).
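A minimal inverted index along these lines might look as follows (an illustrative sketch of ours, not the thesis code): the 'index file' maps each keyword to a postings list of document identifiers, and a conjunctive Boolean query is answered by intersecting postings lists.

```python
# Minimal inverted index: keyword -> sorted postings list of document ids.
from collections import defaultdict

def build_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)          # posting: keyword occurs in doc_id
    return {word: sorted(ids) for word, ids in index.items()}

def boolean_and(index, *terms):
    """Answer 'term1 AND term2 AND ...' by intersecting postings lists."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "web information retrieval", 2: "signature file methods", 3: "web agents"}
index = build_index(docs)
print(boolean_and(index, "web", "retrieval"))  # [1]
```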
Starting from the internet, a method called metadata-based indexing has gained a strong position in library science. Metadata is not exactly data itself; it is a kind of fellow traveller with data, supporting it from the sidelines. One definition is that 'an element of metadata describes an information resource or helps provide access to an information resource' (Faloutsos and Oard, 1995).
In the context of Web pages on the Internet, the term 'metadata' usually refers to an invisible tag attached to a Web page which facilitates the collection of information by automatic indexers; the tag is invisible in the sense that it has no effect on the visual appearance of the page when viewed using a standard Web browser such as Netscape™ or Microsoft's Internet Explorer™.
Natural Language processing
Natural language processing techniques seek to enhance performance by matching the semantic content of queries with the semantic content of documents [33, 49, 76]. Although it has often been claimed that deeper semantic interpretation of texts and/or queries will be required before information retrieval can reach its full potential, a significant performance improvement from automated semantic analysis techniques has yet to be demonstrated.
The boundary between natural language processing and shallower information retrieval techniques is not as sharp as it might first appear, however. The commonly used stoplists, for example, are intended to remove words with low semantic content. Use of phrases as indexing terms is another example of integration of a simple natural language processing technique with more traditional information retrieval methods.
Neural Network as infrastructure in retrieval
The main idea in this class of methods is to use spreading activation methods. The usual technique is to construct a thesaurus, either manually or automatically, and then create one node in a hidden layer to correspond to each concept in the thesaurus.
Jennings and Higuchi have reported results for a system designed to filter USENET news articles. Their implementation achieves reasonable performance in a large-scale information filtering task (Badal and Davies).
Latent Semantic Indexing
Latent Semantic Indexing (LSI) is a vector space information retrieval method which has demonstrated improved performance over the traditional vector space model. We begin with a basic implementation which captures the essence of the technique. From the complete collection of documents a term-document matrix is formed, in which each entry consists of an integer representing the number of occurrences of a specific term in a specific document. The Singular Value Decomposition (SVD) of this matrix is then computed and small singular values are eliminated. The effectiveness of LSI depends on the ability of the SVD to extract key features from the term frequencies across a set of documents. In order to understand this behaviour it is first necessary to develop an operational interpretation of the three matrices which make up the SVD (Deerwester et al., 1990).
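As a concrete illustration of these steps (our sketch in Python with NumPy, not the thesis implementation; the toy corpus and the choice k = 2 are ours), the term-document matrix is built, its SVD is computed, the small singular values are dropped, and documents are compared in the reduced space:

```python
# LSI sketch: term-document matrix -> truncated SVD -> reduced document vectors.
import numpy as np

docs = ["human computer interaction", "computer system interface",
        "graph of tree paths", "tree graph minors"]
terms = sorted({w for d in docs for w in d.split()})

# Term-document matrix: entry = occurrences of term i in document j.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep only the k largest singular values
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # documents in the reduced space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents 0 and 1 share 'computer'; LSI should place them close together.
print(cosine(doc_vectors[0], doc_vectors[1]))
```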
Latent Semantic Algorithm
Latent semantic analysis (LSA) is used to define the theme of a text and to generate summaries automatically. The theme information (the already known information) in a text can be represented as a vector in semantic space; the text provides new information about this theme, potentially modifying and expanding the semantic space itself. Vectors can similarly represent subsections of a text. LSA can be used to select from each subsection the most typical and most important sentence, thus generating a kind of summary automatically (Turney, 2005).
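One simple way to realize this idea (our illustrative sketch, not the thesis author's method) is to score each sentence of a subsection by its cosine similarity to the subsection's centroid vector in the reduced semantic space, and select the best-scoring sentence:

```python
# LSA-style extractive summary sketch: pick, per subsection, the sentence
# closest (by cosine) to the subsection centroid in the reduced space.
import numpy as np

def summarize(sentence_vectors, sentences):
    centroid = np.mean(sentence_vectors, axis=0)       # theme of the subsection
    scores = [v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid))
              for v in sentence_vectors]
    return sentences[int(np.argmax(scores))]           # most typical sentence

# Toy reduced-space vectors (e.g. from the LSI sketch above) for three sentences.
vectors = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]])
print(summarize(vectors, ["s1", "s2", "s3"]))          # prints "s2"
```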
Advantages of Neural Network Models over Traditional IR Models
In neural network models, information is represented as a network of weighted, interconnected nodes. In contrast to traditional information processing methods, neural network models are "self-processing" in that no external program operates on the network: the network literally processes itself, with "intelligent behavior" emerging from the local interactions that occur concurrently between the numerous network components (Reggia & Sutton, 1988). Neural network models in general are fundamentally different from traditional information processing models in at least two ways (Doszkocs, 1990).
• First, they are self-processing. Traditional information processing models typically make use of a passive data structure, which is always manipulated by an active external process/procedure. In contrast, the nodes and links in a neural network are active processing agents. There is typically no external active agent that operates on them; "intelligent behavior" is a global property of neural network models.
• Second, neural network models exhibit global system behaviors derived from concurrent local interactions among their numerous components. The external process that manipulates the underlying data structures in traditional IR models typically has global access to the entire network/rule set, and processing is strongly and explicitly sequentialized. Pandya and Macy (1996) have summarized that neural networks are natural classifiers with significant and desirable characteristics, which include but are not limited to the following:
• Resistance to noise
• Tolerance to distorted images/patterns (ability to generalize)
• Superior ability to recognize partially occluded or degraded images
• Potential for parallel processing
Furthermore, the LSA algorithm has the unique ability (Dumais) to work in a cross-language environment with fully automatic corpus analysis.
Special issues on web Information Retrieval

The techniques that should be followed when users are looking for specific information on the internet differ considerably from those used in conventional IR systems, because there are special issues relevant to user behaviour and to the nature of the data stored on the WWW.
At a glance we can identify some key reasons which justify the special nature of web information retrieval:
• Internet users often provide very short queries, and they seem unwilling to provide more input. In addition they do not pay the appropriate attention to the way they express their questions, which sometimes causes the question to be vague; thus the results returned from the search do not fit the real information need of the user.
• The collection of pages changes constantly: thousands of new pages are generated daily on the World Wide Web, others are presented in a different way, and some are removed.
• The informational usefulness of every page varies. Certain pages focus particularly on one subject, while others provide information about a set of topics without any connection between them; some pages work as directory services for other pages, and sometimes pages are totally irrelevant to the search topic.
• The quality of the information carried by each page cannot be verified in advance. Even worse, some authors provide inaccurate data, or the page is structured in such a way as to manipulate user expectations (spam).
The pre-processing of all pages living on the WWW demands a high cost in terms of time and space, and must be a continuous process since the WWW is a dynamic, living entity.
The Agent's Technology

Introduction

In computer science a software agent is an abstraction that describes computer programs that can assist the user with computer applications (Mosoud, 2004).
The term has recently been extended with the use of adjectives like intelligent agents (agents that employ AI techniques), autonomous agents (capable of modifying the way in which they achieve their objectives), distributed agents (executed on different machines), multi-agent systems (distributed agents that must communicate with each other in order to accomplish their task), mobile agents (agents that can relocate their execution to different processors) and more.
Generally speaking, the agent concept includes properties that make agents special in the Computer Science field. Among others we can refer to the following (a minimal code sketch of these properties follows the list):
• autonomy: agents operate without the direct intervention of humans or others, and have some kind of control over their actions and internal state (Castelfranchi, 1995);
• social ability: agents interact with other agents (and sometimes with humans) via some agent-communication language (Genesereth and Ketchpel, 1994);
• reactivity: agents perceive their environment (which may be the physical world, a user via a graphical user interface, a collection of other agents, the INTERNET, or perhaps all of these combined), and respond in a timely fashion to changes that occur in it (Wooldridge, 1995).
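The following toy sketch (ours, purely illustrative; all names are hypothetical) shows how these three properties are often reflected in an agent's structure: internal state under the agent's own control gives autonomy, a simple message-passing method stands in for social ability, and a perceive-decide-act cycle gives reactivity.

```python
# Toy reactive agent: internal state (autonomy), message passing
# (social ability), and a perceive-decide-act cycle (reactivity).
class Agent:
    def __init__(self, name):
        self.name = name
        self.state = {}              # internal state under the agent's control
        self.inbox = []              # messages received from other agents

    def send(self, other, message):  # stand-in for an agent-communication language
        other.inbox.append((self.name, message))

    def perceive(self, environment):
        self.state["last_seen"] = environment.get("event")

    def act(self):
        if self.state.get("last_seen") == "new_document":
            return f"{self.name}: indexing the new document"
        return f"{self.name}: idle"

a, b = Agent("A"), Agent("B")
a.send(b, "hello")                     # social ability
a.perceive({"event": "new_document"})  # reactivity
print(a.act(), "| B inbox:", b.inbox)
```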
Categories of agents in more details

Intelligent agents:
Intelligent agents' development is a branch of AI research. These types of agents are called intelligent because they have special capabilities:
Ability to Learn
