MS Project Report

Published on June 2016

Advisor: Dr. Shen

Documented by Vijay Gomatam

Index

No.  Chapter
1    Abstract
2    Introduction
3    System Architecture
4    Project Work Flow Diagram
5    Installation Requirements
6    References


ABSTRACT
Travel Search Engine is a travel-centric search engine that people can use to find activities at their favorite destinations. Users access it like any other web application, through HTTP requests and responses. Every search result is classified into one of three groups: Destination Search Results, Activities Search Results, and Web / Journal-Review Search Results. Destination Search Results groups the results that are specific to the destination area. Activities Search Results groups the results by the selected activity. Web / Journal-Review Search Results groups the results returned by third-party search service providers. The backbone of the search engine is a web service architecture built on two fundamental operations: local data queries using content indexing, and generic web queries using web services. The purpose of this project is to explore the ways a search engine can be implemented using web services and industry-standard search mechanisms. The project is implemented with Java technologies for web services and web programming.

INTRODUCTION

2.1 HOW DO SEARCH ENGINES WORK?

General web search engines do not search the World Wide Web directly. Each one searches a database of the full text of web pages, selected from the billions of pages residing on servers. A user who searches the web through a search engine is therefore always searching a somewhat stale copy of the real page. Only when the user clicks a link in the results is the current version of the page retrieved.

Search engine databases are selected and built by robot programs called spiders. Although spiders are said to "crawl" the web in their hunt for pages to include, in truth they stay in one place. They find pages for potential inclusion by following the links in the pages already in their database. After spiders find pages, they pass them on to another program for "indexing". This program identifies the text, links, and other content in each page and stores it in the search engine's database files, so that the database can be searched by keyword and by whatever more advanced approaches are offered. A page is found if the search matches its content.

A meta-search engine takes a different approach from the database-oriented one. The user submits keywords in its search box, and it transmits the search simultaneously to several individual search engines and their databases of web pages. Within a few seconds, results come back from all the search engines queried. Meta-search engines do not own a database of web pages; they send the search terms to the databases maintained by search engine companies. A few meta-search implementations include clustering and linguistic analysis that attempts to show themes within the results, along with textual analysis and display that can help the user dig deeply into a set of results. Few meta-searchers, however, give access to the largest, most useful search engine databases. They tend to return results from smaller and/or free search engines and miscellaneous free directories, which are often small and highly commercial.
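The way a spider discovers pages, by following the links in pages it already knows, can be illustrated with a breadth-first traversal. The following is only a minimal sketch, assuming Java 9 or later: the class name SpiderSketch, the in-memory link map and the example URLs are hypothetical, and a real spider would fetch and parse pages over HTTP instead of reading a map.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SpiderSketch {
    // Breadth-first discovery: start from the seed pages and follow their links.
    // 'links' maps a page to the pages it links to (a stand-in for fetching the page).
    static Set<String> discover(Map<String, List<String>> links, List<String> seeds) {
        Set<String> seen = new LinkedHashSet<>(seeds);
        Deque<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String page = frontier.removeFirst();
            for (String target : links.getOrDefault(page, List.of())) {
                // Only pages not seen before are queued, so link cycles terminate.
                if (seen.add(target)) {
                    frontier.addLast(target);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a.example/home", List.of("a.example/hotels", "b.example/hiking"),
            "a.example/hotels", List.of("a.example/home"),
            "b.example/hiking", List.of("c.example/trails"));
        System.out.println(discover(web, List.of("a.example/home")));
    }
}
```

Pages that no known page links to are never discovered, which is one reason a search engine's database covers only part of the web.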

2.2 LOCATING AND ACCESSING SEARCH DATA

How quickly search data can be retrieved depends largely on how well the data is organized. Fast search engines such as Nutch and Google index their data files, which stores the information in an organized structure and makes searching faster.

Lucene is a Java-based open-source framework for text indexing and searching. It is flexible, fully customizable and fast. It provides the building blocks for a search engine tailored to the requirements and integrates directly with a web application, so any Java application can use Lucene as the core of its search functionality. Lucene works with any kind of text data. It has no built-in support for Word, Excel, PDF or XML documents, but solutions exist to support each of them.

One important point about Lucene is that it is only a search library: it has no built-in web GUI or web crawler. I have therefore used Lucene to build the search solution for the Travel Search Engine application by adding servlets and JSP pages that process the input query and display the results. Lucene internally creates its own file-based indexes and organizes the text data accordingly. It works in two fundamental steps: indexing documents according to an index structure, and searching the indexed documents for a query.

1) Indexing: The first step is to create a Lucene index in a directory, using one of several analyzer algorithms. A Lucene index is a collection of documents organized so that information can be retrieved quickly for arbitrary queries. Each document in a Lucene index is made up of one or more fields, which are name-value pairs, much like entries in a HashMap. The fundamental concepts in Lucene are index, document, field and term.

a) Index: An index contains a sequence of documents.
b) Document: A document is a sequence of fields.
c) Field: A field is a named sequence of terms.
d) Term: A term is a string.

The same string in two different fields is considered a different term. Terms are therefore represented as a pair of strings, the first naming the field and the second naming the text within the field. Each field in a document can be defined as any combination of stored, indexed and tokenized. If a field is stored, its contents are fully retrievable after a successful search. If a field is indexed, its content can be referenced in a query and searched. If a field is tokenized, its content is broken into one or more tokens before being indexed.

Depending on the size of the data, Lucene creates many groups of files that share a name and differ in extension. Each of these groups is known as a “segment”. Lucene keeps track of the segments using a file called “segments”. During indexing, Lucene occasionally needs to update the segments in the index. While this update is in progress, Lucene creates a lock to prevent data corruption.
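The index, document, field and term concepts above can be illustrated with a small in-memory inverted index. This is a sketch only and does not use the Lucene API: the class TinyIndex, its methods and its simple tokenizer are hypothetical, and it assumes Java 16 or later for the record syntax. It shows why a term is a (field, text) pair: the same word indexed under two different fields leads to different postings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyIndex {
    // A term pairs the field name with the token text.
    record Term(String field, String text) {}

    // "Stored" fields: the original name-value pairs, retrievable by document number.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // "Indexed" fields: each term maps to the numbers of the documents containing it.
    private final Map<Term, Set<Integer>> postings = new HashMap<>();

    // Adds a document (field name -> field value) and returns its document number.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // "Tokenized": lower-case the value and split on non-alphanumeric characters.
            String value = field.getValue().toLowerCase(Locale.ROOT);
            for (String token : value.split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(new Term(field.getKey(), token),
                                             t -> new TreeSet<>()).add(docId);
                }
            }
        }
        return docId;
    }

    // Returns the numbers of the documents whose given field contains the given word.
    Set<Integer> lookup(String field, String text) {
        return postings.getOrDefault(new Term(field, text.toLowerCase(Locale.ROOT)), Set.of());
    }

    // Returns the stored fields of a document.
    Map<String, String> document(int docId) {
        return documents.get(docId);
    }
}
```

With one document whose destination is "Paris" and another whose activity mentions "Paris-style cooking", a lookup of "paris" in the destination field returns only the first document, and the same word in the activity field returns only the second.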

2) Searching: Once the indexes are built, the search mechanism accesses the indexes and queries their contents. The search can be performed on a single index or on multiple indexes; either way, the results are collected in a single result set. The search method returns Hits, an ordered collection of the documents matching the query. The QueryParser parses the query string and builds an appropriate Query object. For consistent results, it is recommended that queries be parsed with the same Analyzer that was used when indexing the documents.
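The recommendation to use the same analyzer for indexing and for query parsing can be made concrete with a toy example. The sketch below does not use Lucene: the class AnalyzerMatch and its two "analyzers" (one that lower-cases tokens and one that keeps them verbatim) are hypothetical stand-ins chosen only to show the effect of a mismatch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class AnalyzerMatch {
    // Hypothetical analyzers: both split on whitespace, only the first lower-cases.
    static final Function<String, List<String>> LOWERCASING =
        text -> Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    static final Function<String, List<String>> VERBATIM =
        text -> Arrays.asList(text.trim().split("\\s+"));

    // Token -> names of the documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Indexes a document using the tokens produced by the given analyzer.
    void index(String docName, String text, Function<String, List<String>> analyzer) {
        for (String token : analyzer.apply(text)) {
            postings.computeIfAbsent(token, t -> new HashSet<>()).add(docName);
        }
    }

    // Returns the documents that contain every token of the analyzed query.
    Set<String> search(String query, Function<String, List<String>> analyzer) {
        Set<String> result = null;
        for (String token : analyzer.apply(query)) {
            Set<String> docs = postings.getOrDefault(token, Set.of());
            if (result == null) {
                result = new HashSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Set.of() : result;
    }
}
```

If a document is indexed with the lower-casing analyzer, the query "Yosemite hiking" finds it when analyzed the same way, but finds nothing when analyzed verbatim, because the token "Yosemite" was never written to the index in that form.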

2.3 DEVELOPMENT AND DEPLOYMENT

To develop the Travel Search Engine, I chose Struts, an open-source MVC framework for Java-based web applications. Struts is designed for J2EE applications. It provides the basic skeleton and plumbing, a set of Java code designed to make the development process easier and more scalable. Struts can be used to build complex applications as a series of basic components: Views, Action classes and Model components. The framework makes it easier to isolate the presentation layer of the application from the business logic layer. Every user action is mapped to a business handler in a configuration file. This provides the flexibility to change the business components without any change to the presentation part.

Struts is based on the time-proven Model-View-Controller (MVC) design pattern, in which processing is broken into three distinct sections: the Model, the View and the Controller.

The “Model” represents the business logic of the application. It consists of standard Java classes and has no prescribed format; typically it is represented as a JavaBean.

“View” components are the pieces of the application that present information to users and accept input from web pages. Views are generally built using JavaServer Pages (JSP), XSL and HTML files. Struts provides a large number of custom JSP tags, which extend the normal capabilities of JSP and simplify the development of View components.

The “Controller” coordinates activities in the application. The ActionServlet, which functions as the controller, centralizes the logic for dispatching requests to the implementing Action object based on the request URL, input parameters and application state. The controller also handles view selection, which decouples JSP pages and servlets from one another. It provides a single point of control for security and logging, and encapsulates the incoming data into a form usable by the back-end model.

For deployment, I selected Apache Tomcat, an open-source servlet container and the official reference implementation for Java Servlets and JavaServer Pages named by Sun Microsystems. Tomcat is widely available, easy to install and works well with Struts-based development. The WAR (Web Archive) file generated by the Ant build is placed in the “webapps” folder of the Tomcat container. When the Tomcat server starts, it deploys the WAR file and unpacks all the classes and libraries.
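The controller dispatch described in this section can be sketched in plain Java. This is not the Struts API: the class FrontControllerSketch, the Action interface, the paths and the view names are hypothetical, and in Struts the path-to-Action table comes from a configuration file, not from code. The sketch only illustrates the idea of a single entry point that selects a handler by request path and then selects a view.

```java
import java.util.HashMap;
import java.util.Map;

class FrontControllerSketch {
    // Hypothetical stand-in for an Action class: reads the request parameters,
    // fills the model, and returns the name of the view to render.
    interface Action {
        String execute(Map<String, String> params, Map<String, Object> model);
    }

    // Request path -> Action; plays the role of the mapping in the configuration file.
    private final Map<String, Action> mappings = new HashMap<>();

    void register(String path, Action action) {
        mappings.put(path, action);
    }

    // Single point of control: look up the Action for the path, run it,
    // and return the view name it selected.
    String dispatch(String path, Map<String, String> params, Map<String, Object> model) {
        Action action = mappings.get(path);
        if (action == null) {
            return "notFound.jsp";
        }
        return action.execute(params, model);
    }
}
```

Because views refer only to the model and actions only return view names, an Action can be replaced in the mapping table without touching any page, which is the flexibility the configuration-driven design is meant to provide.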


SYSTEM ARCHITECTURE
The system is developed to demonstrate how the Travel Search Engine serves the users’ search query requests. The overall system architecture can be divided into four sections:
• Graphical User Interface
• Business Layer
• Lucene-based Content Index Searching
• Integration-Tier Using JIBX Binding

GRAPHICAL USER INTERFACE
The presentation layer is built using HTML/JSP pages. An HTML text input field accepts the user's search criteria and sends them as a query to the back-end process. The UI provides the user with two options: general web search results or journal-review search results.

[Figure: HTTP request, Tomcat / Struts, HTTP response]

The back-end business process parses the query to determine the following:
1) Destination
2) Activity
3) Member ID (tied to destination)
4) Weather (tied to destination)
5) Combination of destination and/or activity results
   a. Can be retrieved as web results
   b. Can be retrieved as journal-review results
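The first two items in the list above, deciding which words of a query name a destination and which name an activity, could be sketched as follows. The report does not describe how this decision is made, so everything here is an assumption for illustration: the class QueryClassifier, the ParsedQuery record and the two vocabularies of known words are hypothetical. The sketch assumes Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QueryClassifier {
    // Result of parsing: the destination words and activity words found in the query.
    record ParsedQuery(List<String> destinations, List<String> activities) {}

    // Splits the query into words and sorts each word into a bucket by looking it up
    // in two known-word sets (assumed vocabularies); unknown words are ignored.
    static ParsedQuery parse(String query, Set<String> knownDestinations,
                             Set<String> knownActivities) {
        List<String> destinations = new ArrayList<>();
        List<String> activities = new ArrayList<>();
        for (String word : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (knownDestinations.contains(word)) {
                destinations.add(word);
            } else if (knownActivities.contains(word)) {
                activities.add(word);
            }
        }
        return new ParsedQuery(destinations, activities);
    }
}
```

In this sketch a query that yields both a destination and an activity corresponds to the combined case in item 5, while a query that yields only one of them would be answered from the matching group alone.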

BUSINESS LAYER
The business layer is implemented using Struts, an MVC-based architecture. The controller accepts all user requests and dispatches each one to the corresponding Search Action implementation object. The mapping of each user request to its business implementation object is provided in the configuration file, which is processed by the controller, i.e., the ActionServlet class. Each business implementation object, an Action class, contains the logic to process the query and get the results from the back-end process. The model data is then passed on to JSP pages, which render the result back to the user.

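[Figure: Struts MVC request flow. The client browser sends an HTTP request (event) to the controller servlet, which is configured by struts-config.xml. The controller dispatches to the business logic, which updates the model (application state), and forwards to the view JSP. The view gets data from the model through tags and returns the HTTP response.]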

LUCENE-BASED CONTENT INDEX SEARCHING
Lucene, a Java-based open-source toolkit, has its own content indexing algorithms and performs the search after parsing the query string.
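
[Figure: Lucene search flow. An HTTP request reaches Tomcat / Struts; the query is parsed using the Lucene API; the parsed query is run against the Lucene file system; the hits collection is returned in the HTTP response. A separate HTTP request goes to the web search, which returns an XML response.]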

Lucene is used to convert the large amount of text data, containing destination, activity and member ID details, into well-organized documents. It indexes the documents into segments, which makes the search faster. Initially, I retrieved the content index file and used Lucene to index it into two folders:
1) Destination, which contains documents related to the destination and activity index.
   a) /destination/_ckl.cfs: compound file format for storing destination and activity.
   b) /destination/segments: keeps track of the destination and activity indexes.
2) Members, which contains documents related to member ID.
   a) /members/_asyz.cfs: compound file format for storing the members.
   b) /members/segments: keeps track of the member indexes.
When the user enters a search query, Lucene parses the query as a preliminary step and then searches for it in its self-organized file system.

INTEGRATION-TIER USING JIBX BINDING
The integration tier uses binding to process the XML responses retrieved from the third-party service, and ultimately delivers Java objects to the business logic tier. JiBX is an open-source framework for binding XML data to Java objects. The JiBX framework handles all the details of converting data to and from XML, based on the application's own class structures and the instructions in a binding definition. JiBX is designed to perform the translation between internal data structures and XML with very high efficiency, but still allows a high degree of control over the translation process.

[Figure: integration tier. Tomcat / Struts sends an HTTP request to the external service, which services the HTTP request and returns an Open Travel Alliance [OTA] standard XML response based on web search or site search [also based on content indexing]. The integration tier binds the XML response to Java objects using JiBX and passes the Java objects to the business logic tier.]

JiBX makes the binding process easier, faster and more efficient. The binding involves:
a) the binding definition XML file, i.e., SearchRes-jibx.xml
b) the XML response from the third-party Nutch API
c) the converted Java objects

BINDING DEFINITION FILE

    <binding>
      <mapping name="SearchResults" class="SearchResultsTypeImpl">
        <structure name="Hits" type="HitsType">
          <value name="Query" set-method="setQuery" get-method="getQuery"/>
          <value name="TotalHitCount" set-method="setTotalHitCount" get-method="getTotalHitCount"/>
          <value name="From" set-method="setFrom" get-method="getFrom"/>
          <value name="To" set-method="setTo" get-method="getTo"/>
        </structure>
        … … …
      </mapping>
    </binding>

OTA STANDARD XML DOCUMENT

    <?xml version="1.0" encoding="UTF-8"?>
    <SearchResults>
      <Hits>
        <Query> </Query>
        <TotalHitCount> </TotalHitCount>
        <From>1</From>
        <To>10</To>
        <Hit>
          <No>1</No>
          <URL> </URL>
          <Title> </Title>
          <Summary> </Summary>
          <Id> </Id>
        </Hit>
        … …
        <Hit>
          <No>10</No>
          <URL> </URL>
          <Title> </Title>
          <Summary> </Summary>
          <Id> </Id>
        </Hit>
      </Hits>
      <Stats>
        <TotalTime> </TotalTime>
        <SearchTime> </SearchTime>
        <SpellcheckerTime> </SpellcheckerTime>
      </Stats>
    </SearchResults>

JAVA CLASSES

    public class SearchResultsTypeImpl {
        public HitsType Hits;
        ...
    }

    public class HitsType {
        public String Query;
        public String TotalHitCount;
        public String From;
        public String To;

        // Getter returns the stored value.
        public String getQuery() {
            return Query;
        }

        // Setter stores the value passed in.
        public void setQuery(String Query) {
            this.Query = Query;
        }
        … … ...
    }

In JiBX, the binding process is handled in two fundamental steps:

BINDING COMPILER
JiBX uses binding definition documents to define the rules for how Java objects are converted to or from XML. After the search project has been compiled into class files, the first part of the JiBX framework, the binding compiler, is run using “jibx-bind.jar”. This compiler enhances the binary class files produced by the Java compiler, adding code that converts instances of the classes to or from XML. After the binding compiler has run, the normal steps of assembling the application continue.

BINDING RUNTIME
The enhanced class files generated by the binding compiler use this runtime component both to build objects from an XML input document (called unmarshalling, in data binding terms) and to generate an XML output document from objects (called marshalling). Travel Search Engine performs unmarshalling to build objects from the OTA response.
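To show what unmarshalling produces, the sketch below builds a HitsType-like object from a SearchResults document by hand, using the DOM parser that ships with the JDK. It is a stand-in and not JiBX: JiBX generates equivalent conversion code from the binding definition, so none of this is written manually in the project. The class UnmarshalSketch, the lower-case field names and the error handling are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class UnmarshalSketch {
    // Mirrors the four HitsType values named in the binding definition.
    static class HitsType {
        String query;
        String totalHitCount;
        String from;
        String to;
    }

    // Unmarshalling by hand: parse the XML, then copy element text into object fields.
    static HitsType unmarshal(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element hitsElement = (Element) doc.getElementsByTagName("Hits").item(0);
            HitsType hits = new HitsType();
            hits.query = text(hitsElement, "Query");
            hits.totalHitCount = text(hitsElement, "TotalHitCount");
            hits.from = text(hitsElement, "From");
            hits.to = text(hitsElement, "To");
            return hits;
        } catch (Exception e) {
            // Covers malformed XML as well as missing elements.
            throw new IllegalArgumentException("Malformed search response", e);
        }
    }

    // Returns the trimmed text of the first descendant element with the given tag name.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

A binding framework removes the need to maintain this kind of field-by-field copying whenever the response format gains or loses an element.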


PROJECT WORK FLOW DIAGRAM
[Figure: project work flow. Elements recovered from the diagram:]
• Search request words: the search criteria (destination and/or activity) and the search option.
• Query parsing within the Struts framework.
• Lucene-based content indexing: the content index file, which contains destination, activity and destination member details indexed in segments, is used to retrieve the destination and activity details for each destination and activity list.
• HTTP request to the external server with the option of web search or site search. The external server services the HTTP request and returns an Open Travel Alliance [OTA] standard XML response based on web search or site search [also based on content indexing]. Web search returns web results; site search returns journal results plus destination member details.
• Business logic tier and integration tier: the XML response is bound to Java objects using JiBX.


INSTALLATION REQUIREMENTS

JVM Environment: Because the project uses features introduced in JDK 1.5, a JVM of version 1.5 is required.
Application Server: Tomcat 5.0.28
Browser: Internet Explorer 6 or later
Project Deployment: The Ant build utility creates a WAR file, which can be deployed onto the container.

DEVELOPMENT ENVIRONMENT
1) Languages: Java 1.5, JSP, Servlets, XML, XSLT, HTML and JavaScript
2) Framework: Apache Jakarta Struts 1.2.7
3) Content Indexing Algorithm: Apache Lucene
4) External Search API: Nutch, built on Lucene
5) Binding: JIBX
6) Build: Apache Jakarta Ant 1.6.5
7) Parsers: DOM, Xalan
8) Loggers: Apache Jakarta Log4J 1.2.12
9) Web Server: Apache Tomcat 5.0.28

REFERENCES
1) LUCENE:
a. Apache Lucene: http://lucene.apache.org/
b. “Search Enable Your Application with Lucene”, Java Developers Journal, Dec 2002
c. “Lucene in Action”, Erik Hatcher and Otis Gospodnetić, Manning Publications Co, Dec 2004

2) STRUTS:
a. Apache Struts: http://struts.apache.org/
b. “Struts Kick Start”, James Turner and Kevin Bedell

3) JIBX:
JiBX: http://jibx.sourceforge.net/

4) NUTCH:
Apache Nutch Website: http://lucene.apache.org/nutch/

5) WEB SERVICES:
“Beginning Java Web Services”, H. Bequet, M.M. Kunnumpurath, S. Rhody, A. Tost, Wrox Press, Mar 2003.

6) TOMCAT:
Apache Tomcat: http://tomcat.apache.org/

7) ANT:
Apache Ant: http://ant.apache.org/

8) LOG4J:
Apache Logging Services: http://logging.apache.org/log4j
