Neural Networks for Web Content Filtering


Pui Y. Lee and Siu C. Hui, Nanyang Technological University
Alvis Cheuk M. Fong, Massey University, Albany
IEEE Intelligent Systems, September/October 2002

The Intelligent Classification Engine uses neural networks' learning capabilities to provide fast, accurate differentiation between pornographic and nonpornographic Web pages. The basic framework can also serve to distinguish other types of Web content.

With the proliferation of harmful Internet content such as pornography, violence, and hate messages, effective content-filtering systems are essential. Many Web-filtering systems are commercially available, and potential users can download trial versions from the Internet. However, the techniques these systems use are insufficiently accurate and do not adapt well to the ever-changing Web. (For more on current approaches and systems, see the "Web Content-Filtering Approaches" and "Current Web Content-Filtering Systems" sidebars.)

To solve this problem, we propose using artificial neural networks (ANNs)1,2 to classify Web pages during content filtering. We focus on blocking pornography because it is among the most prolific and harmful Web content. According to CyberAtlas, pornography-related terms such as "sex" and "porn" are among the top 20 search terms queried at the 10 leading Internet portals and search engines.3 Furthermore, research suggests that pornography is addictive and causes harmful side effects.4 However, our general framework is adaptable for filtering other objectionable Web material.

Know the enemy

How do pornographic Web pages differ from others? We attempt to answer this question by studying these pages' characteristics and analyzing data.

Characteristics
Understanding pornographic Web pages' characteristics can help us develop effective content analysis techniques. Although it is well known that pornographic Web sites contain many sexually oriented images, text and other information can also help us distinguish these sites. We focus on three characteristics: page layout format, use of PICS (Platform for Internet Content Selection) ratings, and indicative terms.

Page layout format. We treat all the HTML documents used to construct a multiframe Web page as a single entity. This is because statistics obtained from any aspect of a Web page should be derived as a whole from the aggregated data collected from every HTML document that is part of that page.

PICS use. Web publishers can use PICS labels to limit access to Web content (see the "Web Content-Filtering Approaches" sidebar). A PICS label is usually distributed with the associated Web page by one of these methods:

• The Web publisher (or Web content owner) embeds the label in the HTML code's header section.
• The Web server inserts the label in the HTTP packet's header section before sending the page to a requesting client.

In both cases, we can't determine whether a Web page has a PICS label simply by inspecting the Web page contents displayed in a browser. The first method requires viewing the Web page's HTML code, which you can easily do with any major Web browser that supports such functionality. In contrast, the second method requires checking the HTTP packet's header section, which you can't do with any known browser. Consequently, to analyze pornographic Web sites' use of PICS systems, we collect statistics from the samples' HTML code.

Indicative terms. Terms (words or phrases) that indicate pornographic Web pages fall into two major groups according to their meanings and use.
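The two PICS distribution methods described under "PICS use" can be checked mechanically. The sketch below is our illustration, not the article's implementation: it looks for an embedded PICS-Label META tag in a page's HTML and for a server-inserted PICS-Label field in a map of HTTP headers; the sample page and label are hypothetical.

```python
import re

def pics_label_from_html(html: str):
    """Method 1: the publisher embeds the label in the HTML header section."""
    match = re.search(
        r'<meta\s+http-equiv=(["\'])PICS-Label\1\s+content=(["\'])(.*?)\2',
        html, re.IGNORECASE | re.DOTALL)
    return match.group(3) if match else None

def pics_label_from_http(headers: dict):
    """Method 2: the server inserts the label in the HTTP packet's header section."""
    for name, value in headers.items():
        if name.lower() == "pics-label":
            return value
    return None

# Hypothetical page carrying an RSACi label in its header section.
page = ('<html><head><meta http-equiv="PICS-Label" content=\'(PICS-1.1 '
        '"http://www.rsac.org/ratingsv01.html" l r (n 4 s 4 v 0 l 0))\'>'
        '</head><body></body></html>')

print(pics_label_from_html(page) is not None)                       # True: label embedded in HTML
print(pics_label_from_http({"Content-Type": "text/html"}) is None)  # True: no label in these headers
```

Only the first check is possible when all you have is the saved HTML of a sample page, which is why the statistics in this article come from the samples' HTML code.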

48

1094-7167/02/$17.00 © 2002 IEEE

Web Content-Filtering Approaches
The four major content-filtering approaches are Platform for Internet Content Selection, URL blocking, keyword filtering, and intelligent content analysis.

Platform for Internet Content Selection
PICS is a set of specifications for content-rating systems (for URLs for these rating systems, see the "Useful URLs" sidebar). It lets Web publishers associate labels or metadata with Web pages to limit certain Web content to target audiences.

The two most popular PICS content-rating systems are RSACi and SafeSurf. Created by the Recreational Software Advisory Council, RSACi uses four categories: harsh language, nudity, sex, and violence. For each category, it assigns a number indicating the degree of potentially offensive content, ranging from 0 (none) to 4. SafeSurf is a much more detailed content-rating system. Besides identifying a Web site's appropriateness for specific age groups, it uses 11 categories to describe Web content's potential offensiveness. Each category has nine levels, from 1 (none) to 9.

Currently, Microsoft Internet Explorer, Netscape Navigator, and several Web-filtering systems offer PICS support and can filter Web pages according to the embedded PICS rating labels. However, PICS is a voluntary self-labeling system, and each Web content publisher is totally responsible for rating the content. Consequently, Web-filtering systems should use PICS only as a supplementary filtering approach.
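For illustration, the rating clause of an RSACi-style label can be decoded in a few lines. The label string and the minimal regex below are only a sketch, not a full PICS parser:

```python
import re

def rsaci_levels(label: str) -> dict:
    """Extract the four RSACi category levels (0-4) from a label's rating
    clause: l = harsh language, n = nudity, s = sex, v = violence."""
    clause = re.search(r'r\s*\(([^)]*)\)', label)  # the "r (...)" rating clause
    if not clause:
        return {}
    pairs = re.findall(r'\b([lnsv])\s+(\d)', clause.group(1))
    return {category: int(level) for category, level in pairs}

# Hypothetical label: nudity and sex at the maximum level, no violence or harsh language.
label = '(PICS-1.1 "http://www.rsac.org/ratingsv01.html" l r (n 4 s 4 v 0 l 0))'
print(rsaci_levels(label))  # {'n': 4, 's': 4, 'v': 0, 'l': 0}
```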

Keyword filtering
This intuitively simple approach blocks access to Web sites on the basis of the occurrence of offensive words and phrases on those sites. It compares every word or phrase on a retrieved Web page against those in a keyword dictionary of prohibited words and phrases. Blocking occurs if the number of matches reaches a predefined threshold. This fast content analysis method can quickly determine if a Web page contains potentially harmful material.

However, the approach is well known for overblocking, that is, blocking many Web sites that do not contain objectionable content. Because it filters content by matching keywords (or phrases) such as "sex" and "breast," it could accidentally block Web sites about sexual harassment or breast cancer, or even the home page of someone named Sexton. Although the dictionary of objectionable words and phrases does not require frequent updates, the high overblocking rate greatly jeopardizes a Web-filtering system's capability and is often unacceptable. However, a Web-filtering system can use this approach to decide whether to further process a Web page using a more precise content analysis method, which usually requires more processing time.
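The threshold rule described above fits in a few lines. The dictionary, threshold, and pages below are invented for illustration; the naive matching is precisely what causes overblocking:

```python
def keyword_block(page_text: str, dictionary: set, threshold: int) -> bool:
    """Block the page if the number of prohibited-keyword hits reaches the threshold."""
    words = page_text.lower().split()
    hits = sum(1 for w in words if w.strip('.,!?:') in dictionary)
    return hits >= threshold

prohibited = {"sex", "porn", "breast"}  # hypothetical keyword dictionary

# A medical page is blocked just as readily as a pornographic one:
medical = "Breast cancer screening: early detection of breast tumours saves lives"
print(keyword_block(medical, prohibited, threshold=2))  # True, i.e., overblocking
```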

URL blocking
This technique restricts or allows access by comparing the requested Web page's URL (and equivalent IP address) with URLs in a stored list. Two types of lists can be maintained: a black list contains URLs of objectionable Web sites to block; a white list contains URLs of permissible Web sites. Most Web-filtering systems that employ URL blocking use black lists.

This approach's chief advantages are speed and efficiency. A system can make a filtering decision by matching the requested Web page's URL with one in the list even before a network connection to the remote Web server is made. However, this approach requires implementing a URL list, and it can identify only the sites on the list. Also, unless the list is updated constantly, the system's accuracy will decrease over time owing to the explosive growth of new Web sites. Most Web-filtering systems that use URL blocking employ a large team of human reviewers to actively search for objectionable Web sites to add to the black list. They then make this list available for downloading as an update to the list's local copy. This is both time consuming and resource intensive.

Despite this drawback, the approach's fast and efficient operation is desirable in a Web-filtering system. Using sophisticated content analysis techniques during classification, the system can first identify the nature of a Web page's content. If the system determines that the content is objectionable, it can add the page's URL to the black list. Later, if a user tries to access the Web page, the system can immediately make a filtering decision by matching the URL. Dynamically updating the black list achieves speed and efficiency, and accuracy is maintained provided that content analysis is accurate.
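A black-list check is a constant-time set lookup on the requested URL's host, which is what makes the approach fast. The host names below are hypothetical placeholders, not a real black list:

```python
from urllib.parse import urlsplit

BLACK_LIST = {"bad-example.com", "www.bad-example.com"}  # hypothetical entries

def blocked(url: str) -> bool:
    """Decide before any network connection is made, purely from the URL's host."""
    host = urlsplit(url).hostname or ""
    return host.lower() in BLACK_LIST

print(blocked("http://bad-example.com/index.html"))  # True
print(blocked("http://example.org/"))                # False: unlisted sites pass until the list is updated
```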

Intelligent content analysis
A Web-filtering system can use intelligent content analysis to automatically classify Web content. One interesting method for implementing this capability is artificial neural networks, which can learn and adapt according to the training cases fed to them.1 (Our Intelligent Classification Engine uses this method; see the main article.) Such a learning and adaptation process can give semantic meaning to context-dependent words, such as "sex," which can occur frequently in both pornographic and nonpornographic Web pages.

To achieve high classification accuracy, ANN training should involve a sufficiently large number of training exemplars, including both positive and negative cases. The ANN inputs can be features extracted from the Web pages, such as the occurrence of keywords and key phrases and hyperlinks to other similar Web sites.

Reference
1. G. Salton, Automatic Text Processing, Addison-Wesley, Boston, 1989.

Most are sexually explicit terms; the rest consist primarily of legal terms used to establish the legal conditions of use of the material. Legal terms often appear because many pornographic Web sites' entry pages contain a warning message block.

Most indicative terms occur in the pornographic Web page's text. We can extract them from different locations of the corresponding HTML document that might contain information useful for distinguishing pornographic Web sites. These locations are

• The Web page title
• The warning message block
• Other viewable text in the Web browser window
• The "description" and "keywords" metadata
• The Web page's URL and other URLs embedded in the Web page
• The image tooltip (the text string displayed when a user points to an object using a mouse; it usually occurs in the <IMG> tag)
• Graphical text (sometimes graphics or images contain text that we can extract)

Indicative terms might be displayed or nondisplayed in the Web browser window. Displayed terms appear in the Web page title, warning message block, other viewable text, and graphical text. Nondisplayed terms are stored in the URL, the "description" and "keywords" metadata, and the image tooltip.
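Several of these locations can be pulled from an HTML document with standard tooling. The sketch below is an illustration, not the authors' extractor: it collects the title, the "description" and "keywords" metadata, and image tooltips, here taken from the ALT attribute of <IMG> tags:

```python
from html.parser import HTMLParser

class LocationExtractor(HTMLParser):
    """Collect a few of the locations where indicative terms may occur."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}      # "description" and "keywords" metadata
        self.tooltips = []  # ALT text of <IMG> tags, shown as a tooltip
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() in ("description", "keywords"):
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
        elif tag == "img" and attrs.get("alt"):
            self.tooltips.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = """<html><head><title>Sample page</title>
<meta name="keywords" content="sample, demo"></head>
<body><img src="x.gif" alt="a sample image"></body></html>"""

parser = LocationExtractor()
parser.feed(page)
print(parser.title, parser.meta, parser.tooltips)
```

Graphical text, by contrast, requires optical character recognition on the images themselves and is not recoverable from the markup alone.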


Current Web Content-Filtering Systems
Web-filtering systems are either client- or server-based. A client-based system performs Web content filtering solely on the computer where it is installed, without consulting remote servers about the nature of the Web content that a user tries to access. A server-based system provides filtering to computers on the local area network where it is installed. It screens outgoing Web requests, analyzes incoming Web pages to determine their content type, and blocks inappropriate material from reaching the client's Web browser.

Table A summarizes the features of 10 popular Web-filtering systems (for URLs for these systems, see the "Useful URLs" sidebar). Only two systems specifically filter pornographic Web sites. In Table A, an asterisk marks each system's main approach. (For a description of the approaches, see the "Web Content-Filtering Approaches" sidebar.) No system relies on PICS (Platform for Internet Content Selection) as its main approach. Six systems rely mainly on URL blocking, and only two systems mainly use keyword filtering. I-Gear incorporates both URL blocking and Dynamic Document Review, a proprietary technique that dynamically looks for matches with keywords.

Only WebChaperone (which is no longer sold or supported) employs content analysis as its main approach. It uses Internet Content Recognition Technology (iCRT), a unique mechanism that dynamically evaluates each Web page before the system passes the page to a Web browser. iCRT analyzes word count ratios, page length, page structure, and contextual phrases. The system then aggregates the results according to the attributes' weighting. On the basis of the overall results, WebChaperone identifies whether the Web page contains pornographic material.

To gauge the filtering approaches' accuracy, we evaluated five representative systems: Cyber Patrol, Cyber Snoop, CYBERsitter, SurfWatch, and WebChaperone. We installed each system on an individual computer and visited different Web sites while running the system. We collected the URLs of 200 pornographic and 300 nonpornographic Web pages and used them for evaluation. To more easily measure each approach's accuracy, we limited each system to only its major approach. So, we configured Cyber Patrol and SurfWatch to use URL blocking, Cyber Snoop used keyword filtering, CYBERsitter used context-based key phrase filtering, and WebChaperone used iCRT.

Table B summarizes our results. We obtained the overall accuracy by averaging the percentage of correctly classified Web pages for both pornographic and nonpornographic pages. For the two systems that employ URL blocking, the number of incorrectly classified nonpornographic Web pages is small compared to those using keyword filtering. This shows that the black list (a list of URLs to block; see the "Web Content-Filtering Approaches" sidebar) the specialists compiled accurately excludes URLs of most nonpornographic Web sites, even if the Web sites contain sexually explicit terms in a nonpornographic context. However, both systems have fairly high occurrences of incorrectly classified pornographic Web pages. This highlights the problem of keeping the black list up to date.

Systems that rely on keyword filtering tend to perform well on pornographic Web pages, but the percentage of incorrectly classified nonpornographic Web pages can be high. This highlights the keyword approach's major shortcoming. On the other hand, WebChaperone, which uses iCRT, achieves the highest overall accuracy: 91.6 percent. This underlines the effectiveness of intelligent content analysis.

Table A. A comparison of 10 popular Web-filtering systems. An asterisk indicates each system's main approach.

System        Location  PICS support  URL blocking  Keyword filtering  Content analysis  Filtering domain
Cyber Patrol  Client    Yes           Yes*          Yes                No                General
Cyber Snoop   Client    Yes           Yes           Yes*               No                General
CYBERsitter   Client    Yes           Yes           Yes* (A)           No                General
I-Gear        Server    Yes           Yes*          Yes* (B)           No                General
Net Nanny     Client    No            Yes*          Yes                No                General
SmartFilter   Server    No            Yes*          No                 No                General
SurfWatch     Client    Yes           Yes*          Yes                No                General
WebChaperone  Client    Yes           Yes           Yes                Yes* (C)          Pornographic
Websense      Server    No            Yes*          Yes (D)            No                General
X-Stop        Client    Yes           Yes*          Yes (E)            No                Pornographic

A. Context-based key-phrase filtering technique
B. Employed in Dynamic Document Review and software robots
C. Content analysis carried out using iCRT (Internet Content Recognition Technology)
D. Employed only in Web crawlers
E. Employed in MudCrawler

Table B. The performance of five Web-filtering systems.

System        Major approach                       Pornographic (200 total)       Nonpornographic (300 total)    Overall accuracy
                                                   Correct        Incorrect       Correct        Incorrect
Cyber Patrol  URL blocking                         163 (81.5%)    37 (18.5%)      282 (94.0%)    18 (6.0%)       87.75%
SurfWatch     URL blocking                         171 (85.5%)    29 (14.5%)      287 (95.7%)    13 (4.3%)       90.60%
Cyber Snoop   Keyword filtering                    187 (93.5%)    13 (6.5%)       247 (82.3%)    53 (17.7%)      87.90%
CYBERsitter   Context-based key phrase filtering   183 (91.5%)    17 (8.5%)       255 (85.0%)    45 (15.0%)      88.25%
WebChaperone  iCRT                                 177 (88.5%)    23 (11.5%)      284 (94.7%)    16 (5.3%)       91.60%
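The overall-accuracy column in Table B is the unweighted mean of the two per-class accuracy percentages rather than a pooled accuracy over all 500 pages, as a quick recomputation shows:

```python
def overall_accuracy(porn_correct, porn_total, nonporn_correct, nonporn_total):
    """Average the two per-class accuracy percentages, as Table B does."""
    porn_pct = 100 * porn_correct / porn_total
    nonporn_pct = 100 * nonporn_correct / nonporn_total
    return (porn_pct + nonporn_pct) / 2

print(overall_accuracy(163, 200, 282, 300))  # 87.75 (Cyber Patrol)
# WebChaperone: about 91.58; Table B reports 91.60 because it averages
# the rounded per-class figures 88.5 and 94.7.
print(overall_accuracy(177, 200, 284, 300))
```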




Nondisplayed text is contained in HTML tags (blocks that begin with a "<" and end with a ">"); displayed text is not.

Statistics
For sample collection, we consider that a pornographic Web site fulfills any of these conditions:

• It contains sexually oriented content.
• It contains erotic stories and text descriptions of sexual acts.
• It shows images of sexual acts, including inanimate objects used sexually.
• It shows erotic full or partial nudity.
• It contains sexually violent text or graphics.

To generally represent such Web sites, we captured 100 different pornographic Web sites and defined their entry pages as samples for data collection.

Page layout format. We found that 86 percent of the entry pages used a single-frame layout. Among the 14 multiframe pages, 13 had a two-frame layout; only one Web page had a three-frame layout. Although a single-frame layout appears prevalent among pornographic Web sites, our techniques must still be able to process and analyze multiframe Web pages.

PICS use. We also found that 89 percent did not use any PICS support. Of the 11 pages that did, nine used the RSACi content-rating system; the rest used SafeSurf (see the "Web Content-Filtering Approaches" sidebar). We suspect PICS use is low among pornographic Web sites because these sites' publishers might not want their sites filtered out. The results might not be statistically representative of overall PICS use among pornographic Web sites. However, they do indicate that implementing PICS support in a Web-filtering system is useful; some pornographic sites do have PICS labels embedded in the HTML code's header section. So, checking PICS labels works best as a supplementary approach for positive identification.

Indicative terms. We compiled a list of 55 indicative terms comprising 42 sexually explicit terms and 13 legal terms. Tables 1 and 2 summarize the distribution of these terms in eight locations. As Table 1 shows, more than 90 percent of the terms appear in the "keywords" metadata, while approximately 88 percent appear in other viewable text.

Useful URLs
Cyber Patrol, www.cyberpatrol.com
Cyber Snoop, www.cyber-snoop.com/index.html
CYBERsitter, www.cybersitter.com
I-Gear, www.symantec.com/sabu/igear
Net Nanny, www.netnanny.com/home/home.asp
Platform for Internet Content Selection, www.w3.org/PICS
SafeSurf, www.safesurf.com
SmartFilter, www.smartfilter.com
SurfWatch, www.surfcontrol.com
Websense, www.Websense.com
X-Stop, www.xstop.com

Table 1. Use of sexually explicit terms in eight locations of 100 samples.

Location                 Unique terms (42 total)   Frequency of occurrence (5,673 total)
Web page title           28 (66.67%)               399 (7.03%)
Warning message block    18 (42.86%)               344 (6.06%)
Other viewable text      37 (88.10%)               1,919 (33.83%)
"description" metadata   27 (64.29%)               464 (8.18%)
"keywords" metadata      38 (90.48%)               1,438 (25.35%)
URL                      29 (69.05%)               581 (10.24%)
Image tooltip            31 (73.81%)               419 (7.39%)
Graphical text           22 (52.38%)               109 (1.92%)

Table 2. Use of legal terms in eight locations of 100 samples.

Location                 Unique terms (13 total)   Frequency of occurrence (708 total)
Web page title           2 (15.38%)                23 (3.25%)
Warning message block    13 (100%)                 355 (50.14%)
Other viewable text      8 (61.54%)                135 (19.07%)
"description" metadata   1 (7.69%)                 31 (4.38%)
"keywords" metadata      1 (7.69%)                 72 (10.17%)
URL                      1 (7.69%)                 51 (7.20%)
Image tooltip            2 (15.38%)                39 (5.51%)
Graphical text           1 (7.69%)                 2 (0.28%)

These two locations together contribute more than 59 percent of the total sexually explicit term occurrences, while other locations have occurrences of approximately six to 10 percent, except graphical text, which contributes only 1.92 percent. Table 2 shows that all 13 legal terms appear in the warning message block; this location contributes more than 50 percent of the total occurrences of legal terms. Also, more than 72 percent of the legal term occurrences appear in the Web pages’ displayed text.
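The aggregate figures quoted above follow directly from the tables' frequency columns; a quick recomputation (using the Table 1 and Table 2 counts) confirms them:

```python
# Frequencies of occurrence by location, copied from Tables 1 and 2.
explicit = {"title": 399, "warning": 344, "viewable": 1919, "description": 464,
            "keywords": 1438, "url": 581, "tooltip": 419, "graphical": 109}
legal = {"title": 23, "warning": 355, "viewable": 135, "description": 31,
         "keywords": 72, "url": 51, "tooltip": 39, "graphical": 2}

total_explicit = sum(explicit.values())  # 5,673
total_legal = sum(legal.values())        # 708

# "keywords" metadata plus other viewable text: more than 59 percent of explicit-term occurrences.
top_two = 100 * (explicit["keywords"] + explicit["viewable"]) / total_explicit

# Displayed locations (title, warning block, viewable text, graphical text):
# more than 72 percent of legal-term occurrences.
displayed = ("title", "warning", "viewable", "graphical")
legal_displayed = 100 * sum(legal[k] for k in displayed) / total_legal

print(round(top_two, 1), round(legal_displayed, 1))  # 59.2 72.7
```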

Our ANN approach
We could use statistical methods to classify Web pages according to the statistical occurrence of certain features. Such methods include k-nearest neighbor classification,5 linear least-squares fit,6 linear discriminant analysis,7 and naïve Bayes probabilistic classification.8 However, because real-world data such as we're using tend to be noisy and not clearly defined, linear or low-order statistical models cannot always describe them. So, we use ANNs because they are robust enough to fit a wide range of distributions accurately and can model high-degree, nonlinear relationships.

On the basis of our knowledge of pornographic-Web-site characteristics, we developed the Intelligent Classification Engine. The engine comprises two major processes: training and classification (see Figure 1). During training, the system learns from sample pornographic and nonpornographic Web pages to form a knowledge base of the ANN models. The system then classifies incoming Web pages according to their content.

Training
The Intelligent Classification Engine extracts the Web page's representative features as inputs for training the ANN. Training's objective is to let the ANN configure


itself and adjust its weight parameters according to the training exemplars, to facilitate generalization beyond the training samples. This requires a large set of training exemplars to obtain an ANN that produces statistically accurate results. In our case, the training exemplar is a vector representing the sample Web page. Because our objective is to distinguish pornographic Web pages from nonpornographic Web pages, we have collected a total of 3,777 nonpornographic Web pages and 1,009 pornographic Web pages for training. To extensively cover a variety of Web pages with different kinds of content nature, we gathered the nonpornographic Web pages from 13 categories of the Yahoo! search engine: Arts & Humanities, Business & Economy, Computers & Internet, Education, Entertainment, Government, Health, News & Media, Recreation & Sports, Reference, Science, Social Science, and Society & Culture. The result was a training set of 4,786 exemplars.

[Figure 1. The Intelligent Classification Engine. The training path runs training Web pages through feature extraction, preprocessing, transformation, the neural network, and NN model generation to category assignment, storing the neural network models, cluster-to-category mapping, transformation weights, and indicative term dictionary in the engine database. The classification path runs online Web pages through feature extraction, preprocessing, transformation, the neural network, categorization, and metacontent checking to produce the classification results.]

Training has five steps: feature extraction, preprocessing, transformation, ANN model generation, and category assignment.

Feature extraction. We use frequencies of occurrence of indicative terms in a Web page to judge its relevance to pornography. Our analysis of pornographic Web pages indicates that indicative terms most likely appear in

• Displayed contents consisting of the Web page title, the warning message block, and other viewable text
• Nondisplayed contents including the "description" and "keywords" metadata, image tooltips, and URLs

However, we excluded URLs because of the difficulties of identifying indicative terms in a URL address. Because words in a URL are concatenated and are not separated by white spaces, extracting words or terms from it will be difficult. Furthermore, occurrences of indicative terms in the URLs in pornographic Web pages contribute only a small percentage to the total occurrences of indicative terms. So, excluding the URLs will not compromise the Web page's relevance to pornography and will help minimize the system's computational cost.

Feature extraction, then, parses a Web page to identify the relevant information that indicates the page's nature or characteristics. The engine separates the extracted information into four types:

• Web page title
• Displayed contents, including the warning message and other viewable contents
• "description" and "keywords" metacontents (the information contained in the metadata fields)
• Image tooltip

Preprocessing. This step converts the raw text extracted from a Web page into numeric data representing the frequencies of occurrence of indicative terms. Figure 2 shows this step, which has two phases: tokenization and indicative-term identification and counting.

The extracted text must first be parsed into a list of single words. Because we initially focus on English-language Web sites, the tokenization algorithm treats white spaces and punctuation marks as separators between words. It converts each extracted word into lower case before inserting it into a list. To facilitate identifying phrasal terms, this phase preserves the word sequence in the list as it was in the Web page's raw text. Because feature extraction produced four types of contents for each Web page, tokenization produces four word lists: one each for the Web page title, the displayed contents, the "description" and "keywords" metacontents, and the image tooltip. Because each list has a different degree of relation to the nature of the Web page, the lists will carry different weights when training the ANN.

The engine then feeds the four lists into indicative-term identification and counting. This phase uses an indicative-term dictionary to identify the indicative terms in the four lists. (We manually compiled this dictionary according to the analysis we described in the section "Know the enemy.") It collects the occurrences for each set of indicative terms in the dictionary. Because there are 55 distinctive sets of indicative terms, the system obtains 55 frequencies of occurrence. Also, the system counts the total number of indicative-term occurrences and the total number of words in each list.

Transformation. This step converts the group of numbers resulting from preprocessing into the Web page vector to feed into the ANN. Transformation involves three phases: vector encoding, weight multiplication, and normalization.

Vector encoding uses the data from preprocessing to form the basic vector. It has a dimension of 61 and consists of

• the 55 frequencies of occurrence of indicative terms (55 elements);
• the total number of indicative-term occurrences in each of the four lists (four elements), which represents the relative importance of the four locations in the Web page;
• the total number of an indicative term's occurrences in the four lists (one element), which represents the term's degree of relation to the nature of the Web page;
• the total number of words in the four lists (one element), which indicates the total number of words in the Web page and represents the amount of Web page contents.

These elements form the vector representing a Web page owing to their significance in aiding Web page content identification. The system uses the last two elements to train the ANN on the relation between a Web page's total number of indicative terms and total number of words.

Weight multiplication weights the basic vector according to the 61 elements' relative indicativeness. An element's indicativeness is the degree of indication it contributes toward a specific type of content, in this case, pornography. For example, "xxx" has a higher relative indicativeness than "babe" does, while indicative terms in the displayed contents have a lower relative indicativeness than those in the metacontents. Set initially according to the relative indicativeness, the weights are adjusted before each ANN training session. So, the system also treats them as parameters for fine-tuning the network's performance to improve accuracy. It stores the 61 elements' weights in a weight database with the weight entry's position corresponding to the index of the respective elements.

The engine then normalizes the weighted vector according to a common Euclidean length, giving the final Web page vector to feed to the ANN. This normalization improves the numerical condition of the problem the ANN will tackle and makes the training process better behaved.

[Figure 2. Preprocessing consists of tokenization, which locates individual words, and indicative-term identification and counting. The diagram shows the four extracted content types (Web page title, displayed contents, metacontents, and image tooltip) tokenized into four word lists; indicative-term identification and counting then applies the indicative-term dictionary to produce the frequency of occurrence of the 55 indicative terms, the total indicative-term occurrences in each of the four lists, and the total words in each of the four lists.]
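The two preprocessing phases can be sketched as follows. The two-set dictionary is a made-up stand-in for the 55-set indicative-term dictionary, and real phrasal-term matching would be more involved:

```python
import re

def tokenize(text: str) -> list:
    """Phase 1: split on white space and punctuation, convert to lower case,
    and preserve word order so phrasal terms can still be located."""
    return [w.lower() for w in re.split(r"[\s\.,;:!\?\"\(\)]+", text) if w]

def count_terms(words: list, term_sets: list):
    """Phase 2: frequency of each indicative-term set, plus list totals."""
    counts = [sum(1 for w in words if w in terms) for terms in term_sets]
    return counts, sum(counts), len(words)

# Hypothetical dictionary of two indicative-term sets.
dictionary = [{"xxx"}, {"adult"}]
words = tokenize("Adult site: XXX content, adults only. XXX!")
counts, total_terms, total_words = count_terms(words, dictionary)
print(counts, total_terms, total_words)  # [2, 1] 3 7
```

In the engine, this runs once per word list, so each of the four locations yields its own frequencies and totals.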

Figure 3 illustrates an example of transformation.

ANN model generation. This step trains the ANN for online classification. Two ANNs that have proven effective for document classification are Kohonen's Self-Organizing Maps9 (KSOM) and the Fuzzy Adaptive Resonance Theory10 (Fuzzy ART).11,12 So, we evaluated them for Web content classification.

Because each vector contains 61 elements, the KSOM network requires 61 inputs. We determined the output neuron array's dimension after conducting a series of training experiments on the ANN with several different dimensions. We found that if the dimension is smaller than 7 × 7, some generated clusters contain a large mixture of pornographic and nonpornographic Web pages. However, output neuron arrays that are larger than 7 × 7 do not significantly improve accuracy. Also, training an ANN with a larger number of output neurons will take much more time. So, we consider a 7 × 7 output neuron array the best compromise between maximizing the classification distinctiveness and minimizing training complexity.

Figure 4 shows the KSOM training algorithm. Before training, we must set several parameters: the number of training iterations, the initial and final learning rates, and the initial neighborhood size and its decrement interval. For good statistical accuracy, the number of iterations should be at least 500 times larger than the number of output neurons.9 Because we have 49 neurons, we train the network for 24,500 iterations. In our implementation of the KSOM network, the learning rate linearly decreases in each iteration from an initial value of 0.6 toward the final learning rate of 0.01. The initial neighborhood size is set to five and decreases once every 4,000 iterations. Before a training session begins, we initialize the network weights with random real numbers between 0 and 1.

Each training iteration feeds the whole set of training exemplars to the network. The training session determines the winning neuron by finding the response that gives the smallest Euclidean distance between the exemplar and the weight vector. Only the weights associated with the winning neuron and its neighborhood are adjusted to adapt to the input pattern. When the training session successfully completes, the engine saves the network's weights into the ANN model. When a training session ends, clusters of

[Figure 3. An example of transformation. Vector encoding uses the preprocessing results to form the 61-element basic vector. Weight multiplication then weights the elements according to their relative indicativeness, and the weighted vector is normalized before being fed to the ANN.

In the example, elements 1-55 hold the frequencies of the 55 indicative-term sets (1 for "18 or older | 18 or over |...", 1 for "18 years old | 18 year old |...", 0 for "of age | of legal age", 0 for "adult links | adult link", ..., 3 for "xxx"); elements 56-59 hold the indicative-term totals in the image tooltip (1), displayed content (14), metacontent (8), and title (1) word lists; element 60 is the sum of all indicative terms (24); and element 61 is the sum of all words (236). After weight multiplication the vector becomes 1, 1, 0, 0, ..., 12, 3.5, 14, 32, 3, 24, 236, and after normalization 0.0042, 0.0042, 0, 0, ..., 0.0499, 0.0146, 0.0582, 0.1331, 0.0125, 0.0998, 0.9775.]
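The three transformation phases reduce to a few lines of arithmetic. The six-element vector below stands in for the full 61-element vector, and the weights are illustrative stand-ins for the engine's tuned weight database:

```python
import math

def transform(basic: list, weights: list) -> list:
    """Weight multiplication followed by normalization to unit Euclidean length."""
    weighted = [v * w for v, w in zip(basic, weights)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

# Toy vector standing in for elements 56-61 of the basic vector
# (list totals, sum of indicative terms, sum of words).
basic = [1, 14, 8, 1, 24, 236]
weights = [3.5, 1, 4, 3, 1, 1]  # illustrative relative-indicativeness weights
vector = transform(basic, weights)
print(round(sum(v * v for v in vector), 6))  # 1.0: unit Euclidean length
```

Normalizing to a common length keeps the word-count element from dominating the ANN's inputs on long pages.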

Step 1: Initialize the weight vectors of all output neurons.

Step 2: Determine the winning output neuron m by searching for the shortest normalized Euclidean distance between the input vector and each output neuron's weight vector:

    ||X − W_m|| = min_{j = 1...M} ||X − W_j||,

where X is the input vector, W_j is the weight vector of output neuron j, and M is the total number of output neurons.

Step 3: Let N_m(t) denote a set of indices corresponding to a neighborhood size of the current winner neuron m; slowly decrease the neighborhood size during the training session. Update the weights of the weight vector associated with the winner neuron m and its neighboring neurons:

    ΔW_j(t) = α(t)[X(t) − W_j(t)] for j ∈ N_m(t),

where α is a positive-valued learning factor, α ∈ [0, 1], that slowly decreases with each training iteration. So, the new weight vector is given by

    W_j(t + 1) = W_j(t) + α(t)[X(t) − W_j(t)] for j ∈ N_m(t).

Web pages are formed at the array of output neurons. These clusters might contain different proportions of pornographic and nonpornographic Web pages. Category assignment will classify each cluster. The Fuzzy ART network will have 61 nodes at its F0 layer. Owing to complement coding, each F0 node will feed data to two F1 nodes. So, the F1 layer has 122 (61 × 2) nodes. Because we wish to compare Fuzzy ART to the KSOM network, the network should generate as close to 49 (7 × 7) clusters as possible. To meet this requirement, the engine constructs 60 F2 nodes. Next, we need to select values for three Fuzzy ART network parameters: the choice parameter α, the vigilance parameter ρ, and the learning parameter β. α affects the inputs

Step 4: Repeat steps 2 and 3 for every exemplar in the training set for a user-defined number of iterations.

Figure 4. The KSOM (Kohonen's Self-Organizing Maps) training algorithm.
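A minimal sketch of this KSOM training loop, using the parameter values reported in the text (0.6 to 0.01 linear learning-rate decay, initial neighborhood of 5 shrinking every 4,000 iterations). The function name and the one-dimensional neighborhood are our simplifications; the article's engine uses a 7 x 7 output grid.

```python
import random

def train_ksom(exemplars, n_neurons=49, dim=61, iterations=24_500,
               lr_start=0.6, lr_end=0.01, hood_start=5, hood_step=4_000):
    """Sketch of the KSOM training algorithm in Figure 4
    (1-D neighborhood for brevity)."""
    # Initialize weights with random real numbers in [0, 1].
    weights = [[random.random() for _ in range(dim)] for _ in range(n_neurons)]
    for t in range(iterations):
        # Learning rate decays linearly; neighborhood shrinks stepwise.
        lr = lr_start + (lr_end - lr_start) * t / (iterations - 1)
        hood = max(0, hood_start - t // hood_step)
        for x in exemplars:  # each iteration feeds the whole training set
            # Step 2: winner = smallest Euclidean distance to the input.
            m = min(range(n_neurons),
                    key=lambda j: sum((xi - wi) ** 2
                                      for xi, wi in zip(x, weights[j])))
            # Step 3: adjust only the winner and its neighbors.
            for j in range(max(0, m - hood), min(n_neurons, m + hood + 1)):
                weights[j] = [wi + lr * (xi - wi)
                              for xi, wi in zip(x, weights[j])]
    return weights
```

Because each update is a convex combination of the old weight and the input, weights stay within [0, 1] when the inputs do, matching the normalized vectors the engine feeds in.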
IEEE INTELLIGENT SYSTEMS


computer.org/intelligent


produced at each node of the F2 layer according to all the nodes of the F1 layer. To minimize the network training time, α must have a small value.10 We chose α = 0.1.

ρ indicates the threshold of proximity to a cluster that an input vector must fulfill before a desirable match occurs. The choice of ρ affects the number of clusters that the Fuzzy ART network generates. In this case, ρ should have a value that lets the network generate approximately 49 clusters. After experimenting with the network several times, we chose ρ = 0.66.

β controls the adjustment of weight vector W_J, where node J is the chosen cluster that fulfills the vigilance match. We use the fast-commit, slow-recode updating scheme: β = 1 when the chosen cluster is an uncommitted F2 node, and β = 0.5 for a committed F2 node. Unlike the KSOM case, we do not need to decrease the learning parameter as training progresses.

Figure 5 shows the training algorithm for the Fuzzy ART network. Before the training session begins, the engine initializes the weights in all weight vectors to 1 and marks all the F2 layer nodes as uncommitted. Once the training session commences, the engine feeds the training exemplars to the network. This process carries on until all weight vectors become static. The engine collects the network parameters, the weights of all weight vectors, and the committed status of all F2 nodes and stores them as the ANN model for our Fuzzy ART implementation. As with KSOM training, the Fuzzy ART training session forms Web page clusters that category assignment will classify.

Category assignment. After the ANN generates the clusters, we still need to determine each cluster's nature. This is because a cluster defines a group of Web pages with similar features and characteristics but does not classify those similarities. The engine assigns each cluster to one of three categories: pornographic, nonpornographic, or unascertained.
Unascertained clusters contain a fair mixture of pornographic and nonpornographic Web pages. Table 3 summarizes the thresholds for category assignment. For example, if at least 70 percent of a cluster's Web pages are labeled pornographic, we map the cluster to the pornographic category. The system records the results in a cluster-to-category mapping database. Each database
SEPTEMBER/OCTOBER 2002

Step 1: Initialize the weights of all weight vectors to 1, and set all F2 layer nodes to "uncommitted."

Step 2: Apply complement coding to the M-dimensional input vector a. The resultant complement-coded vector I has 2M dimensions:

    I = (a, a^c) = (a_1, ..., a_M, a_1^c, ..., a_M^c),

where a_i^c = 1 - a_i, for i in [1, M].

Step 3: Compute the choice function value for every node of the F2 layer. For the complement-coded vector I and node j of the F2 layer, the choice function T_j is

    T_j(I) = |I ^ W_j| / (α + |W_j|),

where W_j is the weight vector of node j and α is the choice parameter. The fuzzy AND operator ^ is defined by (P ^ Q)_i = min(p_i, q_i), and the norm | | is defined as |P| = sum_{i=1..M} p_i.

Step 4: Find the node J of the F2 layer that gives the largest choice function value:

    T_J = max{T_j : j = 1 ... N}.

The output vector Y of layer F2 is thus given by y_J = 1 and y_j = 0 for j ≠ J.

Step 5: Determine whether resonance occurs by checking whether the chosen node J meets the vigilance threshold:

    |I ^ W_J| / |I| ≥ ρ,

where ρ is the vigilance parameter. If resonance occurs, update the weight vector W_J; the new W_J is given by

    W_J(new) = β(I ^ W_J(old)) + (1 - β)W_J(old).

Otherwise, set the value of the choice function T_J to 0 and repeat steps 4 and 5 until a chosen node meets the vigilance threshold.

Step 6: Perform steps 2 to 5 for every input pattern in the training set. Repeat the whole process until the weights in all weight vectors become unchanged.

Figure 5. The Fuzzy ART (Fuzzy Adaptive Resonance Theory) training algorithm.

Table 3. Thresholds for category assignment.

                      Proportion of Web pages in cluster
Cluster category      Pornographic      Nonpornographic
Pornographic          [70%, 100%]       [0%, 30%)
Nonpornographic       [0%, 30%)         [70%, 100%]
Unascertained         [30%, 70%)        [30%, 70%)
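The Fuzzy ART training procedure in Figure 5 can be sketched as follows, with the article's parameter values (α = 0.1, ρ = 0.66, fast-commit/slow-recode β). The function names and list-based vectors are our simplifications, and the sketch assumes enough uncommitted F2 nodes remain so that resonance always eventually occurs.

```python
def complement_code(a):
    # Step 2: I = (a, 1 - a), doubling the dimensionality.
    return a + [1.0 - x for x in a]

def fuzzy_and(p, q):
    # Element-wise fuzzy AND: (P ^ Q)_i = min(p_i, q_i).
    return [min(x, y) for x, y in zip(p, q)]

def train_fuzzy_art(patterns, n_f2=60, alpha=0.1, rho=0.66, beta_slow=0.5):
    """Sketch of the Fuzzy ART training algorithm in Figure 5."""
    dim = 2 * len(patterns[0])
    w = [[1.0] * dim for _ in range(n_f2)]    # Step 1: all weights = 1
    committed = [False] * n_f2
    changed = True
    while changed:                            # Step 6: until weights static
        changed = False
        for a in patterns:
            i = complement_code(a)            # Step 2
            # Step 3: choice function for every F2 node.
            t = [sum(fuzzy_and(i, wj)) / (alpha + sum(wj)) for wj in w]
            while True:
                j = max(range(n_f2), key=t.__getitem__)      # Step 4
                # Step 5: vigilance test (uncommitted nodes always pass).
                if sum(fuzzy_and(i, w[j])) / sum(i) >= rho:
                    beta = 1.0 if not committed[j] else beta_slow
                    new = [beta * m + (1 - beta) * wj
                           for m, wj in zip(fuzzy_and(i, w[j]), w[j])]
                    if new != w[j]:
                        w[j], changed = new, True
                    committed[j] = True
                    break
                t[j] = 0.0                    # mismatch reset; search again
    return w, committed
```

With fast commit, an uncommitted node's weight vector jumps straight to I ^ 1 = I, which is why uncommitted nodes absorb novel patterns in a single step.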

Table 4. Training efficiency for the Kohonen's Self-Organizing Maps and Fuzzy Adaptive Resonance Theory networks.

Attribute                      KSOM                         Fuzzy ART
Number of inputs               61                           61
Number of output neurons       49                           47
Number of iterations           24,500                       93
Number of training exemplars   4,786                        4,786
Total processing time          37 hrs., 43 min., 23 sec.    47 sec.
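The category-assignment thresholds in Table 3 reduce to a simple mapping; this sketch (the function name is ours) takes a cluster's share of pornographic training pages and returns its category.

```python
def assign_category(porn_fraction):
    """Map a cluster to a category by the proportion of its
    training pages labeled pornographic (thresholds from Table 3)."""
    if porn_fraction >= 0.70:
        return "pornographic"
    if porn_fraction < 0.30:
        return "nonpornographic"
    return "unascertained"
```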


Table 5. Training accuracy.

ANN type    Web page          Correctly classified   Incorrectly classified   Unascertained   Total
KSOM        Pornographic      911 (90.3%)            27 (2.7%)                71 (7.0%)       1,009
            Nonpornographic   3,529 (93.4%)          56 (1.5%)                192 (5.1%)      3,777
            Total             4,440 (92.8%)          83 (1.7%)                263 (5.5%)      4,786
Fuzzy ART   Pornographic      780 (77.3%)            100 (9.9%)               129 (12.8%)     1,009
            Nonpornographic   3,429 (90.8%)          88 (2.3%)                260 (6.9%)      3,777
            Total             4,209 (87.9%)          188 (3.9%)               389 (8.1%)      4,786

Table 6. Classification accuracy. The "Meta" columns indicate the number of Web pages that metacontent checking classified.

                              Correctly classified   Incorrectly classified
ANN type    Web page          ANN      Meta          ANN      Meta           Unascertained   Total
KSOM        Pornographic      499      9             23       0              4               535
            Nonpornographic   496      1             5        2              19              523
            Total             1,005 (95.0%)          30 (2.8%)               23 (2.2%)       1,058
Fuzzy ART   Pornographic      428      32            47       0              28              535
            Nonpornographic   475      8             7        9              24              523
            Total             943 (89.1%)            63 (6.0%)               52 (4.9%)       1,058
entry has a unique ID identifying one of the generated clusters and its category.

Classification

This process uses the trained ANN to classify incoming Web pages; it outputs one of the three predefined categories. Similar to the training process, it also performs feature extraction, preprocessing, and transformation for each incoming Web page. After these steps have generated the Web page vector, the system feeds the vector into the ANN to

produce an activated cluster. The categorization step uses the cluster-to-category mapping database to determine the activated cluster’s category, which it uses to classify the corresponding Web page. To further reduce the number of unascertained Web pages, we introduce a postprocessing step called metacontent checking. This step applies keyword filtering to the “description” and “keywords” metacontents in the HTML header of the unascertained Web pages. This mechanism is effective because

the metacontents include terms directly related to the associated Web pages' subject. For keywords, we use the terms in the indicative terms dictionary. If the filtering finds at least one indicative term in the metacontents, the engine classifies the associated Web page as pornographic. Otherwise, it identifies the page as nonpornographic. If the engine cannot find the metacontents or they do not exist, the page remains unascertained.

Performance evaluation

For training, we used the 4,786 Web pages (93,578,232 bytes) mentioned in the "Training" section. First, we measured the processing time required for the three pre-ANN steps (feature extraction, preprocessing, and transformation) for both training and classification. The three steps took 167 seconds to process all Web pages (an average of 35 ms per page).

Next, we measured the training efficiency and accuracy for KSOM and Fuzzy ART. Tables 4 and 5 summarize the results. Although KSOM requires a much longer training time, it produces better training accuracy and gives a smaller set of unascertained Web pages.

Finally, we measured both networks' classification accuracy. We compiled a testing exemplar set with 535 pornographic Web pages and 523 nonpornographic Web pages. Table 6 summarizes the results; the "Meta" columns indicate the number of Web pages that metacontent checking classified. From these results, we conclude that the KSOM network performed much better than

Figure 6. The engine misclassified this page as pornographic because it contains sexually explicit terms in its displayed contents and the "description" and "keywords" metacontents.

Fuzzy ART. Overall, the ANNs effectively generalized information for solving new cases, and even a simple technique such as metacontent checking proved beneficial.

To help improve the engine, we also analyzed misclassified cases. For example, the engine misclassified the Web page in Figure 6 as pornographic, mainly because the page contains many sexually explicit terms in its displayed contents and in its "description" and "keywords" metacontents. Another common cause of misclassification is insufficient text in Web pages that consist mainly of images: if a page's metacontents contain no information and the page displays no warning messages as text, the system will misclassify it.
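The metacontent-checking step described above can be sketched as follows. The function name is ours, and the regex-based extraction of the "description" and "keywords" meta tags is a simplification of whatever HTML parsing the engine actually performs (it ignores, for instance, attribute-order variations).

```python
import re

def check_metacontent(html, indicative_terms):
    """Keyword-filter the 'description' and 'keywords' metacontents
    of an unascertained Web page (simplified sketch)."""
    metas = re.findall(
        r'<meta\s+name=["\'](?:description|keywords)["\']'
        r'\s+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if not metas:
        # No metacontents found: the page remains unascertained.
        return "unascertained"
    text = " ".join(metas).lower()
    if any(term in text for term in indicative_terms):
        return "pornographic"
    return "nonpornographic"
```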

Compared to Web-filtering systems we have surveyed (see the "Current Web Content-Filtering Systems" sidebar), our Intelligent Classification Engine based on the KSOM network model is clearly more accurate. Our results indicate that the engine can automate the maintenance of a URL black list (a list of URLs to block—see the "Web Content-Filtering Approaches" sidebar) for fast, effective online filtering. Also, automated content classification minimizes the need to manually examine Web pages.

Our research is advancing in two directions. First, we are developing heuristics to complement machine intelligence for better classification accuracy. A particular focus is Web pages that do not contain much text. We are applying pattern recognition techniques for understanding graphical information. Second, we are extending the work to cover multilingual, XML, and nonpornographic but objectionable (for example, violence- and drug-related) Web pages. Because we have established the framework, the changes require only the extraction of exemplars that characterize such Web content.

The Authors

Pui Y. Lee is a graduate student in Nanyang Technological University's School of Computer Engineering. His research interests are neural networks, Web page classification, intelligent filtering, and Internet technology. He received his BASc (Hons.) in computer engineering from NTU. Contact him at the School of Computer Eng., Nanyang Technological Univ., Blk N4, #02A-32, Nanyang Ave, Singapore 639798; [email protected].

Siu C. Hui is an associate professor in Nanyang Technological University's School of Computer Engineering. His research interests include data mining, Internet technology, and multimedia systems. He received his BSc in mathematics and his DPhil in computer science, both from the University of Sussex. He is a member of the IEEE and ACM. Contact him at the School of Computer Eng., Nanyang Technological Univ., Blk N4, #02A-32, Nanyang Ave., Singapore 639798, Republic of Singapore; [email protected]; www.ntu.edu.sg/sce.

Alvis Cheuk M. Fong is a faculty member in Massey University's Institute of Information and Mathematical Sciences. His research interests include various aspects of Internet technology, information theory, and video and image signal processing. He received his BEng and MSc from Imperial College, London, and his PhD from the University of Auckland. He is a member of the IEEE and IEE, and is a Chartered Engineer. Contact him at the Inst. of Information and Mathematical Sciences, Massey Univ., Albany Campus, Private Bag 102-904, North Shore Mail Centre, Auckland, New Zealand; [email protected]; www.massey.ac.nz/~acfong.
For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.

References

1. G. Salton, Automatic Text Processing, Addison-Wesley, Boston, 1989.
2. R.P. Lippmann, "An Introduction to Computing with Neural Networks," IEEE ASSP Magazine, vol. 4, no. 2, Apr. 1987, pp. 4–22.
3. M. Pastore, "Search Engines, Browsers Still Confusing Many Web Users," CyberAtlas, INT Media Group, Darien, Conn., 14 Feb. 2001, http://cyberatlas.internet.com/big_picture/applications/article/0,,1301_588851,00.html.
4. The Effects of Pornography and Sexual Messages, National Coalition for the Protection of Children & Families, Cincinnati, Ohio, www.nationalcoalition.org/pornharm.phtml?ID=102.
5. Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization," Information Retrieval, vol. 1, nos. 1–2, Apr. 1999, pp. 69–90.
6. Y. Yang and C.G. Chute, "An Application of Least Squares Fit Mapping to Text Information Retrieval," Proc. 16th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR 93), ACM Press, New York, 1993, pp. 281–290.
7. G.J. Koehler and S.S. Erenguc, "Minimizing Misclassifications in Linear Discriminant Analysis," Decision Sciences, vol. 21, no. 1, 1990, pp. 63–85.
8. A. McCallum and K. Nigam, "A Comparison of Event Models for Naïve Bayes Text Classification," AAAI-98 Workshop on Learning for Text Categorization, AAAI Press, Menlo Park, Calif., 1998, pp. 41–48.
9. T. Kohonen, Self-Organizing Maps, Springer-Verlag, Berlin, 1995.
10. G.A. Carpenter, S. Grossberg, and D.B. Rosen, "Fuzzy ART: Fast Stable Learning and Categorization of Analog Patterns by an Adaptive Resonance System," Neural Networks, vol. 4, no. 6, July 1991, pp. 759–771.
11. G. Troina and N. Walker, "Document Classification and Searching: A Neural Network Approach," ESA Bull., no. 87, Aug. 1996; http://esapub.esrin.esa.it/bulletin/bullet87/troina87.htm.
12. D. Roussinov and H. Chen, "A Scalable Self-Organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation," Communication and Cognition—Artificial Intelligence, vol. 15, nos. 1–2, Spring 1998, pp. 81–111.

Acknowledgments
The list of products surveyed in this article is not exhaustive. We are not related to any of the vendors, and we do not endorse any of these products.