CHAPTER TWO
LITERATURE REVIEW

This chapter covers the basic topics of Data Mining: its meaning, the reasons for its application, and its various tasks, processes, techniques and application areas. It dwells particularly on the artificial neural network approach. The strengths and weaknesses of these algorithms, and when to apply them, are also discussed.
2.1 DATA MINING

Data Mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques [6].

Data mining also refers to the analysis of the large quantities of data that are stored in computers in the form of files or in databases. It is also called exploratory data analysis, among other things [5].

Data mining is not limited to business. It has been heavily used in the medical field, including the analysis of patient records to help identify best practices.
2.2 WHY DATA MINING

Data mining has caught on in a big way in recent years due to the following factors [6]:

i) Data is being produced and collected at an unprecedented rate.
ii) The data is being warehoused – data warehousing brings together data from different sources into a common format with consistent definitions for keys and fields.
iii) Computing power is affordable – the prices of disk, memory, processing power and I/O bandwidth are within reach of many ordinary businesses.
iv) Commercial data mining software products exist.

2.3 DATA MINING PROCESS

In order to systematically conduct data mining analysis, a general process is usually followed. There are some standard processes, two of which are CRISP-DM (an industry standard process consisting of a sequence of steps that are usually involved in a data mining study) and SEMMA, which stands for Sample, Explore, Modify, Model, Assess (Figure 2.2 depicts how the SEMMA phases interact) [7]. While not every step of either approach is needed in every analysis, these processes provide good coverage of the steps required, starting with data exploration, data collection, data processing, analysis, inferences drawn, and implementation.
2.3.1 CRISP-DM
CRISP-DM (Cross-Industry Standard Process for Data Mining) is widely used by industries and corporate organizations. This model consists of six phases intended as a cyclical process.
 Business Understanding: Business understanding includes
determining business objectives, assessing the current situation,
establishing data mining goals, and developing a project plan.

Figure 2.1 CRISP-DM processes


 Data Understanding: Once business objectives and the project plan are established, data understanding considers data requirements. This step can include initial data collection, data description, data exploration, and the verification of data quality.
Data exploration such as viewing summary statistics (which
includes the visual display of categorical variables) can occur at
the end of this phase. Models such as cluster analysis can also
be applied during this phase, with the intent of identifying
patterns in the data.
 Data Preparation: Once the data resources available are
identified, they need to be selected, cleaned, built into the form
desired, and formatted.
Data cleaning and data transformation in preparation for data modeling need to occur in this phase. Data exploration at a
greater depth can be applied during this phase, and additional
models utilized, again providing the opportunity to see patterns
based on business understanding.
 Modeling: Data mining software tools such as visualization
(plotting data and establishing relationships) and cluster analysis
(to identify which variables go well together) are useful for initial
analysis. Tools such as generalized rule induction can develop
initial association rules. Once greater data understanding is
gained (often through pattern recognition triggered by viewing
model output), more detailed models appropriate to the data
type can be applied. The division of data into training and test
sets is also needed for modeling.
 Evaluation: Model results should be evaluated in the context of
the business objectives established in the first phase (business
understanding). This will lead to the identification of other needs
(often through pattern recognition), frequently reverting to prior
phases of CRISP-DM. Gaining business understanding is an
iterative procedure in data mining, where the results of various


visualization, statistical, and artificial intelligence tools show the
user new relationships that provide a deeper understanding of
organizational operations.
 Deployment: Data mining can be used both to verify previously held hypotheses and for knowledge discovery (identification of unexpected and useful relationships). Through the knowledge discovered in the earlier phases of the CRISP-DM process, sound models can be obtained that may then be applied to business operations for many purposes, including prediction or identification of key situations. These models need to be
monitored for changes in operating conditions, because what
might be true today may not be true a year from now. If
significant changes do occur, the model should be redone. It’s
also wise to record the results of data mining projects so
documented evidence is available for future studies.
2.3.2 SEMMA
Figure 2.2 Schematic of SEMMA (original from SAS)

2.4 DATA MINING TASKS

 CLASSIFICATION: This consists of examining the features of a newly presented object and assigning it to one of a predefined set of classes. For our purposes, the objects to be classified are generally represented by records in databases, and the act of classification consists of updating each record by filling in a field with a class code of some kind.
The classification task is characterized by a well-defined set of classes and a training set consisting of pre-classified examples. The task is to build a model that can be applied to unclassified data in order to classify it.
 ESTIMATION: While classification deals with discrete outcomes such as yes or no, or measles, rubella, or chicken pox, estimation deals with continuously valued outcomes. Given some input data, we use estimation to come up with a value for some unknown continuous variable such as income, height, or credit card balance.
In practice, estimation is often used to perform classification tasks, and neural networks are well suited to estimation tasks.
 PREDICTION: Prediction is the same as classification and estimation, except that the records are classified according to some predicted future behaviour or estimated future value. In a prediction task, the only way to check the accuracy of the classification is to wait and see.
Any of the techniques used for classification and estimation can be adapted for use in prediction by using training examples where the value to be predicted is already known, along with historical data for those examples. The historical data is used to build a model that explains the current observed behaviour. When this model is applied to current inputs, the result is a prediction of future behaviour.
 AFFINITY GROUPING: The task of affinity grouping is to determine which things go together. It can be used to identify cross-selling opportunities and to design attractive packages or groupings of products and services. Affinity grouping is one simple approach to generating rules from data. If two items go together, two association rules can be generated.
 CLUSTERING: This is the task of segmenting a heterogeneous
population into a number of more homogeneous subgroups. It does
not rely on predefined classes and records are grouped together on
the basis of self-similarity. It is now up to you to determine what
meaning, if any, to attach to the resulting clusters.
Clustering is often done as a prelude to some other form of data
mining or modelling.
 DESCRIPTION: Sometimes the purpose of data mining is simply to describe what is going on in a complicated database in a way that increases the understanding of the people, products or processes that produced the data in the first place.
Some of the techniques discussed later in this chapter, such as market basket analysis tools, are purely descriptive. Others, like neural networks, provide next to nothing in the way of description [6].
2.5 DATA MINING ISSUES

As data mining initiatives continue to evolve, there are several issues
Congress may decide to consider related to implementation and
oversight. These issues include, but are not limited to, data quality,
interoperability, mission creep, and privacy. As with other aspects of
data mining, while technological capabilities are important, other
factors also influence the success of a project’s outcome [6].

 Data Quality
Data quality refers to the accuracy and completeness of the data. Data
quality can also be affected by the structure and consistency of the
data being analyzed. The presence of duplicate records, the lack of
data standards, the timeliness of updates, and human error can significantly impact the effectiveness of the more complex data mining
techniques, which are sensitive to subtle differences that may exist in
the data. To improve data quality, it is sometimes necessary to “clean”
the data, which can involve the removal of duplicate records,
normalizing the values used to represent information in the database,
accounting for missing data points, removing unneeded data fields,
identifying anomalous data points (e.g., an individual whose age is
shown as 142 years), and standardizing data formats (e.g., changing
dates so they all include MM/DD/YYYY).
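A brief sketch of these cleaning steps, using the pandas library under the assumption that the records live in a flat file, is shown below; the file name and column names are hypothetical, and the age check mirrors the 142-year example above.

import pandas as pd

# Illustrative data-cleaning steps; "records.csv" and all column names
# are hypothetical stand-ins for a real repository.
df = pd.read_csv("records.csv")

df = df.drop_duplicates()                                   # remove duplicate records
df["income"] = df["income"].fillna(df["income"].median())   # account for missing data points
df = df[(df["age"] >= 0) & (df["age"] <= 120)]              # drop anomalous ages (e.g., 142)
df = df.drop(columns=["unused_notes"])                      # remove unneeded data fields
df["visit_date"] = pd.to_datetime(df["visit_date"]).dt.strftime("%m/%d/%Y")  # standardize to MM/DD/YYYY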
 Interoperability
This refers to the ability of a computer system and/or data to work with other systems or data using common standards or processes. For data mining, interoperability of databases and software is important to
enable the search and analysis of multiple databases simultaneously,
and to help ensure the compatibility of data mining activities of
different agencies. Similarly, as agencies move forward with the
creation of new databases and information sharing efforts, they will
need to address interoperability issues during their planning stages to
better ensure the effectiveness of their data mining projects.
 Mission Creep
Mission creep refers to the use of data for purposes other than that for
which the data was originally collected. This can occur regardless of
whether the data was provided voluntarily by the individual or was
collected through other means. One of the primary reasons for
misleading results is inaccurate data. All data collection efforts suffer
accuracy concerns to some degree. Ensuring the accuracy of
information can require costly protocols that may not be cost effective
if the data is not of inherently high economic value. In well-managed
data mining projects, the original data collecting organization is likely
to be aware of the data’s limitations and account for these limitations
accordingly. However, such awareness may not be communicated or
heeded when data is used for other purposes.
 Privacy


As additional information sharing and data mining initiatives have
been announced, increased attention has focused on the implications
for privacy.
Concerns about privacy focus both on actual projects proposed, as well
as concerns about the potential for data mining applications to be
expanded beyond their original purposes (mission creep).
So far there has been little consensus about how data mining should
be carried out, with several competing points of view being debated.
Some observers contend that tradeoffs may need to be made
regarding privacy to ensure security. In contrast, some privacy
advocates argue in favor of creating clearer policies and exercising
stronger oversight.
2.6 BASIC STYLES OF DATA MINING

The first, hypothesis testing, is a top-down approach that attempts to substantiate or disprove preconceived ideas. The second, knowledge discovery, is a bottom-up approach that starts with the data and tries to get it to tell us something we didn’t already know [6].

2.6.1 Hypothesis
A hypothesis is a proposed explanation whose validity can be tested.
Testing the validity of a hypothesis is done by analyzing data that
may simply be collected by observation or generated through
experiment.
 The process of hypothesis testing
The hypothesis testing method has several steps:
1) Generate good ideas (hypotheses).
2) Determine what data would allow these hypotheses to be tested.
3) Locate the data.
4) Prepare the data for analysis.
5) Build computer models based on the data.
6) Evaluate the computer models to confirm or reject the hypotheses.

2.6.2 Knowledge Discovery

Undirected learning has long been a goal of artificial intelligence
researchers in the academic discipline called machine learning. In the
real world, discovering valuable patterns is worthwhile, but it is still
hard work.
Knowledge discovery can be either directed or undirected.
 Directed Knowledge Discovery
This is goal oriented. There is a specific field whose value we want
to predict, a fixed set of classes to be assigned to each record, or a
specific relationship we want to explore.
Here are the steps in the process of directed knowledge discovery:
1. Identify a source of pre-classified data.
2. Prepare the data for analysis.
3. Build and train a computer model.
4. Evaluate the computer model.
 Undirected Knowledge Discovery
Here, there is no target field. The data mining tool is simply let loose on the data with the hope that it will discover meaningful structure.
 The Process of Undirected Knowledge Discovery
Here are the steps in the process of undirected knowledge discovery:
1. Identify a source of data.
2. Prepare the data for analysis.
3. Build and train a computer model.
4. Evaluate the computer model.
5. Apply the computer model to new data.
6. Identify potential targets for directed knowledge discovery.
7. Generate new hypotheses to test.

2.7 DATA MINING TECHNIQUES/METHODS

MEMORY-BASED REASONING

Memory-based reasoning systems are a type of model, supporting the modeling phase of the data mining process. Their unique feature is that they are relatively machine driven, involving automatic classification of cases. It is a highly useful technique that can be applied to text data as well as traditional numeric data domains.
Memory-based reasoning is an empirical classification method [8]. It operates by comparing new unclassified records with known examples and patterns.
The case that most closely matches the new record is identified, using one of a number of different possible measures. Memory-based reasoning has provided the best overall classification when compared with more traditional approaches in classifying jobs with respect to back disorders [9].

 Matching: While matching algorithms are not normally found in standard data mining software, they are useful in many specific data mining applications. Fuzzy matching has been applied to discover patterns in the data relative to user expectations [10]. Java software has been used to completely automate document matching [11]. Matching can also be applied to pattern identification in geometric environments [12].

There are a series of measures that have been applied to implement memory-based reasoning. The simplest technique assigns a new observation to the pre-classified example most similar to it. The Hamming distance metric identifies the nearest neighbor as the example from the training database with the highest number of matching fields (or lowest number of non-matching fields). Case-based reasoning is a well-known expert system approach that assigns new cases to the past case that is closest in some sense. Thus case-based reasoning can be viewed as a special case of the nearest neighbor technique.
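As a concrete illustration of the Hamming-distance matching just described, the sketch below assigns a new record the class of the pre-classified example with the most matching fields. The records, field values and class labels are invented for the example.

def hamming_similarity(a, b):
    # Number of fields on which two records agree
    return sum(1 for x, y in zip(a, b) if x == y)

def classify(new_record, training_set):
    # Assign the class of the most similar pre-classified example
    # (ties broken by the first best match found)
    best = max(training_set, key=lambda ex: hamming_similarity(new_record, ex[0]))
    return best[1]

# Hypothetical pre-classified examples: (fields, class) pairs
training_set = [
    (("male", "heavy-lifting", "frequent-bending"), "high-risk"),
    (("female", "desk-work", "no-bending"), "low-risk"),
    (("male", "desk-work", "occasional-bending"), "low-risk"),
]

print(classify(("male", "heavy-lifting", "no-bending"), training_set))
# -> "high-risk": two of three fields match the first example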
 Weighted Matching
Data mining can involve deletion of variables, but the usual attitude is to retain data because you don’t know what it may provide. Weighting provides another means to emphasize certain variables over others. All that would change is that the “Matches” measure would now represent a weighted score for selection of the best matching case.
 Distance Minimization
This concept uses the distance measured from the observation to be classified to each of the observations in the known data set. In this case, nominal and ordinal data need to be converted to meaningful ratio data.
 Strengths of Memory-Based Reasoning
 It produces results that are readily understandable.
 It is applicable to arbitrary data types, even non-relational data.
 It works efficiently on any number of fields.
 Maintaining the training set requires a minimal amount of effort.
 Weaknesses of Memory-Based Reasoning
 It is computationally expensive when doing classification and prediction.
 It requires a large amount of storage for the training set.
 Results can be dependent on the choice of distance function, combination function and the number of neighbours.

2.8 ASSOCIATION RULES IN KNOWLEDGE DISCOVERY

An association rule is an expression X → Y, where X is a set of items and Y is a single item. Association rule methods are an initial data exploration approach that is often applied to extremely large data sets. Association rules mining provides valuable information in assessing significant correlations. They have been applied to a variety of fields, including medicine [13] and medical insurance fraud detection [14].
Many algorithms have been proposed to find association rules in large databases. Most, such as the Apriori algorithm, identify correlations among transactions consisting of categorical attributes using binary values. Some data mining approaches involve weighted association rules for binary values [15] or time intervals [16].

Data structure is an important issue due to the scale of data usually encountered [17]. Structured query language (SQL) has been a

fundamental tool in manipulation of database content. Knowledge
discovery involves ad hoc queries, needing efficient query compilation.
Lopes et al. considered functional dependencies in inference problems.
SQL was used by those researchers to generate sets of attributes that
were useful in identifying item clusters.
Key measures in association rule mining include support and
confidence.
 Support refers to the degree to which a relationship appears in
the data.
 Confidence relates to the probability that if a precedent occurs, a consequence will occur. The rule X → Y has minimum support value minsup if minsup percent of transactions support X → Y; the rule X → Y holds with minimum confidence value minconf if minconf percent of transactions that support X also support Y. For example, from the transactions kept in supermarkets, an association rule such as “Bread and Butter → Milk” could be identified through association mining.
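To make these two measures concrete, the short sketch below computes support and confidence for a candidate rule over an invented list of supermarket transactions.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # Of the transactions that support X, the fraction that also support Y
    return support(x | y, transactions) / support(x, transactions)

x, y = {"bread", "butter"}, {"milk"}
print(support(x | y, transactions))    # 0.5   (2 of 4 transactions)
print(confidence(x, y, transactions))  # 0.667 (2 of the 3 supporting X)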
2.9 MARKET BASKET ANALYSIS

Market-basket analysis refers to methodologies for studying the composition of a shopping basket of products purchased during a single shopping event.
This technique has been widely applied to grocery store operations (as well as other retailing operations, including restaurants). Market basket data in its rawest form would be the transactional list of purchases by customer, indicating only the items purchased together (with their prices). This data is challenging because of a number of characteristics [18]:

 A very large number of records (often millions of transactions per day)
 Sparseness (each market basket contains only a small portion of items carried)
 Heterogeneity (those with different tastes tend to purchase a specific subset of items).
The aim of market-basket analysis is to identify what products tend to
be purchased together. Analyzing transaction-level data can identify
purchase patterns, such as which frozen vegetables and side dishes
are purchased with steak during barbecue season. This information
can be used in determining where to place products in the store, as
well as aid inventory management. Product presentations and staffing
can be more intelligently planned for specific times of day, days of the
week, or holidays. Another commercial application is electronic couponing, tailoring coupon face value and distribution timing using information obtained from market baskets [19].
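The core computation behind market-basket analysis is counting how often pairs of items appear together across baskets. The sketch below finds the most frequent pairs in a small invented transaction list; a real system would prune candidates Apriori-style rather than enumerating every pair.

from collections import Counter
from itertools import combinations

baskets = [
    {"steak", "charcoal", "frozen corn"},
    {"steak", "charcoal", "beer"},
    {"milk", "bread"},
    {"steak", "beer", "charcoal"},
]

pair_counts = Counter()
for basket in baskets:
    # sorted() gives each pair a canonical order so counts aggregate correctly
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)  # ('charcoal', 'steak') co-occurs in 3 of 4 baskets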

2.9.1 Market Basket Analysis Benefits
The ultimate goal of market basket analysis is finding the products
that
customers frequently purchase together. The stores can use this
information by putting these products in close proximity to each other and making them more visible and accessible for customers at the
time of shopping.
These assortments can affect customer behavior and promote the
sales for complement items. The other use of this information is to
decide about the layout of catalogs and put the items with strong
association together in sales catalogs. The advantage of using sales
data for promotions and store layout is that the consumer behavior
information determines the items with associations. This information
may vary based on the area and the assortments of available items in
stores and the point of sale data reflects the real behavior of the group
of customers that frequently shop at the same store. Catalogs that are
designed based on the market basket analysis are expected to be
more effective on consumer behavior and sales promotion.
2.9.2 Strengths of Market Basket Analysis
 It produces clear and understandable results.
 It supports undirected data mining.
 It works on variable-length data.
 The computations it uses are simple to understand.
2.9.3 Weaknesses of Market Basket Analysis
 It requires exponentially more computational effort as the problem size grows.
 It has limited support for attributes on the data.
 It is difficult to determine the right number of items.
 It discounts rare items.

2.10 FUZZY SETS IN DATA MINING
Real-world applications are full of vagueness and uncertainty. Several theories on managing uncertainty and imprecision have been advanced, including fuzzy set theory [20], probability theory [21], rough set theory [22] and set pair theory [23].
Fuzzy set theory is used more than the others because of its simplicity and similarity to human reasoning. Fuzzy modeling provides a very useful tool for dealing with human vagueness in describing scales of value. The advantage of the fuzzy approach in data mining is that it serves as an “… interface between a numerical scale and a symbolic scale which is usually composed of linguistic terms” [24].
Fuzzy association rules described in linguistic terms help users better understand the decisions they face [25]. Fuzzy set theory is being used more and more frequently in intelligent systems. A fuzzy set A in universe U is defined as A = {(x, μA(x)) | x ∈ U, μA(x) ∈ [0, 1]}, where μA(x) is a membership function indicating the degree of membership of x in A. The greater the value of μA(x), the more x belongs to A. Fuzzy sets can also be thought of as an extension of traditional crisp sets and categorical/ordinal scales, in which each element is either in the set or not in the set (a membership function of either 1 or 0).
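As an illustration of the membership function μA(x), the sketch below defines a triangular membership function for a hypothetical linguistic term such as “medium income”; the breakpoints are invented for the example.

def mu_medium_income(x, low=30.0, peak=60.0, high=90.0):
    # Degree of membership in [0, 1]: rises linearly from low to peak,
    # falls linearly from peak to high, and is 0 outside (low, high)
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)

for income in (25, 45, 60, 80):            # incomes in thousands
    print(income, round(mu_medium_income(income), 2))
# 25 -> 0.0, 45 -> 0.5, 60 -> 1.0, 80 -> 0.33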
Fuzzy set theory in its many manifestations (interval-valued fuzzy
sets, vague sets, grey-related analysis, rough set theory, etc.) is highly
appropriate for dealing with the masses of data available.
There are many data mining tools available to cluster data, to help analysts find patterns, and to find association rules. The majority of data mining approaches to classification work with numerical and categorical information.
Most data mining software tools offer some form of fuzzy analysis.
Modifying continuous data is expected to degrade model accuracy, but
might be more robust with respect to human understanding. (Fuzzy
representations might lose accuracy with respect to numbers that
don’t really reflect accuracy of human understanding, but may better
represent the reality humans are trying to express.) Another approach to fuzzify data is to make it categorical. Categorization of data is
expected to yield greater inaccuracy on test data. However, both
treatments are still useful if they better reflect human understanding,
and might even be more accurate on future implementations.
The categorical limits selected are key to accurate model development. Not many data mining techniques take into account ordinal data features.
2.10.1 Fuzzy Association Rules

With the rapid growth of data in enterprise databases, making sense of valuable information becomes more and more difficult. KDD (Knowledge Discovery in Databases) can help to identify effective, coherent, potentially useful and previously unknown patterns in large databases [26]. Data mining plays an important role in the KDD process, applying specific algorithms for extracting desirable knowledge or interesting patterns from existing datasets for specific purposes. Most of the previous studies focused on categorical attributes.
Mining fuzzy association rules for quantitative values has long been considered by a number of researchers, most of whom based their methods on the important Apriori algorithm [27]. Each of these researchers treated all attributes (or all the linguistic terms) as uniform. However, in real-world applications, users perhaps have more interest in rules that contain fashionable items.
Decreasing the minimum support minsup and minimum confidence minconf to get rules containing fashionable items is not ideal, because the efficiency of the algorithm will be reduced and many uninteresting rules will be generated simultaneously [28]. Weighted quantitative association rules mining based on a fuzzy approach has been proposed (by Genesei) using two different definitions of weighted support: with and without normalization [29].
In the non-normalized case, he used the product operator for defining the combined weight and fuzzy value.

The combined weight or fuzzy value is very small and even tends to zero when the number of items in a candidate itemset is large, so the support level is very small; this will result in data overflow and make the algorithm terminate unexpectedly when calculating the confidence value.
2.11 ROUGH SET
Rough set analysis is a mathematical approach that is based on the theory of rough sets first introduced by Pawlak (1982) [22]. The purpose

of rough sets is to discover knowledge in the form of business rules
from imprecise and uncertain data sources. Rough set theory is
based on the notion of indiscernibility and the inability to distinguish
between objects, and provides an approximation of sets or concepts
by means of binary relations, typically constructed from empirical
data.
Such approximations can be said to form models of our target concepts, and hence the typical use of rough sets falls under the bottom-up approach to model construction. The intuition behind this approach is the fact that in real life, when dealing with sets, we often have no means of precisely distinguishing individual set elements from each other due to limited resolution (lack of complete and detailed knowledge) and uncertainty associated with their measurable characteristics.
As an approach to handling imperfect data, rough set analysis
complements other more traditional theories such as probability
theory, evidence theory, and fuzzy set theory.
2.11.1 A Brief Theory of Rough Sets

Statistical data analysis faces limitations in dealing with data with high
levels of uncertainty or with non-monotonic relationships among the
variables.
The original idea behind Pawlak’s rough set theory was “… vagueness inherent to the representation of a decision situation. Vagueness may be caused by granularity of the representation. Due to the granularity, the facts describing a situation are either expressed precisely by means of ‘granules’ of the representation or only approximately” [30]. These vagueness and imprecision problems are present in the
2.11.2 Rough Sets as an Information System

In rough sets, an information system is a representation of data that describes some objects. An information system S is composed of a 4-tuple S = <U, Q, V, f>, where U is the closed universe of N objects {x1, x2, …, xN}, a nonempty finite set; Q is a nonempty finite set of n attributes {q1, q2, …, qn} that uniquely characterize the objects; V = ∪q∈Q Vq, where Vq is the set of values of attribute q; and f : U × Q → V is the total decision function, called the information function, such that f(x, q) ∈ Vq for every q ∈ Q, x ∈ U [31]. (In the example of six stores referred to here, the six stores are the universe U, the first three attributes are Q, their possible values are V, and the profit category is given by f.)
Any pair (q, v) for q ∈ Q, v ∈ Vq is called a descriptor in an information system S. The information system can be represented as a finite data table, in which the columns represent the attributes, the rows represent the objects, and the cells contain the attribute values f(x, q). Thus, each row in the table describes the information about an object in S.
If we let S = <U, Q, V, f> be an information system, A ⊆ Q be a subset of attributes, and x, y ∈ U be objects, then x and y are indiscernible by the set of attributes A in S if and only if f(x, a) = f(y, a) for every a ∈ A. Every subset of attributes A determines an equivalence relation on the universe U, which is referred to as an indiscernibility relation.
For any given subset of attributes A, IND(A) is an equivalence relation on the universe U and is called an indiscernibility relation. It can be defined as IND(A) = {(x, y) ∈ U × U : for all a ∈ A, f(x, a) = f(y, a)}. If the pair of objects (x, y) belongs to the relation IND(A), then objects x and y are called indiscernible with respect to attribute set A. In other words, we cannot distinguish object x from y based on the information contained in the attribute set A.
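A minimal sketch of the indiscernibility relation follows: it partitions a small invented information system into the equivalence classes of IND(A) for a chosen attribute subset A. The attribute names and values are hypothetical.

from collections import defaultdict

# Hypothetical information system: object -> attribute values
objects = {
    "x1": {"size": "small", "colour": "red",  "profit": "high"},
    "x2": {"size": "small", "colour": "red",  "profit": "low"},
    "x3": {"size": "large", "colour": "blue", "profit": "high"},
}

def ind_classes(objects, A):
    # Objects are indiscernible w.r.t. A when they agree on every attribute in A
    classes = defaultdict(set)
    for name, attrs in objects.items():
        classes[tuple(attrs[a] for a in sorted(A))].add(name)
    return list(classes.values())

print(ind_classes(objects, {"size", "colour"}))
# [{'x1', 'x2'}, {'x3'}]: x1 and x2 cannot be distinguished by size and colour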
2.11.3 Some Exemplary Applications of Rough Sets

Most of the successful applications of rough sets are in the field of
medicine, more specifically, in medical diagnosis or prediction of
outcomes.
Rough sets have been applied to analyze a database of patients with duodenal ulcer treated by highly selective vagotomy (HSV) [32]. The goal was to predict the long-term success of the operation, as evaluated by a surgeon into four outcome classes. This successful HSV study is still one of few data analysis studies, regardless of the methodology, that has managed to cross the clinical deployment barrier. There has been a steady stream of rough set applications in medicine. Some more recent applications include analysis of breast cancer [33] and other forms of diagnosis [34], as well as support to triage of abdominal pain [35] and analysis of Medicaid Home Care Waiver programs [36].

In addition to medicine, rough sets have also been applied to a wide range of application areas, including real estate property appraisal [37], predicting bankruptcy [38], and predicting gaming ballot outcomes [39]. Rough sets have been applied to identify better stock trading timing [40], to enhance support vector machine models in manufacturing process document retrieval [41], and to evaluate the safety performance of construction firms [42]. Rough sets have thus been useful in many applications.

Figure 2.3 Process map and the main steps of the rough sets analysis

2.12 SUPPORT VECTOR MACHINES
Support vector machines (SVMs) are supervised learning methods that generate input-output mapping functions from a set of labeled training data [5].
The mapping function can be either a classification function (used to categorize the input data) or a regression function (used for estimation of the desired output). For classification, nonlinear kernel functions are often used to transform the input data (inherently representing highly complex nonlinear relationships) into a high-dimensional feature space in which the input data becomes more separable (i.e., linearly separable) compared to the original input space. Then, maximum-margin hyperplanes are constructed to optimally separate the classes in the training data. Two parallel hyperplanes are constructed on each side of the hyperplane that separates the data, by maximizing the distance between the two parallel hyperplanes [5].

An assumption is made that the larger the margin or distance between these parallel hyperplanes, the better the generalization error of the classifier will be. SVMs belong to a family of generalized linear models which achieve a classification or regression decision based on the value of a linear combination of features. They are also said to belong to the “kernel methods”.
In addition to their solid mathematical foundation in statistical learning theory, SVMs have demonstrated highly competitive performance in numerous real-world applications, such as medical diagnosis, bioinformatics, face recognition, image processing and text mining, which has established SVMs as one of the most popular, state-of-the-art tools for knowledge discovery and data mining.
Similar to artificial neural networks, SVMs possess the well-known
ability of being universal approximators of any multivariate function to
any desired degree of accuracy. Therefore, they are of particular interest for modeling highly nonlinear, complex systems and processes.
 Regression
A version of an SVM for regression has been proposed, called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data that are close (within a threshold ε) to the model prediction [6].
2.12.1 Use of SVM – A Process-Based Approach

Due largely to their better classification results, support vector machines (SVMs) have recently become a popular technique for classification-type problems. Even though people consider them easier to use than artificial neural networks, users who are not familiar with the intricacies of SVMs often get unsatisfactory results. In this section we provide a process-based approach to the use of SVMs which is more likely to produce better results.


 Preprocess the data
 Scrub the data
 Deal with the missing values
 Deal with the presumably incorrect values
 Deal with the noise in the data
 Transform the data
 Numericize the data
 Normalize the data
 Develop the model(s)
 Select kernel type (RBF is a natural choice)
 Determine kernel parameters for the selected kernel type (e.g., C and γ for RBF) – a hard problem. One should consider using cross-validation and experimentation to determine the appropriate values for these parameters.
 If the results are satisfactory, finalize the model; otherwise change the kernel type and/or kernel parameters to achieve the desired accuracy level.
 Extract and deploy the model.
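As a minimal illustration of this process, the sketch below uses scikit-learn (assumed to be available) to normalize the data, select an RBF kernel, and tune C and γ by cross-validated grid search; the built-in dataset is a stand-in for whatever data the analyst actually has.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Normalize, then fit an RBF-kernel SVM; tune C and gamma by 5-fold
# cross-validation, as recommended above.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))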


Figure 2.4

2.12.2 Support Vector Machines versus Artificial Neural Networks
The development of ANNs followed a heuristic path, with applications
and extensive experimentation preceding theory. In contrast, the
development of SVMs involved sound theory first, then implementation
and experiments.
A significant advantage of SVMs is that while ANNs can suffer from multiple local minima, the solution to an SVM is global and unique. Two more advantages of SVMs are that they have a simple geometric interpretation and give a sparse solution. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space. ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. One reason that SVMs often outperform ANNs in practice is that they deal with the biggest problem of ANNs: SVMs are less prone to overfitting.
 They differ radically from comparable approaches such as neural
networks: SVM training always finds a global minimum, and their
simple geometric interpretation provides fertile ground for
further investigation.
 Most often Gaussian kernels are used, in which case the resulting SVM corresponds to an RBF network with Gaussian radial basis functions. As the SVM approach “automatically” solves the network complexity problem, the size of the hidden layer is obtained as the result of the QP procedure. Hidden neurons and support vectors correspond to each other, so the center problem of the RBF network is also solved, as the support vectors serve as the basis function centers.


 In problems where linear decision hyperplanes are no longer feasible, an input space is mapped into a feature space (the hidden layer in NN models), resulting in a nonlinear classifier. After the learning stage, SVMs create the same type of decision hypersurfaces as some well-developed and popular NN classifiers. Note that the training of these diverse models is different; however, after the successful learning stage, the resulting decision surfaces are identical.
 Unlike conventional statistical and neural network methods, the SVM approach does not attempt to control model complexity by keeping the number of features small.
 Classical learning systems like neural networks suffer from theoretical weaknesses, e.g. back-propagation usually converges only to locally optimal solutions. Here SVMs can provide a significant improvement.
 In contrast to neural networks, SVMs automatically select their model size (by selecting the support vectors).
 The absence of local minima from the above algorithms marks a major departure from traditional systems such as neural networks.
 While the weight decay term is an important aspect for obtaining good generalization in the context of neural networks for regression, the margin plays a somewhat similar role in classification problems.
 In comparison with traditional multilayer perceptron neural networks that suffer from the existence of multiple locally minimal solutions, convexity is an important and interesting property of nonlinear SVM classifiers.
 SVMs have been developed in the reverse order to the development of neural networks (NNs). SVMs evolved from sound theory to implementation and experiments, while NNs followed a more heuristic path, from applications and extensive experimentation to theory.
2.12.3 Disadvantages of Support Vector Machines

Besides the advantages of SVMs, from a practical point of view they have some limitations. An important practical question that is not entirely solved is the selection of the kernel function parameters – for Gaussian kernels the width parameter (σ) – and the value of ε in the ε-insensitive loss function.
A second limitation is speed and size, both in training and testing; SVMs involve complex and time-demanding calculations. From a practical point of view, perhaps the most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks. Shi et al. have conducted comparative testing of SVMs with other algorithms on real credit card data. Processing of discrete data presents another problem.
Despite these limitations, because SVMs are based on a sound theoretical foundation and the solutions they produce are global and unique in nature (as opposed to getting stuck in local minima), they are nowadays among the most popular prediction modeling techniques in the data mining arena. Their use and popularity will only increase as popular commercial data mining tools start to incorporate them into their modeling arsenal [43].

2.13 PERFORMANCE EVALUATION FOR PREDICTIVE MODELING
Once a predictive model is developed using historical data, one would be curious as to how the model will perform on future data (data that it has not seen during the model building process). One might even try multiple model types for the same prediction problem, and then would like to know which model to use for real-world decision making, simply by comparing them on their prediction performance (e.g., accuracy). But how do you measure the performance of a predictor? What are the commonly used performance metrics? What is accuracy? How can we accurately estimate the performance measures? Are there methodologies that are better at doing so in an unbiased manner? These questions are answered in the following sub-sections. First, the most commonly used performance metrics are described; then a wide range of estimation methodologies are explained and compared to each other [5].

2.13.1 Performance Metrics for Predictive Modeling
In classification problems, the primary source of performance measurements is a coincidence matrix (a.k.a. classification matrix or contingency table).
The numbers along the diagonal from upper-left to lower-right
represent the correct decisions made, and the numbers outside this
diagonal represent the errors. The true positive rate (also called hit
rate or recall) of a classifier is estimated by dividing the correctly
classified positives (the true positive count) by the total positive count.
The false positive rate (also called false alarm rate) of the classifier is estimated by dividing the incorrectly classified negatives (the false positive count) by the total negative count.


The overall accuracy of a classifier is estimated by dividing the total
correctly classified positives and negatives by the total number of
samples.
Other performance measures, such as recall (a.k.a. sensitivity), specificity and the F-measure, are also used for calculating other aggregated performance measures (e.g., the area under the ROC curve).
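The sketch below computes these quantities from a small invented coincidence matrix, so the definitions above can be checked by hand.

# Toy coincidence-matrix counts (invented for illustration)
tp, fn = 40, 10   # actual positives: correctly / incorrectly classified
fp, tn = 5, 45    # actual negatives: incorrectly / correctly classified

true_positive_rate = tp / (tp + fn)    # hit rate / recall / sensitivity
false_positive_rate = fp / (fp + tn)   # false alarm rate
specificity = tn / (fp + tn)
accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
f_measure = 2 * precision * true_positive_rate / (precision + true_positive_rate)

print(true_positive_rate, false_positive_rate, accuracy)  # 0.8 0.1 0.85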

2.13.2 Estimation Methodology for Classification Models
Estimating the accuracy of a classifier induced by supervised learning algorithms is important for the following reasons. First, it can be used to estimate the classifier’s future prediction accuracy, which implies the level of confidence one should have in the classifier’s output in the prediction system. Second, it can be used for choosing a classifier from a given set (selecting the “best” model from two or more candidate models). Lastly, it can be used to assign confidence levels to multiple classifiers so that the outcome of a combined classifier can be optimized. Combined classifiers are increasingly becoming more popular due to empirical results that suggest they produce more robust and more accurate predictions compared to individual predictors. For estimating the final accuracy of a classifier one would like an estimation method with low bias and low variance. In some application domains, to choose a classifier or to combine classifiers the absolute accuracies may be less important, and one might be willing to trade off bias for low variance.
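A widely used estimation methodology with reasonably low variance is k-fold cross-validation. The from-scratch sketch below shows the idea for any scikit-learn-style classifier; it is one plausible formulation, not the only one.

import numpy as np

def k_fold_accuracy(model_factory, X, y, k=5, seed=0):
    # Shuffle, split into k folds, train on k-1 folds and test on the
    # held-out fold; report the mean and spread of the k accuracy estimates.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores), np.std(scores)  # low std suggests low variance

# Usage with any scikit-learn-style classifier, e.g.:
#   from sklearn.tree import DecisionTreeClassifier
#   mean_acc, std_acc = k_fold_accuracy(DecisionTreeClassifier, X, y, k=10)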
2.14 DECISION TREES
Decision trees are powerful and popular tools for classification and
prediction. The attractiveness of tree-based methods is due in large part to the fact that, in contrast to neural networks, decision trees represent rules. Rules can readily be expressed in English so that humans can understand them, or in a database access language like SQL so that records falling into a particular category may be retrieved.
There is a variety of algorithms for building decision trees which share the desired trait of explicability. Two of the most popular go by the acronyms CART and CHAID, which stand for Classification and Regression Trees and Chi-squared Automatic Interaction Detection respectively. A newer algorithm, C4.5, is gaining popularity and is now available in several software packages [5].
2.14.1 Strengths of Decision-Tree Methods

The strengths of decision-tree methods are:
 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much computation.
 Decision trees are able to handle both continuous and categorical variables.
 Decision trees provide a clear indication of which fields are most important for prediction or classification.
Figure 2.5 A beverage prediction tree (branching on “Watch the game?”, “Home team wins?” and “Out with friends?” to predict beer, diet soda or milk)
2.14.2 Weaknesses of Decision-Tree Methods

Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous variable such as income, blood pressure or interest rate. Decision trees are also problematic for time-series data unless a lot of effort is put into presenting the data in such a way that trends and sequential patterns are made visible.
2.14.3 Application of Decision-Tree Methods

Decision-tree methods are a good choice when the data mining task is classification of records or prediction of outcomes. Use decision trees when your goal is to assign each record to one of a few broad categories. Decision trees are also a natural choice when your goal is to generate rules that can be easily understood, explained, and translated into SQL or a natural language.
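The sketch below fits a shallow scikit-learn decision tree to a built-in dataset and prints the induced rules in readable text form; export_text is that library's rule-extraction helper, and the dataset is only a stand-in.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# A shallow tree keeps the categories broad and the rules understandable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each root-to-leaf path prints as an if-then rule a human can read
print(export_text(tree, feature_names=list(iris.feature_names)))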

2.15 GENETIC ALGORITHM
Genetic algorithms, first introduced by Holland (1975) [44], have been applied to a variety of problems and offer intriguing possibilities for general-purpose adaptive search algorithms in artificial intelligence, especially, but not necessarily, for situations where it is difficult or impossible to precisely model the external circumstances faced by the program. Search based on evolutionary models had, of course, been tried before Holland. However, these models were based on mutation and natural selection and were not notably successful. The principal difference of Holland’s approach was the incorporation of a ‘crossover’ operator to mimic the effect of sexual reproduction.
Figure 2.6 below illustrates the basic idea of a GA.


Figure 2.6 Generic Model for Genetic Algorithm
Genetic algorithms are mathematical procedures utilizing the process
of genetic inheritance. They have been usefully applied to a wide
variety of analytic problems. Data mining can combine human
understanding with automatic analysis of data to detect patterns or
key relationships. Given a large database defined over a number of
variables, the goal is to efficiently find the most interesting patterns in
the database. Genetic algorithms have been applied to identify
interesting patterns in some applications.
They are usually used in data mining to improve the performance of other algorithms, one example being decision tree algorithms, another association rules.
Genetic algorithms require a certain data structure. They operate on a population with characteristics expressed in categorical form. The analogy with genetics is that the population (genes) consists of characteristics (alleles). One way to implement genetic algorithms is to apply operators (reproduction, crossover, selection) with the feature of mutation to enhance generation of potentially better combinations. The genetic algorithm process is thus:
1. Randomly select parents.
2. Reproduce through crossover. Reproduction is the operator
choosing which individual entities will survive. In other words, some objective function or selection characteristic is needed to determine
survival.
Crossover relates to changes in future generations of entities.
3. Select survivors for the next generation through a fitness function.
4. Mutation is the operation by which randomly selected attributes of
randomly selected entities in subsequent operations are changed.
5. Iterate until either a given fitness level is attained, or the preset
number of iterations is reached.
Genetic algorithm parameters include population size, crossover rate (the probability that individuals will cross over), and the mutation rate (the probability that a certain entity mutates) [45].
2.15.1 Genetic Algorithm Advantages
Genetic algorithms are very easy to develop and to validate, which makes
them highly attractive if they apply. The algorithm is parallel, meaning that it can be applied to large populations efficiently. The algorithm is also efficient in that if it begins with a poor original solution, it can rapidly progress to good solutions. Use of mutation makes the method capable of identifying global optima even in very nonlinear problem domains. The method does not require knowledge about the distribution of the data.
2.15.2 Genetic Algorithm Disadvantages
Genetic algorithms require mapping data sets to a form where attributes have
discrete values for the genetic algorithm to work with. This is usually
possible, but can lose a great deal of detail information when dealing
with continuous variables. Coding the data into categorical form can
unintentionally lead to biases in the data.
There are also limits to the size of data set that can be analyzed with
genetic algorithms. For very large data sets, sampling will be
necessary, which leads to different results across different runs over
the same data set.
2.15.3 GA Operators

 Selection
This is the procedure for choosing individuals (parents) on which to
perform crossover in order to create new solutions. The idea is that
the ‘fitter’ individuals are more prominent in the selection process,
with the hope that the offspring they create will be even fitter still.
Two commonly used procedures are ‘roulette wheel’ and ‘tournament’ selection. In roulette wheel selection, each individual is assigned a slice of a wheel, the size of the slice being proportional to the fitness of the individual. The wheel is then spun and the individual opposite the marker becomes one of the parents. In tournament selection several individuals are chosen at random and the fittest becomes one of the parents.
 Crossover
Along with mutation, crossover is the operator that creates new
candidate solutions. A position is randomly chosen on the string and
the two parents are ‘crossed over’ at this point to create two new
solutions. Multiple point crossover is where this occurs at several
points along the string. A crossover probability (Pc) is often given
which enables a chance that the parents descend into the next
generation unchanged.
 Mutation
After crossover, each bit of the string has the potential to mutate,
based on a mutation probability (Pm). In binary encoding mutation
involves the flipping of a bit from 0 to 1 or vice versa.
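A compact sketch of these operators follows: tournament selection, single-point crossover applied with probability Pc, and bit-flip mutation applied with probability Pm. The fitness function (the count of 1-bits) is an invented toy objective.

import random

random.seed(0)
BITS, POP, GENERATIONS, PC, PM = 20, 30, 40, 0.9, 0.02

def fitness(ind):
    return sum(ind)  # toy objective: count of 1-bits

def tournament(pop, k=3):
    # Pick k individuals at random; the fittest becomes a parent
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):
    if random.random() < PC:               # with probability Pc...
        point = random.randrange(1, BITS)  # ...cross at a random point
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]                      # otherwise parents pass through

def mutate(ind):
    # Flip each bit with probability Pm
    return [bit ^ 1 if random.random() < PM else bit for bit in ind]

population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(GENERATIONS):
    next_gen = []
    while len(next_gen) < POP:
        c1, c2 = crossover(tournament(population), tournament(population))
        next_gen += [mutate(c1), mutate(c2)]
    population = next_gen[:POP]

print(max(map(fitness, population)))  # approaches BITS as the GA converges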
2.15.4 Application of Genetic Algorithms in Data Mining
Genetic algorithms have been applied to data mining in two ways. External support is through evaluation or optimization of some parameter for another learning system, often in hybrid systems using other data mining tools such as clustering or decision trees. In this sense, genetic algorithms help other data mining tools operate more efficiently. Genetic algorithms can also be directly applied to analysis, where the genetic algorithm generates descriptions, usually as decision rules or decision trees.
Many applications of genetic algorithms within data mining have
been applied outside of business.
Specific examples include medical data mining and computer network
intrusion detection. In business, genetic algorithms have been applied
to customer segmentation, credit scoring, and financial security
selection.
Genetic algorithms can be very useful within a data mining analysis dealing with many attributes and many more observations. They save the brute-force checking of all combinations of variable values, which can make some data mining algorithms more effective. However, application of genetic algorithms requires expression of the data as discrete outcomes, with a calculable functional value upon which to base selection.
This does not fit all data mining applications. Genetic algorithms are useful because sometimes it does fit. We review an application to demonstrate some of the aspects of genetic algorithms.
2.16 NEURAL NETWORKS
Neural computation is introduced as an intelligent system relating processing parameters to process responses. Such a system is based on an artificial neural network (ANN) [46], which is an interconnected structure of processing elements called neurons. The ANN structure consists of the input pattern representing the processing parameters, the output pattern, and the hidden layers describing implicitly the correlations between the processing parameters and the output characteristics. The connection between a pair of neurons is described by a number called a weight, translating the strength of the connection [47].

Figure 2.7 Structure of a neural cell in human brain
There are three steps required to optimize the ANN structure: training, validation and testing. There are several types of neural network architectures, but in this study we will focus on the multilayer perceptron (MLP) and the back-propagation net.
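To make the weight-and-neuron picture concrete, here is a minimal forward pass through a one-hidden-layer MLP with invented weights; training by back propagation, which adjusts these weights, is not shown.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])            # input pattern (processing parameters)

W_hidden = np.array([[0.2, -0.4, 0.1],    # 2 hidden neurons x 3 inputs
                     [0.7,  0.3, -0.6]])  # each entry is a connection weight
b_hidden = np.array([0.1, -0.2])

W_out = np.array([[0.5, -0.8]])           # 1 output neuron x 2 hidden neurons
b_out = np.array([0.05])

hidden = sigmoid(W_hidden @ x + b_hidden)  # hidden-layer activations
output = sigmoid(W_out @ hidden + b_out)   # network response
print(output)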
2.16.1 BIOLOGICAL BACKGROUND

 Structural and Functional Organization of the Brain
The inspiration for the development of ANNs lies in the organization and functionality of the (human) brain. The brain is organized in different structural levels, which correspond to small-scale and large-scale anatomical and functional organizations. Different functions take place at different organization levels. The hierarchy of these levels is shown in Fig. 2.8, from the lowest (bottom) to the highest (top). Thus, the lowest (basic) level of brain structural organization is the molecular level and the highest is the central nervous system [48].

The synapses are the neuronal interconnections, and their function depends on specific molecules and ions. The next level is the neural microcircuit, which is an assembly of synaptic connections organized to produce a specific functional operation. The neural microcircuits are grouped to form dendritic subunits that are parts of the dendritic trees of individual neurons. It is believed that neurons are the simplest computing units in the brain, the simplest elements that can perform computational tasks. At the next hierarchical and complexity level we have local neural circuits (neural networks), which are constructed from the same type of neurons, and are able to perform operations characteristic of a localized region of the brain [49].

Figure 2.8 Schematic structural organization of the brain (from highest to lowest level: central nervous system, interregional circuits, local circuits, neurons, dendritic trees, neural microcircuits, synapses, molecules).
At a higher level these neural circuits are organized into interregional circuits that involve multiple regional neural networks located in different parts of the brain, through specific pathways, columns and topographic maps. These structures are organized to respond to incoming sensory information. Neurophysiological experiments have shown clearly that different sensory inputs (motor, somatosensory, visual, auditory, etc.) are mapped onto specialized corresponding areas of the cerebral cortex. At the ultimate level of complexity and hierarchy, the interregional circuits mediate specific types of behaviour in the central nervous system.

2.16.2 The Neuron

The key word for understanding the brain’s structural organization and function is the neuron. The idea of the neuron was introduced by Ramon y Cajal in 1911 and refers to the fundamental logical units of which the whole central nervous system consists. It is indicative that the neuron lies somewhere in the middle of the structural organization of the brain shown in Fig. 2.8. A neuron is a nerve cell with all of its processes. Neurons are one of the main distinguishing features of animals (plants do not have nerve cells). Neurons come in a wide variety of shapes, sizes and functionality in different parts of the brain. The number of different classes of neurons that have been identified in humans lies between seven and a hundred (the observed wide variety in that estimation is related to how restrictively a class of neurons is defined) [49]. A simple representation of a neuron is shown in Fig. 2.9.

Fig 2.9 Schematic representation of a typical neuron
As shown in Fig. 2.9, the neuron typically consists of three main parts: the dendrites (or dendritic tree) with the synapses (or synaptic connections or synaptic terminals), the neuron cell body, and the axon. Typically the neuron can be in one of two states: the resting state, where no electrical signal is generated, and the firing state, where the neuron depolarises and an electrical signal is generated (that is the output of the neuron) [48].

The neuron receives inputs from other neurons that are connected to it, via synaptic connections that are mainly positioned on the dendrites. The incoming signals (which are in the form of positive or negative electrical potentials) are summed in the neuron's cell body (also called the soma) and, if the obtained sum exceeds a certain amount, referred to as the activation threshold, the neuron depolarises and an electrical pulse is generated. This pulse is commonly known as an action potential or spike.

Originating at or close to the cell body of the neuron, the action potential propagates through the axon of the neuron at constant velocity and amplitude to the synaptic terminals. Through these synaptic terminals the electrical signals generated at one neuron are transmitted to the neurons it is interconnected to.

Typically, neural events happen in the millisecond (10^-3 sec) range, whereas in silicon chips the corresponding time range is of the order of nanoseconds (10^-9 sec). Thus, biological neurons are five to six orders of magnitude slower than silicon chips.
2.16.3  Dendrites and Synapses

The dendrites, the receptive zones of the neuron, have an irregular surface and a great number of branches. As shown at the top right of Fig. 2.9, dendritic spines and synaptic inputs can be observed on a dendrite. These synaptic inputs are the points at which a neuron is connected to other neurons and receives input signals from them. Thus synapses are the elementary functional and structural units that mediate the interactions between neurons. Between one and ten thousand incoming synapses is typical for cortical neurons. With respect to the nature of the signal transferred through a synapse there are two kinds of synaptic connections, the chemical synapse and the electrical synapse, the former being the most common [49].

In the case of the chemical synapse there is no actual contact between the presynaptic and the postsynaptic neuron. A synaptic gap (synaptic cleft) occurs instead, and the chemical synapse operates as follows: when an electrical signal arrives from the presynaptic neuron at the synapse, a process at the presynaptic neuron liberates a number of molecules of a chemical substance called a neurotransmitter. These neurotransmitter molecules diffuse across the synaptic gap and are captured in specialized regions of the dendrites of the postsynaptic neuron by molecules called neuroreceptors, and generate electrical signals in the postsynaptic region. Thus, a chemical synapse converts electrical signals generated in the presynaptic neuron into chemical signals that travel through the synaptic gap, and then back into postsynaptic electrical signals.

It is obvious that this kind of synaptic transmission is unidirectional and nonreciprocal, i.e., chemical synapses carry signals from a neuron that always plays the role of the presynaptic unit to another neuron that always plays the role of the postsynaptic unit. This is the main difference between chemical and electrical synapses [49].

When two neurons are interconnected via an electrical synapse, an electrical signal can be transmitted from the neuron with the higher voltage to the one with the lower voltage; thus signal transmission can be bi-directional in electrical synapses. This characteristic of electrical synapses means that there is no fixed presynaptic and postsynaptic neuron in this kind of synaptic connection, and these roles can be interchanged depending on the electrical conditions in each of the interconnected neurons.

Apart from distinguishing synapses as chemical or electrical according to the nature of the transmitted signal, we also classify synapses with respect to the kind of activation produced in the postsynaptic neuron, into two main categories: excitatory synapses and inhibitory synapses. In the first case, that of the excitatory synapses, the electrical potential transmitted to the postsynaptic neuron is positive and has an excitation effect. In the second case, that of the inhibitory synapse, the postsynaptic potential is negative and imposes inhibition on the postsynaptic neuron.

A key point in synaptic transmission is that the signals are weighted. That is, some postsynaptic potentials are stronger than others.
2.16.4  Neuron Cell Body

The neuron cell body (or soma) has a triangular-like form and contains the nucleus of the cell. As shown in Fig. 2.9, the dendrites lead into the neuron cell body, carrying the incoming inputs (electrical signals generated by the postsynaptic potentials). These electrical signals affect the membrane potential of the cell body of the neuron. Typically, in the resting state the membrane potential of a neuron is approximately –70 mV. If the incoming postsynaptic potential is positive (excitatory), the membrane potential is increased, moving closer to the firing state. If the incoming postsynaptic potential is negative (inhibitory), the membrane potential is decreased, moving away from the firing state [49].

All the incoming postsynaptic potentials are summed in both time and space (temporal and spatial summation). If the resulting sum is equal to or greater than the firing threshold of the neuron, and the membrane potential exceeds a certain value (typically –60 mV), then the neuron depolarises (fires) and an action potential is generated and propagated through the axon of the neuron to the synaptic terminals.

After firing, the neuron returns to the resting state and the membrane potential to the appropriate resting value. This does not happen instantaneously, but takes a little time, which is called the refractory period of the neuron. When the refractory period has passed, the neuron is ready to fire again if it receives the appropriate input.
2.16.5  The Axon

In cortical neurons, the axon is very long and thin and is characterized by high electrical resistance and very large capacitance. The neural axon is the main transmission line of the neuron, propagating the action potential. The axon has a smoother surface than the dendrites and carries the characteristic nodes of Ranvier (not shown in Fig. 2.9) that help the propagation of the action potential along the axon. The axon terminates at the synaptic terminals that establish the interconnection of the neuron to other neurons [49].

2.16.6  The Neuron Model

To build up an ANN we need to model the biological neuron, the elementary computing unit in the brain that is capable of performing information-processing operations. The simplest model of a neuron is shown in Fig. 2.10. Neurons, also referred to as processing elements (PE), nodes, short-term memory devices, or threshold logic units, are the ANN components where most, if not all, of the computing is done. The generic model of the neuron shown in Fig. 2.10 constitutes the basis for designing and implementing ANNs. As indicated in Fig. 2.10, there are three basic elements of the neuronal model: a set of synapses or synaptic (connecting) links, an adder (logical unit) and an activation function (threshold function).

The synapses, or connecting links, carry the input signals to the neuron, coming either from the environment or from the outputs of other neurons. Each synapse is characterized by a weight, or strength, of its own, which affects the impact of the specific input.
Therefore, the incoming signals to a neuron are weighted, i.e., multiplied by the appropriate value of the synaptic weight. To be more specific, a signal xj at the input of synapse j of the kth neuron is multiplied by the synaptic weight wkj. In the notation followed here, the first subscript refers to the neuron in question and the second subscript refers to the input to which the weight refers.

Figure 2.10  Model of a neuron.

In general, and in accordance with the biological picture, there are two primary types of synaptic connections: the excitatory and the inhibitory ones. The excitatory connections increase the neuron's activation and are typically represented by positive signals. On the other hand, inhibitory connections decrease the neuron's activation and are typically represented by negative signals. The two types of connections are implemented using positive and negative values for the corresponding synaptic weights [49].

One of the most important features of the model neuron presented here, as of biological neurons, is that the values of the synaptic weights are subject to alteration and modification in response to various inputs and in accordance with the network's own rules for modification. This feature, technically called synaptic modification, is of great importance since it is closely related to the ability of the ANN to adapt and learn.

Sometimes there is an additional parameter bk associated with the inputs. The role of this additional parameter depends on the type of the activation function. Typically it is considered to be an internal bias, which can also be weighted. In a somewhat different approach this parameter is a threshold value (denoted by θk for the kth neuron) that must be exceeded for there to be any neuronal activation. In general it is a parameter that has the effect of either increasing or decreasing the neuron's input υk to the activation function, according to whether its value is positive or negative respectively [47].

The second basic element of the model neuron is the adder. This element is responsible for summing the input signals to the neuron that are transmitted through, and weighted by, the synapses of the neuron. The described operations constitute a linear combiner. As mentioned above, the total result of the summation of the incoming weighted signals and the addition of the bias bk, or subtraction of the threshold θk, is denoted by υk.
The third basic element of the model neuron is the activation function, which is also referred to as the squashing function or signal function. The role of the activation function is to squash (limit) the output signal of the neuron to a certain (finite) range. Thus, the activation function maps a (possibly infinite) domain (the input) to a pre-specified range (the output). A great number of mathematical functions would be suitable for the role of the activation function of a neuron. However, four families of functions are the most common and widely used: the step, the linear, the ramp and the sigmoid functions [50].

The step (or threshold) function is described by the following equation:

φ(υ) = γ if υ ≥ 0;  δ if υ < 0   (Eq. 2.1)

Thus, the step function of (Eq. 2.1) returns the value γ if its argument is a nonnegative number, and the value δ if its argument is a negative number. A special case of the step function is obtained for γ = 1 and δ = 0. In that case (Eq. 2.1) is transformed into (Eq. 2.2):

φ(υ) = 1 if υ ≥ 0;  0 if υ < 0   (Eq. 2.2)

This special case of the step function is commonly referred to as the Heaviside function. A plot of the Heaviside function is shown in Fig. 2.11a. A neuron that incorporates the Heaviside function as its activation function is usually referred to as the McCulloch-Pitts neuron model, in recognition of the pioneering work done by McCulloch and Pitts back in 1943. According to that neuron model, the output of a neuron turns to the firing state, generating an output signal equal to 1, if the total input to the neuron is non-negative. Otherwise, in the case that the total input is negative, the neuron remains in the resting state, generating no signal (zero output). This characteristic behaviour is referred to by the special term all-or-none property of the McCulloch-Pitts model [47]. The all-or-none property is in accordance with the behaviour of biological neurons, where the total postsynaptic potential (input) must exceed a certain internal threshold value in order for the neuron to fire and generate an action potential. If that threshold value is not exceeded, the neuron remains in the resting state and generates no output.
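
To make the all-or-none behaviour concrete, the following minimal Python sketch (the weights, inputs and threshold are invented for illustration) implements a McCulloch-Pitts unit with the Heaviside activation of (Eq. 2.2):

    def heaviside(v):
        # Heaviside step of (Eq. 2.2): 1 if v >= 0, else 0
        return 1 if v >= 0 else 0

    def mcculloch_pitts(inputs, weights, threshold):
        # All-or-none unit: fires (returns 1) only when the weighted sum
        # of its inputs reaches the internal threshold
        total = sum(x * w for x, w in zip(inputs, weights))
        return heaviside(total - threshold)

    # Two excitatory inputs (positive weights) and one inhibitory input
    print(mcculloch_pitts([1, 1, 0], [0.6, 0.6, -1.0], threshold=1.0))  # fires: 1
    print(mcculloch_pitts([1, 1, 1], [0.6, 0.6, -1.0], threshold=1.0))  # inhibited: 0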
The next family of activation functions is the linear function, described in its general form by the equation:

φ(υ) = αυ   (Eq. 2.3)

The parameter α is a real-valued constant that regulates the magnification of the neuron activity υ. Despite its simple form, the linear function is rather inappropriate for the role of the activation function of a neuron, since it is not bounded (considering that the input parameter υ is not bounded either).

The third family of commonly used activation functions is the ramp function, also referred to as the piece-wise linear function. The ramp function is a linear function that is bounded to the range [-γ, +γ] and in its general form is described by the equation:

φ(υ) = γ if υ ≥ γ;  υ if -γ < υ < γ;  -γ if υ ≤ -γ   (Eq. 2.4)

In the above equation γ and -γ correspond to the maximum and the minimum output values respectively, i.e., the upper and lower bounds of the mapping. The piece-wise linear functions of (Eq. 2.4) are often used to represent a simplified nonlinear operation and can be viewed as an approximation to a nonlinear amplifier. Depending on the value of the input parameter υ, the ramp function operates as a linear function without running into saturation if υ is in the linear region; otherwise the function returns the upper or lower saturation value. In the special case of γ = 1/2, with 1 and 0 as upper and lower bounds respectively, (Eq. 2.4) takes the form:

φ(υ) = 1 if υ ≥ 1/2;  υ + 1/2 if -1/2 < υ < 1/2;  0 if υ ≤ -1/2   (Eq. 2.5)

A graphical representation of the ramp function described in (Eq. 2.5) is shown in Fig. 2.11b. As shown in Fig. 2.11b, this special form of the ramp function exhibits a linear part in the range -1/2 < υ < 1/2 and saturates to the upper or the lower bound if υ exceeds that range.
The fourth and final family of activation functions is the sigmoid functions. The family of sigmoid functions is by far the most pervasive type of activation function and is the most commonly used in the implementation of an ANN. That is because the sigmoid functions incorporate a number of properties that are highly desirable in the construction of a neuron. There are several types of sigmoid functions. A common type is the logistic function, described by the following equation:

φ(υ) = 1 / (1 + exp(-αυ))   (Eq. 2.6)

The parameter α is the slope parameter of the sigmoid function. Graphical representations of the logistic sigmoid function for different values of the slope parameter α are shown in Fig. 2.11c. The shape of the obtained curves reveals why the sigmoid functions have been given that name: the s-shape of their graphs. As is easily recognised in Fig. 2.11c, the logistic sigmoid function is a bounded, monotonic, non-decreasing function that provides a graded, nonlinear response. Thus, the logistic function balances between linear and nonlinear behaviour.

The upper and lower bounds (saturation values) of this function are 1 and 0 respectively. Another feature of the logistic function, partially revealed in Fig. 2.11c, is the role of the slope parameter α. The greater the value of that parameter, the steeper the increase of the logistic function. In the limit where the slope parameter approaches infinity, the logistic function turns simply into a step (Heaviside) function. However, for values of the slope parameter in the normal range, the logistic function is a continuous and differentiable function that returns a continuous range of values from 0 to 1 (graded response); by contrast, the Heaviside function is not differentiable.
A second sigmoid-type function that ranges in the interval [0,1] is the augmented ratio of squares function, defined as:

φ(υ) = υ² / (1 + υ²) if υ > 0;  0 if υ ≤ 0   (Eq. 2.7)

What is common to the activation functions described by (Eqs. 2.2, 2.5, 2.6 and 2.7) is that they return an output in the range from 0 to 1. However, sometimes it is desirable to have an activation function in the range from -1 to 1. In that case we have to give a different definition of the threshold function of (Eqs. 2.1 and 2.2). The new form of the threshold function is described by the following equation:

φ(υ) = 1 if υ > 0;  0 if υ = 0;  -1 if υ < 0   (Eq. 2.8)

Fig 2.11.  Three common types of activation functions. (a) Threshold (Heaviside) function. (b) Piecewise-linear (ramp) function. (c) Sigmoid function for varying slope parameter α.

The above equation is commonly referred to as the signum function since it returns the sign of the parameter υ, or 0 if υ is neither positive nor negative.
Similarly, other types of sigmoid functions must be provided for the case where the output range from -1 to 1 is the desirable one, instead of the range from 0 to 1 returned by the logistic sigmoid function of (Eq. 2.6). In that case, among others, two reasonable candidates exist. The first one is a hyperbolic trigonometric function, the hyperbolic tangent function, which is defined as:

φ(υ) = tanh(υ)   (Eq. 2.9)

The second one is defined by the formula of (Eq. 2.10). Both of these functions have saturation levels at -1 (lower) and 1 (upper), and therefore range in [-1,1].
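
For concreteness, the activation families above can be implemented directly. The short Python sketch below follows (Eqs. 2.2, 2.5, 2.6, 2.8 and 2.9) as written above; the slope values used in the demonstration are arbitrary choices:

    import math

    def heaviside(v):                 # (Eq. 2.2): step, range {0, 1}
        return 1.0 if v >= 0 else 0.0

    def ramp(v):                      # (Eq. 2.5): linear in (-1/2, 1/2), saturating at 0 and 1
        if v >= 0.5:
            return 1.0
        if v <= -0.5:
            return 0.0
        return v + 0.5

    def logistic(v, alpha=1.0):       # (Eq. 2.6): s-shaped, range (0, 1), slope alpha
        return 1.0 / (1.0 + math.exp(-alpha * v))

    def signum(v):                    # (Eq. 2.8): range {-1, 0, +1}
        return (v > 0) - (v < 0)

    def tanh_act(v):                  # (Eq. 2.9): range (-1, +1)
        return math.tanh(v)

    # A steeper slope pushes the logistic function toward the Heaviside step:
    for alpha in (1, 5, 50):
        print(alpha, round(logistic(0.1, alpha), 4))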
The description of the neural dynamics in mathematical terms follows. According to the notation introduced above, assume that the kth neuron receives m synaptic connections, υk is the total sum of the incoming weighted input signals, xj is the signal arriving via the jth synaptic connection, wkj is the corresponding synaptic weight of that connection, the threshold is θk and the bias is bk. In the case where the adder sums the total incoming weighted signals and subtracts the threshold θk, the obtained result υk is given by the formula:

υk = Σ (j = 1..m) wkj xj − θk   (Eq. 2.11)

With the bias bk used in place of the threshold, the sum becomes

υk = Σ (j = 1..m) wkj xj + bk   (Eq. 2.12)

which is written compactly as

υk = Σ (j = 0..m) wkj xj   (Eq. 2.13)

In (Eq. 2.13), the bias bk is included in the sum as the product wk0 x0, where x0 = 1 and wk0 = bk.

Finally, let yk be the output signal of the kth neuron that receives a total incoming signal υk. The output of the neuron is given by the formula:

yk = φ(υk)   (Eq. 2.14)

In the above equation, φ(υ) is the activation function, which should be one of those described in Eqs. 2.1 – 2.10.
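
Putting (Eq. 2.13) and (Eq. 2.14) together, a complete model neuron is simply a weighted sum followed by an activation function. The sketch below illustrates this; the weights, bias and choice of logistic activation are arbitrary example values:

    import math

    def neuron_output(x, w, b, phi):
        # y_k = phi(v_k), with v_k = sum_j w_kj * x_j + b_k  (Eqs. 2.13 and 2.14)
        v = sum(wj * xj for wj, xj in zip(w, x)) + b
        return phi(v)

    logistic = lambda v: 1.0 / (1.0 + math.exp(-v))

    x = [0.5, 1.0, 0.8]          # input signals
    w = [0.4, 0.7, -0.6]         # two excitatory weights, one inhibitory
    b = -0.2                     # bias b_k, equivalent to w_k0 * x_0 with x_0 = 1
    print(neuron_output(x, w, b, logistic))  # a graded output in (0, 1)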
The neuron-like processing element presented here models approximately three of the processes we know real neurons perform; as far as we know, there are at least 150 processes performed by the neurons in the human brain. Despite the obvious poverty of the model neuron, it handles several basic functions. Namely, the model neuron is capable of receiving and evaluating the input signals, calculating a total of the combined inputs and comparing that total to some threshold level, and finally determining what the output should be.

In addition to the deterministic neuronal model presented above, for some applications of neural networks it is desirable to incorporate a stochastic feature into the dynamics of the neural model. In such a case, the neuronal model is based on a modification of the bi-state neuronal element of McCulloch-Pitts and is permitted to reside in only two states: +1 and -1. The decision of a neuron to alter its state is probabilistic. Thus, the neuron fires (enters the +1 state) with firing probability P(υ), and remains in the -1 state with
probability 1 − P(υ). The firing probability is typically given by a sigmoidal function of the form:

P(υ) = 1 / (1 + exp(−υ/T))   (Eq. 2.15)

In the above formula T is a pseudo-temperature that is incorporated to control the noise level, and thus the uncertainty and the stochastic nature of firing; it should be understood as a parameter that represents the effects of synaptic noise. In the limiting case T → 0, the stochastic neural model reduces to the noiseless (therefore deterministic) form described by the McCulloch-Pitts neural model in (Eq. 2.2) [47].
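
The following Python sketch (all parameter values are illustrative) draws the ±1 state of such a stochastic unit from the firing probability of (Eq. 2.15):

    import math, random

    def firing_probability(v, T):
        # P(v) of (Eq. 2.15); the pseudo-temperature T controls the synaptic noise
        return 1.0 / (1.0 + math.exp(-v / T))

    def stochastic_state(v, T):
        # Return +1 (firing) with probability P(v), otherwise -1 (resting)
        return 1 if random.random() < firing_probability(v, T) else -1

    # As T approaches 0 the unit becomes deterministic (McCulloch-Pitts-like);
    # larger T makes the decision noisier
    for T in (0.01, 0.5, 2.0):
        states = [stochastic_state(0.3, T) for _ in range(1000)]
        print(T, sum(s == 1 for s in states) / 1000.0)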

2.16.7  Supervised and unsupervised learning

The learning algorithm of a neural network can be either supervised or unsupervised. A neural net is said to learn in a supervised manner if the desired output is already known. Neural nets that learn unsupervised have no such target outputs, so it cannot be determined in advance what the result of the learning process will look like. During the learning process, the units (weight values) of such a neural net are "arranged" inside a certain range, depending on the given input values. The goal is to group similar units close together in certain areas of the value range. This effect can be used efficiently for pattern classification purposes [51].

2.16.8  Forward propagation

Forward propagation is a supervised learning algorithm and describes the "flow of information" through a neural net from its input layer to its output layer. The algorithm works as follows (a minimal sketch follows the list):

1. Set all weights to random values ranging from -1.0 to +1.0
2. Set an input pattern (binary values) at the neurons of the net's input layer
3. Activate each neuron of the following layer:
    Multiply the weight values of the connections leading to this neuron with the output values of the preceding neurons
    Add up these values
    Pass the result to an activation function, which computes the output value of this neuron
4. Repeat this until the output layer is reached
5. Compare the calculated output pattern to the desired target pattern and compute an error value
6. Change all weights by adding the error value to the (old) weight values
7. Go to step 2
8. The algorithm ends if all output patterns match their target patterns
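
A minimal Python sketch of this loop, for a single-layer net learning the logical OR of two binary inputs, is given below. The training patterns and the learning rate are invented for the example, and the weight change of step 6 is scaled by the input and a learning rate so that the loop converges:

    import math, random

    def logistic(v):
        return 1.0 / (1.0 + math.exp(-v))

    patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # logical OR

    w = [random.uniform(-1, 1) for _ in range(2)]   # step 1: random weights in [-1, +1]
    b = random.uniform(-1, 1)                       # bias weight

    for epoch in range(5000):                       # step 7: keep presenting patterns
        max_error = 0.0
        for x, target in patterns:                  # step 2: set an input pattern
            v = sum(wi * xi for wi, xi in zip(w, x)) + b
            y = logistic(v)                         # steps 3-4: weighted sum + activation
            error = target - y                      # step 5: compare with the target
            for i in range(len(w)):                 # step 6: adjust the weights
                w[i] += 0.5 * error * x[i]
            b += 0.5 * error
            max_error = max(max_error, abs(error))
        if max_error < 0.1:                         # step 8: outputs match their targets
            break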
2.16.9  Multi-layer Perceptron

The multi-layer perceptron was first introduced by M. Minsky and S. Papert in 1969 [52]. It is a special case of the perceptron whose first-layer units are replaced by trainable threshold logic units in order to allow it to solve non-linearly separable problems. Minsky and Papert called a multi-layer perceptron with one trainable hidden layer a Gamba perceptron. The structure is shown below:

[Figure 2.12 shows a multi-layer perceptron drawn as columns of units: an input layer, a first hidden layer, a second hidden layer and an output layer.]

Figure 2.12  Structure of a multi-layer perceptron.

Each layer is fully connected to the next one. Depending on complexity, performance and implementation considerations, the number of hidden layers may be increased or decreased, with a corresponding increase or decrease in the number of hidden units and connections.

Both the perceptron and the multi-layer perceptron are trained with error-correction learning. But since the hidden units of a multi-layer perceptron have no explicit error signal available, further work on the multi-layer perceptron stalled around 1970, until a method to train multi-layer perceptrons was later discovered. The method is called Back Propagation, or the generalized Delta Rule. With this method, processing is done from the input to the output layer, that is, in the forward direction, after which the computed errors are propagated back in the backward direction to change the weights so as to obtain a better result.
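
The sketch below illustrates this forward-then-backward scheme on a one-hidden-layer perceptron trained on XOR, the classic non-linearly separable problem; the network size, learning rate and logistic activation are illustrative choices rather than values taken from the text:

    import math, random

    def logistic(v):
        return 1.0 / (1.0 + math.exp(-v))

    def forward(x, w_hidden, w_out):
        # Forward direction: input layer -> hidden layer -> output unit
        xb = x + [1]                                  # bias input fixed at 1
        h = [logistic(sum(w * xi for w, xi in zip(wh, xb))) for wh in w_hidden]
        y = logistic(sum(w * hi for w, hi in zip(w_out, h + [1])))
        return h, y

    random.seed(1)
    n_hidden = 3
    w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
    w_out = [random.uniform(-1, 1) for _ in range(n_hidden + 1)]
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]   # XOR patterns
    lr = 0.5

    for epoch in range(20000):
        for x, target in data:
            h, y = forward(x, w_hidden, w_out)
            # Backward direction: compute the output error, then propagate it
            # to the hidden layer (generalized delta rule, logistic derivative)
            delta_out = (target - y) * y * (1 - y)
            delta_h = [delta_out * w_out[j] * h[j] * (1 - h[j]) for j in range(n_hidden)]
            hb, xb = h + [1], x + [1]
            for j in range(n_hidden + 1):
                w_out[j] += lr * delta_out * hb[j]
            for j in range(n_hidden):
                for i in range(3):
                    w_hidden[j][i] += lr * delta_h[j] * xb[i]

    print([round(forward(x, w_hidden, w_out)[1], 2) for x, _ in data])  # should approach [0, 1, 1, 0]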
2.16.10  Strengths of Artificial Neural Networks

 They can handle a wide range of problems
 They provide good results even in complicated domains
 They handle both categorical and continuous variables
 They are available in many off-the-shelf packages

2.16.11  Weaknesses of Artificial Neural Networks

 They require inputs in the range of 0 to 1
 They cannot explain their results
 They may converge prematurely to an inferior solution
2.17 ON-LINE ANALYTICAL PROCESSING
OLAP is the next advance in giving end-users access to data. OLAP tools are client-server tools with an advanced graphical interface talking to an efficient and powerful presentation of the data called a cube. The cube is ideally suited for queries that allow users to slice-and-dice the data in any way they see fit. The cube itself is stored either in a relational database, typically using a star schema, or in a special multi-dimensional database that optimizes OLAP operations. OLAP tools have very fast response times, measured in seconds; SQL queries on a standard relational database would in many cases require hours or days to generate the same information. In addition, OLAP tools provide handy analysis functions that are difficult or impossible to express in SQL.
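
As a small illustration of slice-and-dice, the Python sketch below treats a pandas pivot table as a toy cube; the sales data and column names are invented for the example:

    import pandas as pd

    # Hypothetical fact table: one row per sale
    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South", "North", "South"],
        "product": ["A", "B", "A", "B", "A", "A"],
        "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
        "amount":  [100, 150, 80, 120, 90, 110],
    })

    # Build a small cube: total amount by region x quarter
    cube = sales.pivot_table(values="amount", index="region",
                             columns="quarter", aggfunc="sum")
    print(cube)

    # "Slice" the cube: restrict to product A, then re-aggregate
    print(sales[sales["product"] == "A"]
          .pivot_table(values="amount", index="region",
                       columns="quarter", aggfunc="sum"))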
2.17.1  OLAP and Data Mining

We have to provide feedback to people and use the information from data mining to improve business processes. We need to enable people to provide input, in the form of observations, hypotheses and hunches about which results are important and how to use those results [6]. In the larger solution for exploiting data, OLAP clearly plays an important role as a means of broadening the audience with access to the data.
2.17.2  Strengths of OLAP

 It is a powerful visualization tool
 It provides fast, interactive response times
 It is good for analyzing time series
 It can be used to find some clusters and outliers
 Many vendors offer OLAP products

2.17.3  Weaknesses of OLAP

 Setting up a cube can be difficult
 It does not handle continuous variables well
 Cubes can quickly become out-of-date
 It is not data mining


2.18  DATA MINING APPLICATION AREAS

Other application areas are:
 The health sector
 Food and drug product safety
 Election analysis
 Detection of terrorists or criminals
 etc.
2.19 DATA MINING TOOLS
Many good data mining software products are available [5]:
 Enterprise Miner by SAS
 Intelligent Miner by IBM
 CLEMENTINE by SPSS
 PolyAnalyst by Megaputer
 WEKA (from the University of Waikato in New Zealand), etc.


Given a CSP P = (V, D, C), its dual transformation dual(P) = (V_dual(P), D_dual(P), C_dual(P)) is defined as follows.

V_dual(P) = {c1, …, cm}, where c1, …, cm are called dual variables. For each constraint Ci of P there is a unique corresponding dual variable ci. We use vars(ci) and rel(ci) to denote the corresponding sets vars(Ci) and rel(Ci) (given that the context is not ambiguous).

D_dual(P) = {dom(c1), …, dom(cm)} is the set of domains for the dual variables. For each dual variable ci, dom(ci) = rel(Ci), i.e., each value for ci is a tuple over vars(Ci). An assignment of a value t to a dual variable ci, written ci ← t, can thus be viewed as a sequence of assignments to the ordinary variables x ∈ vars(ci), where each such ordinary variable is assigned the value t[x].

C_dual(P) is a set of binary constraints over V_dual(P) called the dual constraints. There is a dual constraint between dual variables ci and cj if S = vars(ci) ⋂ vars(cj) ≠ ∅. In this dual constraint a tuple ti ∈ dom(ci) is compatible with a tuple tj ∈ dom(cj) iff ti[S] = tj[S], i.e., the two tuples have the same values over their common variables.
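
A small Python sketch of this construction might look as follows; the example CSP and its extensional representation are hypothetical choices:

    from itertools import combinations

    def dual_transform(constraints):
        # One dual variable per constraint; its domain is the constraint's
        # relation (the set of allowed tuples)
        dual_domains = {i: rel for i, (scope, rel) in enumerate(constraints)}
        # Binary dual constraints between dual variables that share ordinary
        # variables: their tuples must agree on the shared variables
        dual_constraints = []
        for i, j in combinations(range(len(constraints)), 2):
            shared = set(constraints[i][0]) & set(constraints[j][0])
            if shared:
                dual_constraints.append((i, j, shared))
        return dual_domains, dual_constraints

    # Hypothetical CSP over x, y, z in {0, 1}: x + y = 1 and y + z = 1,
    # each constraint given as (scope, list of allowed tuples as dicts)
    constraints = [
        (("x", "y"), [{"x": 0, "y": 1}, {"x": 1, "y": 0}]),
        (("y", "z"), [{"y": 0, "z": 1}, {"y": 1, "z": 0}]),
    ]
    doms, dcs = dual_transform(constraints)
    print(dcs)  # [(0, 1, {'y'})]: the two dual variables must agree on y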


Given a CSP P = (V, D, C), its hidden transformation hidden(P) = (V_hidden(P), D_hidden(P), C_hidden(P)) is defined as follows:

V_hidden(P) = {x1, …, xn} ∪ {c1, …, cm}, where {x1, …, xn} is the original set of variables in V (called ordinary variables) and c1, …, cm are dual variables generated from the constraint set C. There is a unique dual variable corresponding to each constraint Ci ∈ C. When dealing with the hidden transformation, the dual variables are sometimes called hidden variables.

D_hidden(P) = {dom(x1), …, dom(xn)} ∪ {dom(c1), …, dom(cm)} is the set of domains for the ordinary and dual variables. For each dual variable ci, dom(ci) = rel(Ci).

Here, as before, V = {x1, …, xn} is a finite set of n variables; D = {dom(x1), …, dom(xn)} is a set of domains, where each variable x ∈ V has a corresponding finite domain of possible values dom(x); and C = {C1, …, Cm} is a set of m constraints, each constraint C ∈ C being a pair (vars(C), rel(C)).

