Turku Centre for Computer Science
TUCS Technical Report
No 1060, October 2012

Peter Sarlin
A Weighted SOM for classifying
data with instance-varying
importance












A Weighted SOM for classifying data with
instance-varying importance




Peter Sarlin
Åbo Akademi University, Department of Information Technologies




Abstract
This paper presents a Weighted Self-Organizing Map (WSOM) that combines the
advantages of the standard SOM paradigm with learning that accounts for instance-
varying importance. While the learning of the classical batch SOM weights data by a
neighborhood function, it is here augmented with a user-specified, instance-specific
importance weight for cost-sensitive classification. By introducing instance-specific
importance to the learning of a SOM, we take a perspective that goes beyond the
common approach of incorporating a cost matrix into the objective function of a
classifier. We compare the WSOM with a classical SOM and logit analysis in two
financial classification tasks: financial crisis prediction and credit scoring. The
significance of instance-varying importance weights is confirmed by the WSOM being
superior in terms of cost-sensitive classification performance in both applications.
When setting the weight to be the importance of an instance for forming clusters, the
WSOM may also be seen as an alternative for cost-sensitive unsupervised clustering.

Keywords: Weighted Self-Organizing Map, instance-varying cost, cost-sensitive
classification






TUCS Laboratory
Laboratory for Data Mining and Knowledge Management


1. Introduction
In the early days of data mining literature, works dealing directly with misclassification
cost issues were a rare occurrence. Since the turn of the century, cost issues have, however,
received wide attention by the research community (see e.g. [1] for an extensive review).
Misclassification costs are most often derived from cost matrices, but can take various forms:
equal misclassification costs, unequal misclassification costs between classes and instance-
varying costs. Varying costs have been acknowledged in evaluation frameworks in various
domains, such as fraud detection [2,3] and financial stability surveillance [4], as well as for
various measures, such as Receiver Operating Characteristics (ROC) curves [5]. It is,
however, not only crucial to acknowledge variation in costs, but also of central importance to
integrate them into the learning of a classifier. For example, Hollmén and Skubacz [6] integrated a
similar cost model as the one in [2] into the learning of a Hidden Markov Model for the
telecommunications fraud domain. Zadrozny and Elkan [7] propose a method called direct
cost-sensitive decision making for modeling directly from cost matrices with instance-varying
costs. Similarly, Fawcett's [8] rule learning system, designed to maximize ROC performance, can
also be used to optimize his instance-varying ROC measure [5]. However, this type of integration is
applicable, and even advisable, with other methods and in other domains as well.
The Self-Organizing Map (SOM) [9] performs simultaneously a clustering/classification
and a projection onto a low-dimensional grid of units. Due to several advantageous features,
such as visual output, simple and intuitive formulation and low computational cost (see e.g.
[10] for an overview), the SOM has become a frequently utilized data analysis method with
more than 10,000 applications. The scope of application is broad not only in terms of the
domains explored, but also in the types of data analysis tasks attempted, such as clustering vs.
classification, clustering vs. visualization and numerical vs. text analysis. The main rationale
for using the SOM for classification over more traditional methods is its inherent local
modeling property and topology preservation of units that enhances the understanding of the
problem (for a review see [11]). The SOM paradigm has been adapted for a wide variety of
specialized tasks, such as the SOTM for exploratory temporal structure analysis [12], the
WEBSOM for text analysis [13] and TSOM for time-series forecasting [14], but only a few
studies have attempted to modify the priority or weighting of data when training a SOM.
Already in his seminal works on SOMs, Kohonen [15] introduced one type of weighting of
data for eliminating border effects at units on the edges of a SOM grid. This was further
developed in [16,17] by weighting data blocks based upon their statistical properties in image
compression. Similar instance weighting may also be applied to balance imbalanced samples
[e.g. 18]. Another type of weighting is to vary the influence of variables on distance
calculations in the matching phase of SOM learning [e.g. 18]. For instance, when creating
binary variables from a categorical variable, Yao et al. [19] suggest setting the influence of
each binary variable to one divided by the number of categories. While the weighting in [18] is
applicable for the task, weighting of the SOM has not, to the best of our knowledge, been
elaborated with a scheme that matches the instance-varying importance of data.
This paper proposes a Weighted SOM (WSOM) that combines the advantages of the
standard SOM paradigm with learning that accounts for instance-varying importance. While
the learning of the classical batch SOM weights data by a neighborhood function (e.g. Voronoi
regions), we propose to augment it with a user-specified, instance-specific importance. Hence,
we take a broader perspective that goes beyond incorporating only a cost matrix into the
objective function of a classifier. The weights may be seen as the importance associated with
misclassifying or mislabeling that instance, but do not have to be explicit costs. The aim of the
WSOM is thus to improve learning from data with instance-varying importance rather than
optimizing an objective function based upon classification performance. To this end, the
WSOM is not restricted to only classification tasks. It may also be seen as a feasible
alternative for cost-sensitive unsupervised clustering, as the weight could also represent
importance of an instance for forming clusters. Hence, rather than introducing a complex
novel algorithm based upon the minimization of an objective function as per costs of
instances, the appeal of this approach is threefold: it is simple, it relies on a solid range of
previous SOM research and, like the SOM, it is applicable to a wide range of tasks, e.g.,
classification, clustering and visualization.
While the WSOM is applicable in any domain with available instance-varying costs, the
focus in this paper lies on two financial classification tasks: financial crisis prediction and
credit scoring. The SOM, and thus also its weighted counterpart, is originally an unsupervised
learning algorithm; however, the rationale for experiments in classification tasks is the
evaluation of their performance. Following Sarlin [4], we derive an evaluation framework that
includes a loss function and Usefulness measure of a decision maker with user-specified costs
of type 1 and 2 errors, as well as instance-specific importance. This enables testing whether,
and to what extent, an inclusion of instance-specific importance in the training algorithm leads
to improved classification performance. First, we apply the WSOM to the prediction of
country-level systemic financial crises. This is particularly motivated as Early Warning
Systems (EWSs) commonly utilize global pooled panel data, i.e. cross-sectional time-series
data. Hence, while there are large variations in the importance or systemic relevance of
individual countries, the definition of importance depends entirely on the perspective of the
decision maker. It is also obvious that giving false alarms and missing crises would
significantly differ in costs to the decision maker. Second, we apply the WSOM to a UCI
repository dataset on German credit scoring, which similarly has significant differences in
costs related to the credit amount and the type of error. Hence, this paper may be seen as an
extension of two prior works. First, the weights of the WSOM relate to an existing cost-
sensitive evaluation framework [4] in that the learning of the WSOM accounts for the
instance-varying weights from that framework. Second, we extend the learning of the SOM-
based EWS in Sarlin and Peltonen [20] to include observation-specific weights.
The paper is organized as follows. Section 2 introduces the SOM, the WSOM, and their
properties, visualizations and evaluation, as well as the classification evaluation framework
applied in the paper. Section 3 applies the WSOM in two financial real-world settings:
financial crisis prediction and credit scoring. Finally, Section 4 concludes by presenting key
findings, as well as future research directions.

2. Methodology
This section introduces the methods used in the paper. First, we describe the standard SOM
and its weighted counterpart, the WSOM. Second, we describe decision goals and the
evaluation of two-class predictions from the viewpoint of a decision maker.
2.1 The Self-Organizing Map (SOM)
The SOM [9] is a neural network based method for clustering and projection. As the
WSOM is only a minor modification of the specification of the classical SOM algorithm, we first derive
the standard SOM and then its weighting. In this paper, we focus on a batch version of the
SOM training algorithm that processes all data simultaneously instead of in sequences.
Motivations for using the batch algorithm are the reduction of computational cost and
reproducible results (given similar initializations). We start the training process by setting the
reference vectors along the direction of the two principal components of the input data. This type
of initialization has also been shown to be important for convergence when using the batch
SOM [21], in addition to its obvious advantages of decreased computational cost and
reproducibility.
Following Kohonen [9], the SOM training iterates through t = 1,2,… in two steps. In the first
step, each input data vector $x$ is assigned to the best-matching unit (BMU) $m_c$:

$$\| x - m_c(t) \| = \min_i \| x - m_i(t) \|. \qquad (1)$$
In the second step, each reference vector $m_i$ (where i = 1,2,…,M) is adjusted using the batch
update formula:

$$m_i(t+1) = \frac{\sum_{j=1}^{N} h_{ic(j)}(t)\, x_j}{\sum_{j=1}^{N} h_{ic(j)}(t)}, \qquad (2)$$
where index j indicates the input data vectors that belong to node c, and N is the number of
data vectors. The neighbourhood $h_{ic(j)} \in (0,1]$ is defined as the following Gaussian function:

$$h_{ic(j)} = \exp\left( -\frac{\| r_c - r_i \|^2}{2\sigma^2(t)} \right), \qquad (3)$$
where $\| r_c - r_i \|^2$ is the squared Euclidean distance between the coordinates of the reference
vectors $m_c$ and $m_i$ on the two-dimensional grid, and the radius of the neighbourhood $\sigma(t)$ is a
monotonically decreasing function of time t.
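To make the batch formulation concrete, the following is a minimal NumPy sketch of one training iteration over equations (1)-(3), assuming a data matrix X of shape (N, d), reference vectors M of shape (units, d) and fixed two-dimensional grid coordinates; the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def batch_som_step(X, M, grid_coords, sigma):
    """One iteration of the batch SOM: matching (1), Gaussian
    neighborhood (3) and the batch update (2)."""
    # Step 1: assign each input vector x_j to its best-matching unit c(j).
    dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)   # (N, units)
    bmu = dists.argmin(axis=1)                                      # c(j) for each j

    # Squared grid distances ||r_c - r_i||^2 between unit coordinates, eq. (3).
    diff = grid_coords[:, None, :] - grid_coords[None, :, :]
    H = np.exp(-(diff ** 2).sum(axis=2) / (2.0 * sigma ** 2))       # (units, units)

    # Step 2: batch update, eq. (2): neighborhood-weighted mean of the data.
    h = H[:, bmu]                                                   # h_{ic(j)}, shape (units, N)
    return (h @ X) / h.sum(axis=1, keepdims=True)
```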

2.2 The Weighted SOM (WSOM)
In many situations the importance of each instance is not constant. To this end, we need to
adapt the SOM algorithm with a learning rule that incorporates the importance of each
instance. The approach used herein is simple and intuitive, as it turns the classical SOM
specification into its weighted counterpart, the WSOM, while preserving the general properties
of the SOM. The WSOM is nothing more than the batch SOM that adjusts the importance of
each instance $x_j$ with a weight $w_j$.¹ While this does not affect the matching in (1), multiplying the
neighborhood $h_{ic(j)}$ of each instance $x_j$ with its corresponding weight $w_j$ in the update
formula in (2) gives $x_j$ its proper amount of influence over the estimated reference vectors $m_i$.
Hence, the update formula of the WSOM takes the following form:
$$m_i(t+1) = \frac{\sum_{j=1}^{N} w_j\, h_{ic(j)}(t)\, x_j}{\sum_{j=1}^{N} w_j\, h_{ic(j)}(t)}, \qquad (4)$$


¹ Weighting would also be applicable for the sequential algorithm. The weight should, however, be applied to the learning
rate α rather than to each data point in (4). Kohonen [9] further reminds that $\alpha w_j < 1$ is required to guarantee stability, which implies that
weighting would not be applicable during the first training cycles with a large α. This is not, however, a concern with the batch
algorithm.
where weight $w_j$ represents the importance of $x_j$ for the learning of patterns. While the
sum or relative size of the weights $w_j$ need not fulfill any specific constraints, some
conditions may facilitate their interpretability, such as $\sum_{j=1}^{N} w_j(s) = 1$ for any time unit
$s = 1,2,\ldots,S$ and $\sum_{s=1}^{S} \sum_{j=1}^{N} w_j(s) = S$. The weights $w_j$ can obviously also be used for equal
sampling of classes, as well as for other desired aspects.
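In code, the step from (2) to (4) amounts to a one-line change to the batch update sketched in Section 2.1: the neighborhood value of each instance is multiplied by its weight before averaging. A minimal sketch, reusing the hypothetical names of that earlier snippet:

```python
def batch_wsom_update(X, bmu, H, w):
    """Weighted batch update, eq. (4): matching (1) is unchanged, but each
    instance x_j now contributes in proportion to w_j * h_{ic(j)}."""
    h = H[:, bmu] * w[None, :]             # w_j * h_{ic(j)}, shape (units, N)
    return (h @ X) / h.sum(axis=1, keepdims=True)
```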
As the WSOM is only a minor modification of the classical SOM, aspects related to the
computation, visualization and evaluation of the two methods do not differ significantly.
While we do not provide an overview here, it is important to note that the added value of
possible extensions of the SOM should also be considered for the WSOM. For instance, the
same algorithmic short-cuts and extensions for improving SOM performance apply to the
WSOM as well. The output of the WSOM may also be visualized and evaluated with similar
methods as the classical SOM. We, however, limit the focus of this paper to classification
performance. Similarly, rather than the standard quality measures, such as quantization and
topographic errors, which would also be applicable for the WSOM, we focus on classification
performance when calibrating the models.
An adaptation of the standard SOM paradigm that is also applicable for the WSOM, and
tested in this paper, is the use of the method in a semi-supervised manner (see e.g. Kohonen's
Hypermap [22]). It is most common to use the SOM for unsupervised learning, where the
explanatory variables, or inputs, are used for learning previously unknown patterns in data.
However, if one possesses, for instance, class labels, they can be used to supervise learning for
a classification task. The main rationale for using the SOM over more traditional methods for
classification is its inherent local modeling property and topology preservation of units that
enhances the understanding of the problem, as well as the availability of, for instance, growing
architectures that facilitate the choice of parsimony (for a thorough review see [11]). While
unsupervised versions use only the explanatory variables in the matching formula (1), the
supervision of the semi-supervised versions is introduced by the use of both the explanatory
and the class variables in matching. Both unsupervised and supervised versions may or may
not include the classes in the batch update formula (2) without affecting the general learning
procedure.
2.3 Evaluating classification performance
An important part of classification tasks is the evaluation of the results and the measures
used for setting thresholds, or cut-off values, for probability forecasts. The importance of an
evaluation framework that accounts for varying misclassification costs is further highlighted in
this work as we attempt to test how an inclusion of this variation in the training algorithm
affects model performance. The framework applied here follows that in Sarlin [4]. We derive a
loss function and Usefulness measure for a cost-aware decision maker with class and instance-
varying misclassification costs.
The occurrence of an event of interest is represented with a binary state variable $I_j(h) \in \{0,1\}$,
where the index j = 1,2,…,N represents instances and h is a specified forecast horizon. Various
methods can be used for turning univariate or multivariate data into probability forecasts
$p_j \in [0,1]$ of the occurrence of the event. To mimic an ideal indicator $I_j(h)$, the probabilities $p_j$ need to
be transformed into binary point forecasts $P_j \in \{0,1\}$ that equal one if $p_j$ exceeds a specified
threshold $\lambda$ and zero otherwise. The correspondence between $P_j$ and $I_j$ can be summarized
into a so-called contingency matrix (frequencies of prediction-realization combinations): false
positives (FP), true positives (TP), false negatives (FN) and true negatives (TN).

While the entries of a contingency matrix can be used to define a large palette of goodness-of-
fit measures, such as overall accuracy, we approach the problem from the viewpoint of a
decision maker who is concerned about committing two types of errors: type 1 and type 2 errors. Type 1
errors represent the conditional probability $P(p_j \le \lambda \mid I_j(h) = 1)$, estimated from data as the
share of false negatives to all positives ($T_1 = FN/(FN + TP)$), and type 2 errors the conditional
probability $P(p_j > \lambda \mid I_j(h) = 0)$, estimated from data as the proportion of false positives to all
negatives ($T_2 = FP/(FP + TN)$). Given the probabilities $p_j$ of a model, the decision maker should
focus on choosing a threshold $\lambda$ such that her loss is minimized. To account for imbalances in
class size, the loss of a decision maker consists not only of $T_1$ and $T_2$ but also of the unconditional
probabilities of positives $P_1 = P(I_j(h) = 1)$ and negatives $P_2 = P(I_j(h) = 0) = 1 - P_1$. The frequency-
weighted errors are then further weighted by the policymaker's relative preferences between FNs
($\mu \in [0,1]$) and FPs ($1 - \mu$). This parameter may either be directly specified by the decision maker
or derived from a benefit/cost matrix. A standard 2x2 benefit matrix may easily be
manipulated to only include error costs by scaling and shifting the entries of its columns [3,5]. For
instance, the costs c for the entries of the matrix can be reduced to a simpler matrix of class-
specific costs $c_1$ and $c_2$ with one degree of freedom: $c_1 = c_{FN} - c_{TP}$ and $c_2 = c_{FP} - c_{TN}$. These
class-specific costs may then easily be turned into relative preferences $\mu = c_1/(c_1 + c_2)$ and
$1 - \mu = c_2/(c_1 + c_2)$. Finally, the loss function is as follows:

$$L(\mu) = \mu\, T_1 P_1 + (1 - \mu)\, T_2 P_2. \qquad (5)$$
The specification of the loss function $L(\mu)$ enables computing the Usefulness of a model. A
decision maker could achieve a loss of $\min(P_1, P_2)$ by always issuing a signal of a crisis if
$P_1 > 0.5$ or never issuing a signal if $P_2 > 0.5$. When also paying regard to the preferences
between errors, we achieve the loss $\min(\mu P_1, (1 - \mu) P_2)$ when ignoring the model. The Usefulness
$U_a$ of a model is computed by subtracting the loss generated by the model from the loss of
ignoring it:

$$U_a(\mu) = \min(\mu P_1, (1 - \mu) P_2) - L(\mu). \qquad (6)$$

This measure highlights the fact that achieving beneficial models on highly imbalanced
data is challenging, as a non-perfectly performing model is easily worse than always signaling
the frequent class. Hence, already an attempt to build a predictive model with imbalanced data
implicitly demands a decision maker to be more concerned about the rare class. Further, we
use a measure that computes absolute Usefulness $U_a$ as a percentage of a model's available
Usefulness $\min(\mu P_1, (1 - \mu) P_2)$:

$$U_r(\mu) = \frac{U_a(\mu)}{\min(\mu P_1, (1 - \mu) P_2)}. \qquad (7)$$
The relative Usefulness $U_r$ computes absolute Usefulness $U_a$ as a share of the Usefulness
that a decision maker would gain with a perfectly performing model. Hence, $U_r$ is nothing
more than a rescaled measure of $U_a$. Yet, $U_r$ provides means for a better assessment of
Usefulness by extracting a number with a meaningful interpretation; performance can be
compared in terms of percentage points. When interpreting models, we can hence focus solely
on $U_r$.
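As a worked illustration of (5)-(7), the following sketch computes the loss and the absolute and relative Usefulness from binary point forecasts and realized states; the function and variable names are ours, not the paper's.

```python
import numpy as np

def usefulness(P, I, mu):
    """Loss L(mu), eq. (5), and Usefulness U_a and U_r, eqs. (6)-(7),
    from binary point forecasts P and realized states I."""
    TP = np.sum((P == 1) & (I == 1))
    FN = np.sum((P == 0) & (I == 1))
    FP = np.sum((P == 1) & (I == 0))
    TN = np.sum((P == 0) & (I == 0))

    T1 = FN / (FN + TP)                      # type 1 error rate
    T2 = FP / (FP + TN)                      # type 2 error rate
    P1 = (TP + FN) / len(I)                  # unconditional probability of positives
    P2 = 1.0 - P1

    L = mu * T1 * P1 + (1 - mu) * T2 * P2    # eq. (5)
    available = min(mu * P1, (1 - mu) * P2)  # loss of ignoring the model
    Ua = available - L                       # eq. (6)
    Ur = Ua / available                      # eq. (7)
    return L, Ua, Ur
```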
The representation of the decision maker's preferences may still be augmented by
accounting for instance-varying differences in importance. Let $w_j$ be an instance-varying
weight that approximates the importance of each instance j, as specified by the decision maker,
and let $TP_j$, $FP_j$, $FN_j$ and $TN_j$ be binary vectors of prediction-realization combinations rather
than only their sums. By multiplying each binary element of the contingency matrix by $w_j$, the
elements of the contingency matrix become importance-weighted sums, which may be used
similarly as non-weighted sums for computing $T_1$, $T_2$, $P_1$ and $P_2$. Let the elements of $T_1$ and
$T_2$ be weighted by $w_j$ to obtain weighted type 1 and 2 errors:

$$T_1^w = \sum_{j=1}^{N} w_j FN_j \Big/ \sum_{j=1}^{N} w_j \left( FN_j + TP_j \right), \qquad (8)$$

$$T_2^w = \sum_{j=1}^{N} w_j FP_j \Big/ \sum_{j=1}^{N} w_j \left( FP_j + TN_j \right). \qquad (9)$$

As $T_1^w \in [0,1]$ and $T_2^w \in [0,1]$ are ratios of sums of weights rather than sums of binary values,
they now replace $T_1$ and $T_2$ in (5)-(7). Similarly, weighted unconditional probabilities $P_1^w$ and
$P_2^w$ replace the non-weighted $P_1$ and $P_2$. This enables us to derive $L(\mu, w_j)$, $U_a(\mu, w_j)$ and
$U_r(\mu, w_j)$ for given preferences and weights.
The above framework can be related to the more common cost-matrix approach [e.g. 3,5]
(see Table II for an example). After some simple algebra, the loss function $L(\mu, w_j)$ takes the
following form:

$$L(\mu, w_j) = \frac{\sum_{j=1}^{N} \left( \mu\, w_j FN_j + (1 - \mu)\, w_j FP_j \right)}{\sum_{j=1}^{N} w_j \left( TP_j + FN_j + FP_j + TN_j \right)}. \qquad (10)$$

This corresponds to an instance-varying cost matrix with costs $\mu w_j$ for FNs and $(1 - \mu) w_j$ for
FPs. While constants could be added to these entries and their scaling may be modified, our
approach favors simplicity. Hence, the rationale for preferring this framework is that it enables
setting relative preferences between the errors, and it includes a simple instance-varying weight
that also functions as an input in the weighting of learning algorithms. Setting specific costs for
each entry of the cost matrix is a difficult task in a real-world setting, not only because the
problem with two degrees of freedom may be difficult to untangle, but also because most often the
exact values of cost matrix entries are unknown.
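A weighted counterpart of the previous sketch, following (8)-(10): the binary prediction-realization vectors are scaled by the weights $w_j$ before the error rates and class shares are formed. Again, the names are illustrative and NumPy arrays are assumed.

```python
def weighted_usefulness(P, I, w, mu):
    """Weighted loss and Usefulness: T1w and T2w of eqs. (8)-(9) replace
    T1 and T2, and weighted class shares replace P1 and P2, in (5)-(7)."""
    TPj = (P == 1) & (I == 1)
    FNj = (P == 0) & (I == 1)
    FPj = (P == 1) & (I == 0)
    TNj = (P == 0) & (I == 0)

    T1w = (w * FNj).sum() / (w * (FNj | TPj)).sum()   # eq. (8)
    T2w = (w * FPj).sum() / (w * (FPj | TNj)).sum()   # eq. (9)
    P1w = (w * (TPj | FNj)).sum() / w.sum()           # weighted share of positives
    P2w = 1.0 - P1w

    L = mu * T1w * P1w + (1 - mu) * T2w * P2w         # weighted version of eq. (5)
    available = min(mu * P1w, (1 - mu) * P2w)
    return L, available - L, (available - L) / available
```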

3. Experiments with the WSOM
This section presents experiments on data with instance-varying costs in two financial
settings: financial crisis prediction and credit scoring. We compare the performance of the
WSOM with a classical SOM and with logit analysis. For setting the free parameters of the
SOM and WSOM, we follow a training schedule that focuses on classification performance.
After fitting the logit model, we set its absolute Usefulness $U_a(\mu, w_j)$ as a benchmark. We fit
the SOM and WSOM for as many iterations as needed for $U_a(\mu, w_j)$ on the train set to be equal
to or larger than that of the logit model. The radius of the neighbourhood $\sigma$ is set to begin at
half the diagonal of the grid ($\sigma^2 = (X^2 + Y^2)/2$), from where it decreases monotonically
towards zero. While comparisons of SOM and WSOM models with the same parameters
would be sufficient for assessing whether the inclusion of the weights improves performance,
we can now also test how the performance relates to a standard binary-choice method. Logit
analysis is a solid benchmark as it is one of the most common two-class prediction methods
used in both domains. While we do acknowledge that using logit performance as a stopping
criterion is not an optimal procedure for calibrating a classification model, the main focus of
this exercise is to compare relative performance rather than exploring how to derive optimal
models. Further, we test the differences between two types of SOM and WSOM models:
unsupervised and semi-supervised.
3.1 Predicting systemic financial crises
Recent financial crises have had global repercussions, impacting not only the source of
distress but also a large share of other economies, and they therefore also require a global policy
approach. This has led to a common feature of Early Warning Systems (EWSs): they utilize
pooled panel data (see e.g. [4,20,23-25]). This may also be motivated by the relatively small
number of crises in individual countries and by the aim to capture a wide variety of crises.
Hence, while having a time dimension, the pooled panel data include cross sections as well. In
an evaluation framework, as well as in a learning algorithm, this leads to the need for weighting
countries in terms of their importance for the policymaker (e.g., systemic relevance). To this
end, by integrating entity-specific and time-varying misclassification costs for a policymaker
into the learning rule of the WSOM, we attempt to build a model that better learns the patterns
of relevant instances. For instance, the repercussions of not calling a crisis in the US and in
Finland differ significantly.
The dataset includes quarterly data for 28 countries, 18 emerging market and 10 advanced
economies, for the period 1990:1–2010:4 (summing to 1,729 observations). While the data
are an unbalanced panel, the number of missing values is small and relates only to less than
5% of early dates, when data availability was a limitation in some emerging
market economies. The occurrence of a crisis can be represented with a binary state variable
$I_j(0) \in \{0,1\}$ (where instance j = 1,2,…,N). The crisis definition follows that in Lo Duca and
Peltonen [23] by using a Financial Stress Index (FSI) of five components: the spread of the 3-
month interbank rate over the 3-month government bill rate, quarterly equity returns, equity
index volatility, exchange-rate volatility, and volatility of the yield on the 3-month
government bill. A crisis is defined to occur if the FSI of an economy exceeds its country-
specific 90th percentile. That is, out of the data from 1990:1–2010:4, we define 10% of the
quarters to be systemic events. The threshold is derived such that the events have led, on
average, to negative consequences for the real economy. Predicting the exact timing of distress
does not, however, provide enough reaction time for a decision maker. The wide variety and
changing nature of triggers may also complicate the task of identifying exact timings. To
enable policy actions for decreasing further build-up of vulnerabilities, the focus is on
identifying pre-crisis periods $I_j(h) \in \{0,1\}$ with a forecast horizon of h=24 months. Thus, we focus on
predicting vulnerable states, where one or multiple triggers could lead to a systemic crisis. The
literature has confirmed the existence of common patterns preceding financial crises [26,27].
Let $I_j(h)$ hence equal one 24 months prior to the crisis episodes and zero otherwise. The
dataset also consists of 14 macro-financial indicators that proxy for a large variety of sources
of financial stress, such as asset price developments and valuations and credit developments
and leverage, as well as traditional macroeconomic measures, such as GDP growth and current
account imbalances. The variables are defined both on a domestic and a global level, where
the latter is an average of data for the Euro area, Japan, UK and US.
The dataset is partitioned into two sets: the train set (1990:4–2005:1) and the test set (2005:2–
2009:2). We take the perspective of an external observer by setting the weight $w_j$ to be the
share of stock-market capitalization of country i in period t of the sum of stock-market
capitalization in the sample in period t. Though this does not follow decision theory, we see
this as a good proxy of systemic importance when the real costs are unknown. Hence, we use a
simplified measure of systemic relevance to gauge the importance of each entity for the
system.
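A minimal sketch of how such weights could be formed from a panel, assuming a pandas DataFrame with hypothetical columns 'quarter' and 'mcap' (stock-market capitalization); normalizing within each quarter also satisfies the per-period condition $\sum_{j} w_j(s) = 1$ noted in Section 2.2.

```python
import pandas as pd

def market_cap_weights(panel: pd.DataFrame) -> pd.Series:
    """Weight w_j: a country's share of the sample's total stock-market
    capitalization within its own quarter."""
    return panel["mcap"] / panel.groupby("quarter")["mcap"].transform("sum")
```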

We follow the standard SOM model in Sarlin and Peltonen [20] by setting the size of the
SOM and WSOM grids to 13x10 units. We assume the relative costs of missing a crisis and
giving a false alarm to be $\mu = 0.8$ and hence $1 - \mu = 0.2$. Conventional logit analysis and a
standard SOM model function as benchmark models. The logit model follows the standard
pooled estimation setting. Since the logit model is a replication of that in [20], interested
readers are referred to that paper for further details on the experimental set-up and
estimation methodology.
Table I presents the classification results for the WSOM and the benchmark SOM and logit
models on train and test data. While the calibration of the models follows the above-presented
schedule, the Usefulness of the models is tested with both weighted ($U_r(\mu, w_j)$) and non-
weighted ($U_r(\mu)$) performance. The table illustrates several findings on the WSOM's relative
performance on these data. It is worth noting that the focus is on test performance and relative
Usefulness $U_r$, as train performance is similar by definition and $U_r$ is only a scaled version of
$U_a$. First, we concentrate on the weighted results $U_r(\mu, w_j)$. Comparing results of the
unsupervised models and the logit model, unsupWSOM outperforms its competitors by a large
margin (more than 11 percentage points). A comparison of the semi-supervised versions also
shows the superiority of the WSOM (more than 7 percentage points), while it is inferior to its
unsupervised counterpart, unsupWSOM, by a margin of 1 percentage point. The unsupSOM
outperforms the supSOM and the logit model. It is also central to notice that the WSOM needs
fewer training iterations to fulfill the stopping criterion (equal to the Usefulness on training
data of the logit model). This is seemingly a result of a better correspondence between the
evaluation and the learning algorithm. Second, when comparing the non-weighted results $U_r(\mu)$ of all
models, we observe close to opposite relative results. The poor performance of unsupWSOM and
supWSOM illustrates the relevance of a correspondence between weighting in the evaluation
and the learning of the algorithms. The plots of weighted relative Usefulness $U_r(\mu, w_j)$ for all
thresholds on train and test data in Fig. 1 show sensitivity to changes in thresholds (and thus
also preferences). The vertical lines represent optimal thresholds for each model. In general,
the figures show that a decision maker has to be relatively more concerned about missing
crises than about giving false alarms, and that model performance for different thresholds
differs significantly, such as the superiority of supWSOM for $\lambda \in [0.2, 0.3]$. It is worth noting
that the interpretation of the graph is not entirely straightforward, as a change in $U_r(\mu, w_j)$ of the
logit model may affect the training schedule of the SOM-based models. Overall, the WSOM
showed superior performance on the test data when evaluated as a cost-sensitive classifier
accounting for instance-varying weights.














TABLE I. WEIGHTED AND NON-WEIGHTED CLASSIFICATION PERFORMANCE OF FINANCIAL CRISES

                                                        Weighted Usefulness             Non-weighted Usefulness
Method      Semi-sup./Unsup.  Dataset  Iterations  μ     λ     U_a(μ,w_j)  U_r(μ,w_j)    λ     U_a(μ)  U_r(μ)
Logit       -                 Train    -           0.80  0.21  0.09        61.50 %       0.14  0.06    41.20 %
unsupSOM    Unsup.            Train    3           0.80  0.19  0.10        63.60 %       0.07  0.10    64.30 %
unsupWSOM   Unsup.            Train    1           0.80  0.12  0.10        64.20 %       0.21  0.09    61.20 %
supSOM      Semi-sup.         Train    3           0.80  0.12  0.10        69.10 %       0.20  0.08    52.70 %
supWSOM     Semi-sup.         Train    1           0.80  0.15  0.10        66.40 %       0.17  0.08    53.70 %
Logit       -                 Test     -           0.80  0.21  0.02        10.70 %       0.14  0.03    19.10 %
unsupSOM    Unsup.            Test     3           0.80  0.19  0.01         8.50 %       0.07  0.05    31.90 %
unsupWSOM   Unsup.            Test     1           0.80  0.12  0.05        32.00 %       0.21  0.02    11.30 %
supSOM      Semi-sup.         Test     3           0.80  0.12  0.04        23.50 %       0.20  0.04    24.00 %
supWSOM     Semi-sup.         Test     1           0.80  0.15  0.05        30.80 %       0.17  0.01     8.70 %

Notes: The train set spans 1990:4–2005:1 and the test set 2005:2–2009:2. Threshold λ is set to optimize in-sample U_a(μ) and
U_a(μ,w_j), while the same λ is applied to the out-of-sample data. Bolded figures highlight the best-performing semi-supervised and
unsupervised SOM-based models for weighted and non-weighted evaluations on the test set. The abbreviations are as follows: λ, threshold;
U_a(μ), non-weighted absolute Usefulness; U_r(μ), non-weighted relative Usefulness; U_a(μ,w_j), weighted absolute Usefulness; U_r(μ,w_j),
weighted relative Usefulness. The weights w_j represent the proportion of stock-market capitalization of country i in period t to the sum
of stock-market capitalization in the sample in period t.


[Two-panel figure: weighted relative Usefulness U_r(μ,w_j) plotted against thresholds λ on train data and on test data for the Logit, unsupSOM, unsupWSOM, supSOM and supWSOM models; vertical lines mark each model's optimal threshold.]
Figure 1. Weighted relative Usefulness on train and test data for all financial crisis models.

TABLE II. COST MATRIX FOR THE CREDIT SCORING APPLICATION

                      Actual
Predicted        Good        Bad
Good               0           5
Bad                1           0
3.2 Credit scoring
The second application tests performance on the German credit scoring dataset, which was
provided to the UCI Machine Learning Repository [28] by Prof. Hans Hofmann. The
dataset includes information on 1,000 past credit applicants, described by 24 numerical
variables and their credit rating. The explanatory variables contain information about a person
requesting a loan, including various demographic data, a summary of credit history, and the
amount of the loan. As we want to obtain a model that may be used to determine whether new
applicants present a good or bad credit risk, the predicted variable $I_j(h) \in \{0,1\}$ does not have a
time dimension and thus has a forecast horizon of h=0. The dataset is randomly partitioned
into 80% for training and 20% for testing. We apply the cost matrix shown in Table II,
provided by Hofmann, to set the preference parameter $\mu \in [0,1]$. While it has been asserted to be
somewhat erroneous [3], it may still be used for setting relative preferences. Thus, the costs c
for the elements of the contingency matrix are reduced to the following class-specific costs:
$c_1 = c_{FN} - c_{TP} = 5$ and $c_2 = c_{FP} - c_{TN} = 1$. This leads to $\mu = c_1/(c_1 + c_2) \approx 0.83$ and
$1 - \mu = c_2/(c_1 + c_2) \approx 0.17$.
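Spelled out numerically, the conversion from Table II to the preference parameter is then as follows (a sketch in which the positive class, the event of interest, is taken to be a bad credit risk):

```python
# Entries of Table II: misclassifying a bad applicant as good (an FN when
# "bad" is the event of interest) costs 5; rejecting a good applicant costs 1.
c_FN, c_TP = 5, 0
c_FP, c_TN = 1, 0

c1 = c_FN - c_TP        # class-specific cost of type 1 errors
c2 = c_FP - c_TN        # class-specific cost of type 2 errors
mu = c1 / (c1 + c2)     # preference parameter: 5/6, approximately 0.83
```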
This type of data with no time dimension obviously needs to include a variety of different
customers (the cross-sectional dimension). Again, as in the financial crisis case, this leads
to the need for weighting customers in terms of their importance for the decision maker. The
instance-varying costs in credit scoring are very explicit in nature and clearly illustrate the
need to weight customers differently. The cost of failing in the scoring of a customer's credit
application is directly related to the size of the applied loan. Thus, we specify the weight $w_j$ to
gauge the importance of each customer, not only for the learning of the model but also for the
evaluation of its performance.
Following the first application, we set the size of the SOM and WSOM grids to 13x10
units. Table III presents the credit scoring results for the WSOM, SOM and logit models on
train and test data. The calibration of the models again follows the above-presented training
schedule. The findings on the test set mainly corroborate those of the financial crisis
application. First, when comparing the weighted results $U_r(\mu, w_j)$ of all models, both WSOM
models again outperform their competitors. While the unsupWSOM outperforms the SOM and
the logit model by margins of 16 and 1.2 percentage points, the supWSOM still improves
performance by close to 4 percentage points. Hence, the margin to the logit model is not as
large as in the previous application. Consequently, the logit model is superior to both SOM
models. The number of iterations needed to reach the performance of the logit model is again
shown to be lower for both WSOM models. Second, the non-weighted results $U_r(\mu)$ of all
models again show poor performance for both WSOM models, while the logit model
outperforms the SOM models by a large margin (more than 10 percentage points). A
somewhat counterintuitive result is that unsupWSOM is better than unsupSOM, but this might
also reflect the poor weighted performance of the unsupSOM. The plots of weighted relative
Usefulness $U_r(\mu, w_j)$ for all thresholds in Fig. 2 illustrate the variation in performance
depending on thresholds (and thus preferences). For instance, the poorly performing supSOM
is shown to perform well for $\lambda \in [0.3, 0.45]$, while supWSOM and unsupWSOM are shown to be
competitive for most of the thresholds.

TABLE III. WEIGHTED AND NON-WEIGHTED CLASSIFICATION PERFORMANCE OF CREDIT APPLICANTS

                                                        Weighted Usefulness             Non-weighted Usefulness
Method      Semi-sup./Unsup.  Dataset  Iterations  μ     λ     U_a(μ,w_j)  U_r(μ,w_j)    λ     U_a(μ)  U_r(μ)
Logit       -                 Train    -           0.83  0.17  0.06        38.60 %       0.26  0.06    37.60 %
unsupSOM    Unsup.            Train    8           0.83  0.21  0.07        42.50 %       0.21  0.06    41.80 %
unsupWSOM   Unsup.            Train    4           0.83  0.29  0.07        45.80 %       0.21  0.06    38.90 %
supSOM      Semi-sup.         Train    12          0.83  0.21  0.06        39.20 %       0.21  0.04    28.70 %
supWSOM     Semi-sup.         Train    2           0.83  0.23  0.06        40.20 %       0.21  0.04    27.70 %
Logit       -                 Test     -           0.83  0.17  0.05        32.90 %       0.26  0.06    38.40 %
unsupSOM    Unsup.            Test     8           0.83  0.21  0.03        17.50 %       0.21  0.03    21.80 %
unsupWSOM   Unsup.            Test     4           0.83  0.29  0.05        34.10 %       0.21  0.04    22.20 %
supSOM      Semi-sup.         Test     12          0.83  0.21  0.03        17.10 %       0.21  0.04    27.70 %
supWSOM     Semi-sup.         Test     2           0.83  0.23  0.06        37.90 %       0.21  0.04    23.40 %

Notes: The train set includes 800 applicants and the test set 200. Threshold λ is set to optimize in-sample U_a(μ) and U_a(μ,w_j), while
the same λ is applied to the out-of-sample data. Bolded figures highlight the best-performing semi-supervised and unsupervised SOM-based
models for weighted and non-weighted evaluations on the test set. The abbreviations are as follows: λ, threshold; U_a(μ), non-weighted
absolute Usefulness; U_r(μ), non-weighted relative Usefulness; U_a(μ,w_j), weighted absolute Usefulness; U_r(μ,w_j), weighted relative
Usefulness. The weights w_j represent the credit amount of an applicant.


[Two-panel figure: weighted relative Usefulness U_r(μ,w_j) plotted against thresholds λ on train data and on test data for the Logit, unsupSOM, unsupWSOM, supSOM and supWSOM models.]

Figure 2. Weighted relative Usefulness on train and test data for all credit scoring models.

4. Conclusions and future research
This paper has presented a Weighted Self-Organizing Map (WSOM). The WSOM
combines the advantages of the standard SOM paradigm with learning that accounts for
instance-varying importance by augmenting its learning with a user-specified, instance-
specific importance weight. The WSOM is compared to a classical SOM and logit analysis in
two financial classification tasks: financial crisis prediction and credit scoring. The
WSOM is shown to be superior in both financial settings in terms of
cost-sensitive classification performance. Future research should focus on the use of the
WSOM for unsupervised tasks. While the importance of weighting in classification is
straightforward, such as the size of a loan in classification of credit applicants, we illustrate the
less self-evident relevance for unsupervised tasks with two examples in customer
segmentation. First, customers oftentimes have a non-uniform importance for the analyst, e.g.
weights could represent potential in terms of income, wealth or distance to the store. Treating
all instances equally gives less relevant customers more influence than they should have, and
vice versa. Second, clustering a specific type of entity, whose crisp separation from a large
sample is not feasible, is a difficult task. A measure that indicates the extent to which instances
resemble the desired type of entity could function as a weight, such as in the segmentation of
eco-conscious customers with weights set as per the share of organic products in their product basket.
By setting the weight to be the importance of an instance for forming clusters, we also open
the door for a wide range of cost-sensitive unsupervised clustering applications.
References
[1] S. Lomax and S. Vadera, "A Survey of Cost-Sensitive Decision Tree Induction Algorithms," ACM Computing Surveys, in press.
[2] T. Fawcett and F. Provost, "Adaptive Fraud Detection," Data Mining and Knowledge Discovery, vol. 1(3), 1997, pp. 291-316.
[3] C. Elkan, "The foundations of cost-sensitive learning," Proc. of the International Joint Conference on Artificial Intelligence (IJCAI 01), 2001, pp. 973-978.
[4] P. Sarlin, "On policymakers' loss functions and the evaluation of early warning systems," TUCS Technical Report, No. 1054, Jun. 2012.
[5] T. Fawcett, "ROC graphs with instance-varying costs," Pattern Recognition Letters, vol. 27(8), 2006, pp. 882-891.
[6] J. Hollmén and M. Skubacz, "Input Dependent Misclassification Costs for Cost-Sensitive Classifiers," Proceedings of the International Conference on Data Mining, 2000.
[7] B. Zadrozny and C. Elkan, "Learning and making decisions when costs and probabilities are both unknown," Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 01), 2001, pp. 204-213.
[8] T. Fawcett, "PRIE: a system for generating rulelists to maximize ROC performance," Data Mining and Knowledge Discovery, vol. 17(2), 2008, pp. 207-224.
[9] T. Kohonen, Self-Organizing Maps, 3rd edition, Berlin: Springer-Verlag, 2001.
[10] P. Sarlin, "Data and Dimension Reduction for Visual Financial Performance Analysis," TUCS Technical Report, No. 1049, May 2012.
[11] G. Barreto, "Time series prediction with the self-organizing map: A review," in Perspectives on Neural-Symbolic Integration, P. Hitzler and B. Hammer, Eds., Berlin: Springer-Verlag, 2007, pp. 135-158.
[12] P. Sarlin, "Self-Organizing Time Map: An Abstraction of Temporal Multivariate Patterns," Neurocomputing, in press.
[13] S. Kaski, T. Honkela, K. Lagus and T. Kohonen, "WEBSOM--self-organizing maps of document collections," Neurocomputing, vol. 21, 1998, pp. 101-117.
[14] G. Chappell and J. Taylor, "The temporal Kohonen map," Neural Networks, vol. 6, 1993, pp. 441-445.
[15] T. Kohonen, "Things you haven't heard about the Self-Organizing Map," Proceedings of the International Conference on Neural Networks (ICNN 93), 1993, pp. 1147-1156.
[16] K. Y. Kim and J. B. Ra, "Edge preserving vector quantization using self-organizing map based on adaptive learning," Proc. of the International Joint Conference on Neural Networks (IJCNN 93), vol. 11, IEEE Press, 1993, pp. 1219-1222.
[17] J. Kangas, "Sample weighting when training self-organizing maps for image compression," Proceedings of the 1995 IEEE Workshop on Neural Networks for Signal Processing, 1995, pp. 343-350.
[18] J. Vesanto, J. Himberg, E. Alhoniemi and J. Parhankangas, "Self-Organizing Map in Matlab: the SOM Toolbox," Proceedings of the Matlab DSP Conference, 1999, pp. 35-40.
[19] Z. Yao, P. Sarlin, T. Eklund and B. Back, "Combining Visual Customer Segmentation and Response Modeling," Proceedings of the European Conference on Information Systems (ECIS 12), Jun. 2012.
[20] P. Sarlin and T. Peltonen, "Mapping the State of Financial Stability," ECB Working Paper, No. 1382, Sept. 2011.
[21] J.-C. Fort, P. Letrémy and M. Cottrell, "Advantages and drawbacks of the Batch Kohonen algorithm," Proceedings of the European Symposium on Artificial Neural Networks (ESANN 02), Springer-Verlag, 2002, pp. 223-230.
[22] T. Kohonen, "The Hypermap Architecture," in Artificial Neural Networks, Vol. II, T. Kohonen, K. Mäkisara, O. Simula and J. Kangas, Eds., Amsterdam: Elsevier, 1991, pp. 1357-1360.
[23] M. Lo Duca and T. A. Peltonen, "Assessing Systemic Risks and Predicting Systemic Events," Journal of Banking & Finance, in press.
[24] A.-M. Fuertes and E. Kalotychou, "Early warning systems for sovereign debt crises: The role of heterogeneity," Computational Statistics and Data Analysis, vol. 51(2), Nov. 2006, pp. 1420-1441.
[25] M. Kumar, U. Moorthy and W. Perraudin, "Predicting emerging market currency crashes," Journal of Empirical Finance, vol. 10(4), Sep. 2003, pp. 427-454.
[26] C. M. Reinhart and K. S. Rogoff, "Is the 2007 US Sub-Prime Financial Crisis So Different? An International Historical Comparison," American Economic Review, vol. 98(2), 2008, pp. 339-344.
[27] C. M. Reinhart and K. S. Rogoff, "The Aftermath of Financial Crises," American Economic Review, vol. 99(2), 2009, pp. 466-472.
[28] C. Blake and C. Merz, UCI Repository of Machine Learning Databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html





ISBN 978-952-12-2802-5
ISSN 1239-1891

