GPU acceleration for the pricing of the CMS spread option
Qasim Nasar-Ullah
University College London
Gower Street
London, United Kingdom
[email protected]
ABSTRACT
This paper presents a study on the pricing of a financial derivative using parallel algorithms which are optimised to run on a GPU. Our chosen financial derivative, the constant maturity swap (CMS) spread option, has an associated pricing model which incorporates several algorithmic steps, including: evaluation of probability distributions, implied volatility root-finding, integration and copula simulation. The novel aspects of the analysis are: (1) a fast new accurate double precision normal distribution approximation for the GPU (based on the work of Ooura), (2) a parallel grid search algorithm for calculating implied volatility and (3) an optimised data and instruction workflow for the pricing of the CMS spread option. The study is focused on 91.5% of the runtime of a benchmark (CPU based) model and results in a speed-up factor of 10.3 when compared to our single-threaded benchmark model. Our work is implemented in double precision using the NVIDIA GF100 architecture.
Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: Parallel Programming;
G.1.2 [Approximation]: Special function approximations;
G.3 [Probability and Statistics]: Probabilistic algorithms (including Monte Carlo)
General Terms
Algorithms, Performance
Keywords
GPU, Derivative pricing, CMS spread option, Normal distribution, Parallel grid search
1. INTRODUCTION
Modern graphics processing units (GPUs) are high throughput devices with hundreds of processor cores. GPUs are able to launch thousands of threads in parallel and can be configured to minimise the effect of memory and instruction latency by an optimal saturation of the memory bus and arithmetic pipelines. Certain algorithms configured to the GPU are thought to offer speed performance improvements over existing architectures. In this study we examine the application of GPUs for pricing a constant maturity swap (CMS) spread option.
The CMS spread option, a commonly traded fixed income derivative, makes payments to the option holder based on the difference (spread) between two CMS rates C1, C2 (e.g. the ten and two year CMS rates). Given a strike value K, a CMS spread option payoff can be given as [C1 - C2 - K]^+, where [·]^+ = max(·, 0). The product makes payments, based on the payoff equation, to the holder at regular intervals (e.g. three months) over the duration of the contract (e.g. 20 years). The CMS rates C1, C2 are recalculated at the start of each interval and the payoff is made at the end of each interval.
Prior to discussing our GPU based CMS spread option model in Sections 4 and 5, we use Sections 2 and 3 to present two algorithms that are used within our model. Section 2 presents an implementation of the standard normal cumulative distribution function based on the work of Ooura [15]. The evaluation of this function is central to numerous problems within computational finance and dominates the calculation time of the seminal Black Scholes formula [4]. We compare our algorithm to other implementations and discuss sources of performance gain; we also comment on the accuracy of our algorithm. In Section 3 we present a GPU algorithm that evaluates implied volatility through a parallel grid search. The calculation of implied volatility is regarded as one of the most common tasks within computational finance [12]. Our method is shown to be robust and is suited to the GPU when the number of implied volatility evaluations is of order 100 or less. In Section 4 we present a short mathematical model for the pricing of CMS spread options. In Section 5 we present our GPU based implementation, providing various example optimisations alongside a set of performance results.
2. NORMAL DISTRIBUTION FUNCTION ON THE GPU
The calculation of the standard normal cumulative distribution function, or normal CDF, occurs widely in computational finance. The normal CDF, Φ(x), can be expressed as:

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^{2}/2}\,dy,   (1)

and is typically calculated from a numerical approximation of the error function erf(x) and the complementary error function erfc(x), which are shown in Figure 1. Approximations for erf(x) and erfc(x) are often restricted to positive values of x, which are related to Φ(x) by:
\Phi(+x) = \tfrac{1}{2}\left[1 + \mathrm{erf}\!\left(\tfrac{x}{\sqrt{2}}\right)\right] = \tfrac{1}{2}\left[2 - \mathrm{erfc}\!\left(\tfrac{x}{\sqrt{2}}\right)\right],   (2)

\Phi(-x) = \tfrac{1}{2}\left[1 - \mathrm{erf}\!\left(\tfrac{x}{\sqrt{2}}\right)\right] = \tfrac{1}{2}\,\mathrm{erfc}\!\left(\tfrac{x}{\sqrt{2}}\right).   (3)

Figure 1: Overlaying the functions erf(x), erfc(x) alongside the normal CDF Φ(x). Due to the symmetry of these functions, algorithms typically restrict actual evaluations to positive values of x.
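For reference, the relations in (2) and (3) can be collapsed into the single branchless identity Φ(x) = ½ erfc(−x/√2), which holds for all x at the cost of one low throughput erfc evaluation. The sketch below is not the ONORM algorithm of Appendix A but a minimal baseline, assuming only the standard CUDA math library, against which a faster approximation can be validated:

#include <math.h>

// Reference normal CDF built directly from (2)/(3):
// Phi(x) = 0.5 * erfc(-x / sqrt(2)) for all x.
__device__ double phi_reference(double x)
{
    const double inv_sqrt2 = 0.7071067811865475244008;   // 1/sqrt(2)
    return 0.5 * erfc(-x * inv_sqrt2);
}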
The majority of normal CDF algorithms we surveyed approximate the inner region of x (close to x = 0) using erf(x) and the outer region of x using erfc(x); this minimises cancellation error. The optimum branch point separating the inner and outer regions and minimising cancellation error is x ≈ 0.47, the intersection of erf(x) and erfc(x) shown in Figure 1. The algorithms implemented within our study are listed in Table 1 and have absolute accuracy better than 10^-16.
An algorithmic analysis of various common approximations [1, 5, 10, 11] highlights areas of expected performance loss when implemented within GPUs. Firstly, the approximations are rational and so utilise at least one instance of the division operator (which has low throughput on the current generation of GPUs). The common presence of rational approximations, for example Padé approximants, stems from their superior numerical efficiency on traditional architectures [7]. Secondly, due to separate approximations over the range of x, the GPU may sequentially evaluate each approximation (known as branching) if the thread execution vector (known as a 'warp' of 32 threads in current NVIDIA architectures) contains values of x in different approximation regions. An algorithm falling into this class is the Cody algorithm [5], which is also considered a standard implementation within financial institutions and is used to benchmark our results.
Within our survey we identify the Ooura error function derf and complementary error function derfc [15] as being particularly suited to the GPU.
The Ooura error function derf is based on polynomial (as opposed to rational) approximations where each approximation utilises high throughput multiplication and addition arithmetic only. The algorithm uses two explicit if branches, each having access to five sets of coefficients. As a result the algorithm consists of ten separate polynomial approximations operating on ten distinct regions of x; the ten regions can be seen in Figure 2. Having hard-coded polynomial coefficients, as opposed to storing them in another memory type, offered the best performance. It is worthwhile to note that the addition of a single exponential function exp or a single division operation increased the execution time of a single derf approximation by around 50% and 25% respectively.
Figure 2: The different polynomial approximations within the Ooura error function derf over x. We observe two explicit branches (each having five sub-branches). The domain of x will be expanded by a factor of √2 ≈ 1.4 when evaluating the normal CDF due to the transformation in (2) and (3).
In contrast, the Ooura complementary error function derfc uses a single polynomial approximation across the entire range of x, whilst utilising two instances of low throughput operations (namely exp and a division). We were unable to find a more parsimonious representation (in terms of low throughput operations) of the complementary error function within our survey.
We formulate a hybrid algorithm called ONORM to calculate the normal CDF (listed in Appendix A). The algorithm uses the innermost approximation of derf to cover the inner region (±√2) ≈ (±1.4) (where (±d) denotes −d < x < d) and derfc for the remaining outer region. The resulting branch point of x = 1.4 is greater than the optimum branch point of x = 0.47 and was chosen to maximise the interval evaluated by the higher performance derf approximation.
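As an illustration of how such a hybrid might be sanity-checked on the device, the kernel below compares onorm (Appendix A, assumed to be compiled into the same module) against the erfc based reference over a uniform grid and reduces the largest difference per thread block. It measures agreement with the CUDA library erfc rather than the arbitrary precision reference used later, and the fixed block size of 256 threads is an assumption of this sketch:

#include <math.h>

__device__ double onorm(double x);   // hybrid normal CDF from Appendix A

// One sample per thread; block-wide maximum absolute difference written out.
// Assumes blockDim.x == 256 (a power of two).
__global__ void onorm_max_diff(double lo, double step, double *block_max)
{
    __shared__ double diff[256];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    double x = lo + step * gid;
    double ref = 0.5 * erfc(-x * 0.7071067811865475244008);
    diff[threadIdx.x] = fabs(onorm(x) - ref);
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        __syncthreads();
        if (threadIdx.x < s)
            diff[threadIdx.x] = fmax(diff[threadIdx.x], diff[threadIdx.x + s]);
    }
    if (threadIdx.x == 0) block_max[blockIdx.x] = diff[0];
}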
Our results are shown in Table 1, within which we compare the effects of uniform inputs of x, which increment gradually to minimise potential branching, against random inputs of x, which are randomised for increased branching. Our results show that ONORM offers its peak performance when x is in the inner range of (±1.4). In this range it slightly outperforms derf (upon which ONORM is based) due to fewer control flow operations. Within our test samples we observe ONORM outperforming the Cody algorithm by factors ranging from 1.09 to 3.65.
Focusing on the random access results we see that when x is in (±10) ONORM performs slower than derfc due to each 'warp' executing both the inner derf and outer derfc approximations with probability ≈ 0.99. The performance difference is however limited since the inner derf approximation has around 45% of the cost of the derfc approximation. The use of random (against uniform) inputs significantly reduces the performance of Cody in the ranges (±1.4) and (±10) and of derf in the range (±10), highlighting the performance implications of branching.
Range of x                      (±0.2)             (±1.4)             (±10)
Access                      Random  Uniform    Random  Uniform    Random  Uniform
derf                          4.72     4.75      4.81     4.82      1.05     4.55
derfc                         2.22     2.24      2.23     2.24      2.23     2.24
Phi                           2.40     2.75      1.34     1.84      1.08     1.21
NV                            5.33     5.34      5.35     5.37      1.54     2.15
Cody                          4.36     4.36      1.34     2.39      0.96     1.63
ONORM                         4.77     4.81      4.88     4.83      1.71     2.28
ONORM speed-up vs. Cody (×)   1.09     1.10      3.65     2.02      1.78     1.40

Table 1: Calculations per second [×10^9] for the Ooura derf, Ooura derfc, Marsaglia Phi, NV, Cody and our ONORM algorithm. Active threads per possible active threads or 'occupancy' fixed at 0.5. Uniform access attempts to minimise possible branching; Random access attempts to maximise possible branching. GPU used: NVIDIA M2070.
Figure 3: Absolute error AE of the derf algorithm for the range (±2).
We also comment on the Marsaglia Phi algorithm, which is based on an error function approximation [13]. It is a branchless algorithm which involves the evaluation of a Taylor series about the origin. A conditional while loop adds additional polynomial terms as x moves away from the origin. Within our GPU implementation we add a precalculated array of Taylor coefficients to eliminate all division operations. The algorithm performs single digit numbers of iterations close to the origin, with the iteration count growing exponentially as x moves towards the tails. We found that despite having extremely few iterations close to the origin, performance is limited by the presence of a single exp function. Our results also indicate that the Marsaglia algorithm is always dominated by the derf function (unless extensive branching occurs) and can perform at least as fast as the derfc function when x is within (±0.2).
A comparison is also made against the latest NVIDIA CUDA 4.1 implementation [14] of the error and complementary error functions: NV-Erf and NV-Erfc. As per the ONORM algorithm we can craft a hybrid algorithm NV which uses the innermost NV-Erf approximation for the inner region (±1.4) (consisting of multiplication and addition arithmetic only) and the branchless NV-Erfc approximation (consisting of three low throughput functions: an exp and two divisions) in the outer region. The inner approximation is more efficient than ONORM due to a smaller polynomial, whereas the outer approximation uses an additional division operation, yielding a loss in performance.
Figure 4: Absolute error AE of the derfc algorithm for the range (±2).
The accuracy of our GPU approximations can be measured against an arbitrary precision normal CDF function implemented within Mathematica, CDF[NormalDistribution[]], referred to as Φ_actual(x). We measure absolute accuracy as:

AE = \Phi(x) - \Phi_{\mathrm{actual}}(x),   (4)

and relative accuracy (which is amplified for increasingly negative values of x) as:

RE = \frac{\Phi(x)}{\Phi_{\mathrm{actual}}(x)} - 1.   (5)
The ONORM branches combine to reduce cancellation error by using the inner region of derf and the outer region of derfc; this can be observed in Figures 3 and 4. However, having chosen our branch point as x = 1.4 rather than x = 0.47, we observe in the range −1.4 < x < −0.47 a small increase in the ONORM maximum relative error, to 3.23 × 10^-16.
We also compare the accuracy of ONORM against the NV and Cody algorithms. Comparative relative error plots are shown in Figures 5 and 6. Over the range −22 < x < 9 the NV, Cody and ONORM algorithms exhibit comparable relative error.
As seen in Figure 6, within the inner region (±1.4), ONORM was inferior to both NV and Cody.
Figure 5: Maximum bound of relative error RE for the NV, Cody and ONORM algorithms for the range −22 < x < 9. Lower values depict higher relative error.
Figure 6: Maximum bound of relative error RE for the NV, Cody and ONORM algorithms for the range (±2). Lower values depict higher relative error.
It is apparent therefore that the inner region of ONORM can be improved in terms of speed and accuracy by utilising the error function approximation within NV-Erf. The maximum absolute errors of NV, Cody and ONORM were all less than 1.55 × 10^-16.
3. IMPLIED VOLATILITY ON THE GPU
The evaluation of Black [3] style implied volatility occurs in many branches of computational finance. The Black formula (which is closely related to the seminal contribution in derivative pricing, the Black Scholes formula [4]) calculates an option price V as a function V(S, K, σ, T), where S is the underlying asset value, K is the strike value, σ is the underlying asset volatility and T is the time to maturity. The implied volatility calculation is based on a simple formula inversion where the implied volatility σ_i is now a function σ_i(V_m, S, K, T), where V_m is an observed option price. Due to the absence of analytic methods to calculate implied volatility, an iterative root-finding method is typically employed to find σ_i such that:

V(S, K, \sigma_i, T) - V_m = 0.   (6)
The function V(·) appears well suited to efficient root finding based on Newton's method.
                 Thread block size Q
δ          BS     16     32     64    128    512
10^-6      27      7      6      5      4      3
10^-8      34      9      7      6      5      4
10^-10     40     10      8      7      6      5
10^-12     47     12     10      8      7      6
10^-14     54     14     11      9      8      6

Table 2: Iterations needed, N_g, to calculate implied volatility based on a parallel grid search. Domain size d = 100; δ represents the search accuracy. BS represents binary search.
For example, the function is monotonically increasing in σ, it has a single analytic inflexion point with respect to σ and it has an analytic expression for its derivative with respect to σ (∂V/∂σ). However, within the context of implied volatility ∂V/∂σ can tend to 0, resulting in non-convergence [12]. Predominant methods to evaluate implied volatility are therefore typically based on Newton with bisection or Brent-Dekker algorithms [16], the latter being preferred. GPU based evaluation of these functions (particularly Brent-Dekker) can result in a loss of performance. Firstly, the high register usage of these functions is generally suboptimal on the light-weight thread architecture of GPUs. Secondly, the algorithms may execute in a substantially increased runtime due to conditional branch points coupled with an unknown number of iterations to convergence. Finally, numerous contexts in computational finance (including the CMS spread option model) are concerned with obtaining the implied volatility of small groups of options, hence single-thread algorithms such as Newton and Brent-Dekker can result in severe GPU under-utilisation (assuming sequential kernel launching).
We therefore develop a parallel grid search algorithm that has the following properties: it uses a minimal number of low throughput functions, it is branchless and executes in a fixed time frame, and it can be used to target an optimum amount of processor utilisation when the number of implied volatility evaluations is low.
The parallel grid search algorithm operates on the following principles: we assume the domain of σ_i is of size d and the required accuracy of σ_i is δ. The required accuracy can be guaranteed by searching over U units, where

U = d/\delta.   (7)

Using a binary search method (which halves the search interval with each iteration) the number of required iterations N_b is given by:

N_b = \lceil \log_2 U \rceil,   (8)

where \lceil\cdot\rceil is the ceiling function and \lceil x \rceil is the smallest integer i such that i ≥ x.
Alternately a parallel grid search can be employed using a GPU 'thread block' with Q threads and Q search areas (a thread block permits groups of threads to communicate via a 'shared memory' space). Using a parallel grid search, the number of required iterations N_g is given by:

N_g = \lceil \log_Q U \rceil = \lceil \log_2 U / \log_2 Q \rceil.   (9)
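For concreteness, the entries of Table 2 follow directly from (8) and (9); the small host-side helper below (illustrative only, not part of the pricing model) reproduces one row of the table for d = 100:

#include <cmath>
#include <cstdio>

// N_b = ceil(log2(d / delta)); N_g = ceil(log2(d / delta) / log2(Q)).
static int iters_binary(double d, double delta)
{
    return (int)std::ceil(std::log2(d / delta));
}
static int iters_grid(double d, double delta, int Q)
{
    return (int)std::ceil(std::log2(d / delta) / std::log2((double)Q));
}

int main()
{
    // d = 100, delta = 1e-12: prints "BS=47 Q=32:10 Q=128:7", matching Table 2.
    printf("BS=%d Q=32:%d Q=128:%d\n",
           iters_binary(100.0, 1e-12),
           iters_grid(100.0, 1e-12, 32),
           iters_grid(100.0, 1e-12, 128));
    return 0;
}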

Figure 7: Calculations per second of a parallel grid search implied volatility for various thread block sizes. The number of implied volatility calculations is equal to the number of thread blocks.
Figure 8: Time taken for the evaluation of a parallel grid search implied volatility for various thread block sizes. The number of implied volatility calculations is equal to the number of thread blocks.
The number of iterations needed against various thread block sizes Q to guarantee a given accuracy δ over a given domain size d can be estimated a priori, of which an example is shown in Table 2.
The parallel grid search algorithm will thus calculate the implied volatility of B options by launching B thread blocks, with each thread block having Q threads. It is noted that while numerous GPU algorithms seek to maximise thread execution efficiency (since the number of threads is fixed), we are primarily concerned with maximising thread block execution efficiency (since the number of B thread blocks or B options is fixed). We offer a brief outline of the key steps within a CUDA implementation:
1. Within our listing we use the following example parameters which can be passed directly from the CPU, where base is a preprocessed variable to enable the direct use of high throughput bitshifting on the GPU and where the thread block size blocksize (Q) is assumed to be a power of two:

left  = 1.0e-9;   // minimum volatility
delta = 1.0e-12;  // desired accuracy
base  = log2f(blocksize);
2. We begin a for loop over the total iterations iter as calculated in (9). The integer mult is used to collapse the grid size with each iteration; its successive values are Q^(iter-1), Q^(iter-2), ..., Q^1, Q^0. Subsequently we calculate our volatility guess vol (σ_i) and the error err as given by the left hand side of (6) by:
for (int i = 0; i < iter; i++)
{
    mult = 1 << (base * (iter - 1 - i));
    // vol guess over the entire interval
    vol = left + delta*mult*threadIdx.x;
    // calculation of error: V(vol, ...) - V_m
    err = price(vol, ...) - price_m;
The volatility guess now covers the entire search interval. If in (9) U is not a power of Q, our first iteration overstates our initial search size d (which extends out to the right of the left interval left). This does not affect the algorithm's accuracy and minimally affects performance due to a partially redundant additional iteration. Special care must be taken to ensure mult does not overflow due to excessive bitshifts; in our final implementation we avoided this problem by using additional integer multiples (e.g. mult2, mult3).
3. We found it optimal to declare separate shared memory arrays (prefixed sh_) to store the absolute error and the sign of the error. This prevents excessive usage of the absolute value function fabs within the reduction. A static index is also populated to provide the left interval location for the next iteration:

// absolute error
sh_err[threadIdx.x] = fabs(err);
// sign of error
sh_sign[threadIdx.x] = signbit(err);
// static index
sh_index[threadIdx.x] = threadIdx.x;
4. After a parallel reduction to compute the index of the minimum absolute error (stored in sh_index[0]; one possible form of this reduction is sketched after this listing), the left bracket is computed by checking the sign of the minimum error location using properties of the signbit function:

// V(vol, ...) - V_m >= 0
if (!sh_sign[sh_index[0]])
    left = left + (sh_index[0] - 1)*delta*mult;
// V(vol, ...) - V_m < 0
else left = left + sh_index[0]*delta*mult;
}
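The parallel reduction referred to in step 4 is a standard shared memory argmin. The fragment below shows one possible form, continuing the notation of the listing above and assuming (as in step 1) that the thread block size is a power of two; it is a sketch rather than a verbatim extract of our implementation:

// Shared memory argmin: after the loop, sh_index[0] holds the index of
// the thread whose guess produced the smallest absolute error.
for (int s = blockDim.x / 2; s > 0; s >>= 1)
{
    __syncthreads();
    if (threadIdx.x < s &&
        sh_err[threadIdx.x + s] < sh_err[threadIdx.x])
    {
        sh_err[threadIdx.x]   = sh_err[threadIdx.x + s];
        sh_index[threadIdx.x] = sh_index[threadIdx.x + s];
    }
}
__syncthreads();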
Our results showing the calculations per second of various thread block sizes are shown in Figure 7, where a number of effects are visible. Firstly, consider thread block sizes of 64, 96 and 128 threads. As we increase the number of options for which we compute implied volatility, the calculations per second become a strong linear function of the number of thread blocks launched, exhibited in the plateauing effect to the right of Figure 7. Secondly, consider thread block sizes of 16 and 32 threads. It is observed that these kernels maintain load imbalances where additional calculations can be undertaken at no additional cost. In our example peak performance was achieved by thread blocks of size 32. The lower peaks associated with 16 threads (as opposed to 32 threads) are a consequence of 16 threads requiring an additional two iterations as given by (9).
By studying Figure 8 we observe how load imbalances are linked to the 'number of passes' taken through the GPU. We introduce this phenomenon as follows: the GPU used in this study consisted of 14 multiprocessors each accommodating up to 8 thread blocks, thus a maximum of 14 × 8 = 112 thread blocks are active on the GPU at any given instant. In our implementation this was achieved by having 16, 32 and 64 threads per block, where we observe an approximate doubling of execution time as we vary from 112 to 113 total thread blocks. This is due to the algorithm scheduling an additional 'pass' through the GPU. Focusing on larger thread block sizes we see that a similar 'pass' effect is observed, limited to the left hand side of Figure 8. Due to the large differences in time, the number of passes should be carefully studied before implementing this algorithm. The number of passes can be estimated as follows:
\text{Number of passes} = \left\lceil \frac{B}{B_{mp}\, N_M} \right\rceil,   (10)

where B is the total number of thread blocks requiring evaluation, B_mp is the number of thread blocks scheduled on each multiprocessor and N_M is the total number of multiprocessors on the GPU. B_mp should be obtained by hardware profiling tools as the hardware may schedule different numbers of thread blocks per multiprocessor than would be expected by a static analysis of processor resources.
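As a worked illustration of (10), the helper below (host code, illustrative only) reproduces the 112 versus 113 block example discussed above, with B_mp taken as the 8 resident blocks per multiprocessor reported for our GPU:

#include <cstdio>

// Passes = ceil(B / (B_mp * N_M)); B_mp should come from profiling.
static int passes(int B, int B_mp, int N_M)
{
    return (B + B_mp * N_M - 1) / (B_mp * N_M);
}

int main()
{
    // 14 multiprocessors with 8 resident blocks each: 112 blocks fit in one
    // pass, 113 blocks spill into a second pass.
    printf("%d %d\n", passes(112, 8, 14), passes(113, 8, 14));   // prints "1 2"
    return 0;
}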
In order to provide a comparison against a single-thread algorithm on the GPU (that is, the number of implied volatility calculations is equal to the number of GPU threads launched) we implement a parsimonious representation of Newton's method (we preprocess input data to ensure convergence). Although the number of Newton iterations may vary substantially, by comparing Figures 8 and 9 we conclude that the parallel grid search algorithm is likely to offer a comparable runtime when the number of implied volatility evaluations is of order 100.
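For completeness, the kind of single-thread-per-option kernel used in this comparison can be sketched as below. This is a minimal Newton iteration on the undiscounted Black formula V = S·Φ(d1) − K·Φ(d2); the guard conditions, preprocessing and data layout of our actual comparison code are not reproduced, and the fixed initial guess and iteration count are assumptions of the sketch:

#include <math.h>

__device__ static double phi_erfc(double x)   // normal CDF via erfc
{
    return 0.5 * erfc(-x * 0.7071067811865475244008);
}

// One implied volatility per thread via Newton's method; inputs are assumed
// preprocessed so that the iteration converges (vega bounded away from zero).
__global__ void implied_vol_newton(const double *S, const double *K,
                                   const double *T, const double *Vm,
                                   double *vol, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double sigma = 0.2;                               // initial guess
    for (int it = 0; it < iters; ++it) {
        double srt   = sigma * sqrt(T[i]);
        double d1    = (log(S[i] / K[i]) + 0.5 * sigma * sigma * T[i]) / srt;
        double d2    = d1 - srt;
        double price = S[i] * phi_erfc(d1) - K[i] * phi_erfc(d2);
        double vega  = S[i] * sqrt(T[i])
                     * exp(-0.5 * d1 * d1) * 0.3989422804014327;   // S*pdf(d1)*sqrt(T)
        sigma -= (price - Vm[i]) / vega;
    }
    vol[i] = sigma;
}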
Within our parallel grid search algorithm the size of each thread block is equal to the size of the parallel grid Q. An alternate algorithm would instead accommodate multiple parallel sub-grids within a single thread block (with each sub-grid evaluating a single implied volatility). An increase in the number of sub-grids per thread block would result in a decrease in the number of blocks needed for evaluation whilst increasing the number of iterations and control flow needed. Such an approach is advantageous when dealing with large numbers of implied volatility evaluations. For instance, using Figure 8, when the number of evaluations is between 113 and 224 a thread block of 64 threads makes two passes, however if the thread block were split into two parallel sub-grids (of 32 threads) only one pass would be needed. Thus, in this instance, the total execution time would approximately halve. The use of a single-thread algorithm is ultimately an idealised version of this effect where each sub-grid is effectively reduced to a single thread, drastically reducing the number of passes, offset against increased time and complexity whilst evaluating on a single thread.
Figure 9: Time taken for the evaluation of ideal Newton gradient descent implied volatility; the lowest line represents five iterations, lines increment by one iteration, and the highest line represents 28 iterations. Results are obtained from the execution of a single thread block.
4. MATHEMATICAL MODEL FOR CMS
SPREAD OPTION PRICING
The CMS spread option price is calculated by firstly estimating stochastic processes to describe the two underlying CMS rates C1, C2 and secondly, using such stochastic processes to obtain (via a copula simulation) the expected final payoff [C1 - C2 - K]^+. These two steps are repeated for each interval start date and the option price is subsequently evaluated by summing the discounted expected payoffs relating to each interval payment date. We describe the model evaluation in more detail using the following steps:
1. As stated, we first require a stochastic process to describe the underlying CMS rate. The first step to obtaining this process is to calculate the CMS rate C itself using the put-call parity rule. This calculates the CMS rate C as a function of the price of a CMS call option (Call_K), CMS put option (Put_K) and a strike value K. We set K equal to an observable forward swap rate. Put-call parity results in the value of C as:

C = \mathrm{Call}_K - \mathrm{Put}_K + K.   (11)

In order to calculate the price of the CMS options (Call_K, Put_K) we follow a replication argument [8] whereby we decompose the CMS option price into a portfolio of swaptions R(k) which are evaluated at different strike values k (swaptions are options on a swap rate for which we have direct analytic expressions). The portfolio of swaptions can be approximated by an integral [8] of which the main terms are:
\mathrm{Call}_K \approx \int_{K}^{\infty} R(k)\,dk,   (12)

\mathrm{Put}_K \approx \int_{-\infty}^{K} R(k)\,dk.   (13)
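For illustration, the main term in (12) can be approximated by a simple quadrature over a truncated strike grid. The sketch below assumes a callable swaption pricer R(k) together with illustrative truncation and discretisation choices; the full replication in [8] involves additional terms which are omitted here:

#include <functional>

// Trapezoidal approximation of the replication integral (12),
// Call_K ~ integral of R(k) dk over strikes above K, truncated at k_max.
static double cms_call_replication(const std::function<double(double)> &R,
                                   double K, double k_max, int n_steps)
{
    double h = (k_max - K) / n_steps;
    double sum = 0.5 * (R(K) + R(k_max));
    for (int i = 1; i < n_steps; ++i) sum += R(K + i * h);
    return sum * h;
}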
2. With the CMS rate C captured we also require information regarding the volatility smile effect [9]. The volatility smile describes changing volatilities as a result of changes in an option's strike value K. We therefore evaluate CMS call options (Call_K) at various strikes surrounding C and calculate the corresponding option implied volatilities.
3. We calibrate a stochastic process incorporating the above strikes, prices and implied volatilities. We thus obtain unique stochastic processes, expressing the volatility smile effect, for each of the underlying rates C1, C2. The stochastic process is typically based on a SABR class model [9].
4. The price of a spread option contract based on two underlyings C1, C2 with the payoff [C1 - C2 - K]^+ is:

\int\!\!\int_{A} [C_1 - C_2 - K]^{+}\, f(C_1, C_2)\, dC_1\, dC_2,   (14)

where f(C1, C2) is a bivariate density function of both underlyings and A is the range of the given density function. Obtaining the bivariate density function is non-trivial and a standard approach is to instead calculate (14) using copula methods [6]. The copula method allows us to estimate the bivariate density function f(C1, C2) through a copula C. The copula is a function of the component univariate marginal distributions F1, F2 (which can be directly ascertained from our stochastic processes for C1, C2) and a dependency structure µ (for example a historical correlation between C1, C2). The price of a spread option can thus be given as:
\int_{0}^{1}\!\!\int_{0}^{1} \left[ F_1^{-1}(u_1) - F_2^{-1}(u_2) - K \right]^{+} C(u_1, u_2)\, du_1\, du_2,   (15)

where (u_1, u_2) are uniformly distributed random numbers on the unit square.
The integral is subsequently approximated by a copula simulation. This involves firstly obtaining a set of N two-dimensional uniformly distributed random numbers (u1, u2), secondly, incorporating a dependency structure between (u1, u2) and finally obtaining a payoff by using the inverse marginal distributions F1^-1, F2^-1 on (u1, u2). The final result is the average of the N simulated payoffs.
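A minimal sketch of one copula simulation step is given below. It assumes a Gaussian copula driven by pre-generated standard normal pairs (z1, z2) and a single correlation parameter rho, and that the inverse marginals F1^-1, F2^-1 are supplied as lookup tables uniform in probability space; the RNG strategy, table layout and parameter names are assumptions of this sketch rather than the model's implementation, and the averaging over the N payoffs is left to a subsequent reduction:

#include <math.h>

// Linear interpolation into a discretised inverse marginal: table[j] holds
// F^{-1}(j / (n - 1)) on a uniform grid in probability space.
__device__ double inv_marginal(const double *table, int n, double u)
{
    double pos = u * (n - 1);
    int    j   = min((int)pos, n - 2);
    double w   = pos - j;
    return (1.0 - w) * table[j] + w * table[j + 1];
}

// One Gaussian copula draw per thread: correlate the normals, map to uniforms
// through the normal CDF, invert the marginals and evaluate the spread payoff.
__global__ void copula_payoff(const double *z1, const double *z2,
                              const double *F1inv, const double *F2inv,
                              int n_table, double rho, double strike,
                              double *payoff, int n_sims)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_sims) return;
    double y1 = z1[i];
    double y2 = rho * z1[i] + sqrt(1.0 - rho * rho) * z2[i];
    double u1 = 0.5 * erfc(-y1 * 0.7071067811865475244008);   // normal CDF
    double u2 = 0.5 * erfc(-y2 * 0.7071067811865475244008);
    double c1 = inv_marginal(F1inv, n_table, u1);
    double c2 = inv_marginal(F2inv, n_table, u2);
    payoff[i] = fmax(c1 - c2 - strike, 0.0);
}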
5. GPU MODEL IMPLEMENTATION
The mathematical steps of the previous section consist of four dominant computational tasks (in terms of the time taken for computation). Based on these tasks it is instructive to relabel our model into the following stages:
Integration An integration as shown in (12) and (13), relating to steps 1 and 2 in Section 4.
Figure 10: Model flowchart for the CMS spread option (the Load Data and Calibration stages run on the CPU; the remaining stages run on the GPU).
Calibration A calibration to obtain CMS rate processes, relating to step 3 in Section 4. This is not implemented within our GPU model.
Marginals The creation of lookup tables that represent discretised univariate marginal distributions F1, F2 over an input range of C1, C2. This allows us to evaluate the inverse marginal distributions F1^-1, F2^-1 on (u1, u2) through an interpolation method, relating to step 4 in Section 4 (a sketch of such a lookup construction is given after this list).
Copula Simulation of the copula based on (15), relating to step 4 in Section 4.
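As referenced in the Marginals stage above, a discretised marginal F can be turned into an inverse lookup table that is uniform in probability space, which is the form assumed by the copula simulation sketch given at the end of Section 4. The kernel below is one possible construction, assuming F is supplied already evaluated and monotonically increasing on a uniform rate grid; names and layout are illustrative:

// Build a probability-space inverse table from a discretised marginal CDF:
// F[] holds the marginal evaluated on a uniform rate grid over [c_lo, c_hi];
// inv[j] approximates F^{-1}(j / (m - 1)). One thread per output entry.
__global__ void build_inverse_table(const double *F, int n,
                                    double c_lo, double c_hi,
                                    double *inv, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= m) return;
    double u = (double)j / (m - 1);
    int lo = 0, hi = n - 1;                   // binary search for F[lo] <= u < F[hi]
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        if (F[mid] <= u) lo = mid; else hi = mid;
    }
    double h = (c_hi - c_lo) / (n - 1);
    double w = (F[hi] > F[lo]) ? (u - F[lo]) / (F[hi] - F[lo]) : 0.0;
    w = fmin(fmax(w, 0.0), 1.0);              // clamp against edge effects
    inv[j] = c_lo + (lo + w) * h;             // linear interpolation in the bracket
}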
When presenting the timing results of each of the above stages we include the cost of CPU-GPU memory transfers. A flowchart describing the evaluation of the CMS spread option model is shown in Figure 10, within which we have an additional CPU operation (Load Data) which loads a set of market data for the calibration stage. This function is overlapped with part 2 of the integration stage, hence there is no time penalty associated with this operation. As a result of our GPU implementation we obtain performance results as shown in Table 3. Within our results we set the number of intervals or start dates t = 96; more generally t is in the range [40, 200]. The speed-up impact upon changing t is minimal since our benchmark model is a strong linear function of t. The calibration and copula stages account for 96.9% of the execution time within our final implementation. Results are based on an M2070 GPU and an Intel Xeon L5640 CPU with clock speed 2.26 GHz running on a single core. Compilation is under Microsoft Visual Studio with compiler flags for debugging turned on for both the CPU and GPU implementations. Preliminary further work suggests that the use of non-debug compiled versions results in a significantly larger proportional time reduction for the GPU implementation. This indicates that the stated final speed-up results are an underestimate.
        Time    Speed-up          Main Kernel Stats (%)
        (ms)      (×)         Time    Replays    L1 Hit
V1     15.41     22.87       38.01      5.47      54.12
V2      9.50     37.10       41.78      5.78      53.18
V3      7.25     48.59       21.43      0.13      92.80

Table 4: Integration results: V1 = Separate underlying evaluation, V2 = Combined underlyings, V3 = Preprocessing stage. 'Replays' and 'L1 hit' refer to local memory.
        Time    Speed-up          Main Kernel Stats (%)
        (ms)      (×)         Time    Replays    L1 Hit
V1      5.82     48.38       61.58     10.33      58.73
V2      4.76     59.17       75.70     10.33      58.73
V3      1.43    197.30       43.35      0.00      99.97
V4      1.35    207.84       40.33      0.04      90.77

Table 5: Marginals results: V1 = Separate underlying evaluation, V2 = Combined underlyings, V3 = Preprocessing stage, V4 = Optimum thread block size. 'Replays' and 'L1 hit' refer to local memory.
Within the context of our CMS model, the underestimate is considered negligible since the final speed-up is sufficiently close to Amdahl's [2] theoretical maximum. Our GPU model targets the integration, marginals and copula stages, accounting for 91.5% of the benchmark model runtime.
5.1 Optimisation examples and results
Our analysis focuses primarily on the integration and marginals stages as the copula stage offered fewer avenues of optimisation due to its relative simplicity. The 'main kernel', or the GPU function that dominates runtime, is similar for both the integration and marginals stages. The main kernel is respectively used for the pricing of swaptions and the calculation of marginal distributions, both of which require the evaluation of SABR [9] type formulae. The performance bottleneck within our main kernel was the high number of arithmetic instructions; this was determined through a profiling analysis and timing based on code removal.
Within the integration and marginals stages we identify a grid of size t × 2 × n, where t is the number of start dates, 2 is the number of underlyings and n is the size of the integration grid or the number of points within the discretised marginal distribution F. Within the grid of t × 2 × n we observe significant data parallelism and target this grid as a base for our GPU computation. For the integration stage we set n = 82; more generally n is in the range [50, 100]. For the marginals stage we set n = 512; more generally n is in the range [250, 1000]. Our choice of n is based on error considerations outside the scope of this paper. We briefly describe a set of optimisation steps that were relevant in the context of our model; corresponding results are shown in Tables 4 and 5:
1. In our first implementation (V1) we parallelise calculations on a grid of size t × n and sequentially evaluate each underlying. We use kernels of two sizes: Type I kernels are of size t (which launch on a single thread block with t threads) and Type II kernels are of size t × n (which launch on t thread blocks each with n threads). Within the GF100 architecture we are limited to 1024 threads per block, thus we must ensure that n and t are ≤ 1024.

Figure 11: An illustration of occupancy affecting kernel execution time, when the number of passes given by (10) is very low (< 3).
2. In our second implementation (V2) we combine both underlyings such that Type I kernels are now of size t × 2 (which launch on a single thread block with t × 2 threads) and Type II kernels are of size t × 2 × n (which launch on t thread blocks each with n × 2 threads). For Type I kernels we found the additional underlying evaluation incurred no additional cost due to under-utilisation of the processor, thus reducing Type I kernel execution times by approximately half (assuming sequential kernel launching).
For the main kernel in the integration stage (a Type II kernel) we found that a similar doubling of thread block sizing (from 82 to 164 threads) led to changed multiprocessor occupancy. As shown in Figure 11, small changes in processor occupancy can amplify performance differences when the number of passes given by (10) is small (this effect was also described in Section 3). As a consequence, we see in Table 6 a significant reduction in the execution time of our main kernel. Type II kernels in the marginals stage were not combined as this led to unfeasibly large thread block sizes and a loss of performance.
3. In our third implementation (V3) we undertook a preprocessing step to simplify the SABR type evaluations used by the main kernels. Although formulae will vary based on the particular SABR class model being employed, the inner-most code will typically involve several interdependent low throughput transformations of an underlying asset value or rate S, strike K, volatility σ and time to maturity T, in order to calculate a set of expressions of the form (as found in the Black [3] formula):

\Phi\!\left(\frac{\ln(S/K) + (\sigma^{2}/2)\,T}{\sigma\sqrt{T}}\right).   (16)
Stage          Benchmark   Time     Final     Time   Speed-up   Kernels
               Time (ms)    (%)   Time (ms)    (%)      (×)     Launched
Integration      352.45    12.30      7.25     2.60    48.59       45
Calibration      243.74     8.51    243.74    87.34     1.00      N/A
Marginals        281.49     9.82      1.35     0.49   207.84        6
Copula         1,987.40    69.37     26.73     9.58    74.34        1
Overall        2,865.08   100       279.08   100       10.27       52

Table 3: Benchmark and final performance results of the CMS spread option pricing model, where the number of intervals or start dates t = 96.
                                       V1        V2
Total evaluations, t × 2 × n        15744     15744
Thread block size                      82       164
Total blocks, B                       192        96
Occupancy                           0.313     0.375
Possible active blocks, B_mp × N_M     82        49
Passes needed by (10)                   3         2
Measured time reduction               N/A    -32.2%

Table 6: Time reduction of the integration stage 'main kernel' through changes in the number of passes.
Within each grid of size t × 2 × n, parameter changes are strictly driven by changes in a single variable K. Within our optimisation efforts we therefore move these interdependent transformations from the grid of size t × 2 × n to a smaller preprocessing kernel with grid size t × 2. Hence, assuming we were to evaluate (16), our preprocessing kernel would calculate the terms α = ln(S) and β = σ√T. This would result in the idealised computational implementation of (16) as:

normcdf((α − log(k))/β + 0.5·β).   (17)

As such the inner-most code conducts significantly fewer low throughput operations, resulting in a performance gain (a minimal sketch of such a preprocessing kernel is given at the end of this item). Such a preprocessing step can also be used to improve CPU performance.
Within the GPU implementation, the preprocessing approach increased the amount of data needed by kernels from high latency global memory. However, since our main kernels have arithmetic bottlenecks, the additional memory transactions had little effect on computation time. This is in contrast to numerous GPU algorithms which have memory bottlenecks and for which additional memory transactions are likely to affect computation time. As a result of our preprocessing we observed a large reduction in the local memory replays and an increase in the L1 local hit ratio, which we define below:
High levels of complex computation can often result in a single GPU thread running out of allocated registers used to store intermediate values. An overflow of registers is assigned to the local memory space which has a small L1 cache; misses to this cache result in accesses to higher latency memory. Within our results we therefore wish to maximise the L1 local hit ratio - that is, the proportion of loads and stores to local memory that reside within the L1 cache.
        Time (ms)   Speed-up (×)
Ver 1       4.62         430.27
Ver 2      18.51         107.39
Ver 3      12.03         165.22
Ver 4      26.73          74.34

Table 7: Copula results: Ver 1 = Inverse normal CDF only, Ver 2 = Ver 1 + normal CDF, Ver 3 = Ver 1 + interpolation, Ver 4 = All components.
However, if the number of instructions associated with local memory cache misses is insignificant in comparison to the total number of instructions issued, we can somewhat ignore the effect of a low L1 local hit ratio. Therefore we also measure the local memory replays - that is, the number of local memory instructions that were caused by misses to the L1 cache as a percentage of the total instructions issued.
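A minimal sketch of the preprocessing described in this item is given below: one kernel over the t × 2 grid computes the strike independent terms of (16), after which the main kernel over t × 2 × n only needs the simplified expression (17) per strike. Array layout and names are illustrative rather than the model's own:

#include <math.h>

// Preprocessing kernel over the t x 2 grid: alpha = ln(S), beta = sigma*sqrt(T)
// are computed once per (start date, underlying) pair.
__global__ void preprocess_black_terms(const double *S, const double *sigma,
                                       const double *T,
                                       double *alpha, double *beta, int n2t)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n2t) return;
    alpha[i] = log(S[i]);
    beta[i]  = sigma[i] * sqrt(T[i]);
}

// Corresponding inner evaluation in the main kernel, i.e. expression (17):
__device__ double black_d1_cdf(double alpha, double beta, double k)
{
    double d1 = (alpha - log(k)) / beta + 0.5 * beta;
    return 0.5 * erfc(-d1 * 0.7071067811865475244008);   // normal CDF of d1
}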
4. In our fourth implementation (V4) of the marginals stage we experimented with different sizes of thread blocks; as a result we obtained a small reduction in main kernel execution times (to 88% of V3). Integration stage kernels did not benefit from further optimisation.
As a result of our integration and marginals stage optimisations we observed speed-ups of 48.59× and 207.84× respectively against our benchmark implementation.
Within the copula stage we targeted a parallel grid consisting of the number of simulations N, which we launched sequentially for each start date t. The sequential evaluation was justified as N is a very large multiple of the total number of possible parallel threads per GPU. Within this stage we were limited by two numerical tasks: (1) the evaluation of the normal CDF and (2) a linear interpolation. The extent of these limitations is shown in Table 7, which presents timing results for different versions which implement only part of the algorithm. The evaluation of the normal CDF dominated and was conducted by the ONORM algorithm presented in Section 2. In regards to the interpolation, we were unable to develop algorithms that suitably improved performance. In particular we were unable to benefit from hardware texture interpolation, which is optimised for single precision contexts. Our final copula implementation resulted in a speed-up of 74.3×. The final GPU based model obtained an overall 10.27× speed-up, which is 87.3% of Amdahl's [2] maximum theoretical speed-up of 11.76×.
6. CONCLUSIONS
Calculation of the normal CDF through our proposed ONORM algorithm is well suited to the GPU architecture. ONORM exhibits comparable accuracy against the widely adopted Cody algorithm whilst being faster, thus it is likely to be the algorithm of choice for double precision GPU based evaluation of the normal CDF. The algorithm can be further improved by using the NV-Erf algorithm in the inner range.
Our parallel grid search implied volatility algorithm is applicable to GPUs when dealing with small numbers of implied volatility evaluations. The algorithm is robust, guarantees a specific accuracy and executes in a fixed time frame. For larger groups of options, the algorithm is unsuitable as computation time will grow linearly at a much faster rate than GPU alternatives which use a single thread per implied volatility calculation.
Within our GPU based CMS spread option model we highlighted the importance of managing occupancy for kernels with low numbers of passes, whilst also obtaining a particular performance improvement through the use of preprocessing kernels. In our experience industrial models do not preprocess functions due to issues such as enabling maintenance and reducing obfuscation, an idea which needs to be challenged for GPU performance.
Further work will consider calibration strategies, traditionally problematic on GPUs due to the sequential nature of calibration algorithms (consisting of multi-dimensional optimisation). Further work will also consider the wider performance implications of GPU algorithms within large pricing infrastructures found within the financial industry.
7. ACKNOWLEDGEMENTS
I acknowledge the assistance of Ian Eames, Graham Barrett and anonymous reviewers. The study was supported by
the UK PhD Centre in Financial Computing, as part of the
Research Councils UK Digital Economy Programme, and
BNP Paribas.
8. REFERENCES
[1] A. Adams. Algorithm 39: Areas under the normal curve. Computer Journal, 12(2):197-198, 1969.
[2] G. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, pages 483-485, 1967.
[3] F. Black. The pricing of commodity contracts. Journal of Financial Economics, 3(1-2):167-179, 1976.
[4] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637-654, 1973.
[5] W. Cody. Rational Chebyshev approximations for the error function. Mathematics of Computation, 23(107):631-637, 1969.
[6] S. Galluccio and O. Scaillet. CMS spread products. In R. Cont, editor, Encyclopedia of Quantitative Finance, volume 1, pages 269-273. John Wiley & Sons, Chichester, UK, 2010.
[7] C. Gerald and P. Wheatley. Applied Numerical Methods. Addison-Wesley, Reading, MA, 2004.
[8] P. Hagan. Convexity conundrums: pricing CMS swaps, caps and floors. Wilmott Magazine, 4:38-44, 2003.
[9] P. Hagan, D. Kumar, and A. Lisniewski. Managing smile risk. Wilmott Magazine, 1:84-108, 2002.
[10] J. Hart, E. Cheney, C. Lawson, H. Maehly, C. Mesztenyi, J. Rice, H. Thacher Jr, and C. Witzgall. Computer Approximations. John Wiley & Sons, New York, NY, 1968.
[11] I. Hill. Algorithm AS 66: The normal integral. Applied Statistics, 22(3):424-427, 1973.
[12] P. Jackel. By implication. Wilmott Magazine, 26:60-66, 2006.
[13] G. Marsaglia. Evaluating the normal distribution. Journal of Statistical Software, 11(5):1-11, 2004.
[14] NVIDIA Corp. CUDA Toolkit 4.1. [Online] Available: http://developer.nvidia.com/cuda-toolkit-41, 2012.
[15] T. Ooura. Gamma / error functions. [Online] Available: http://www.kurims.kyoto-u.ac.jp/~ooura/gamerf.html, 1996.
[16] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical Recipes. Cambridge University Press, New York, NY, 2007.
APPENDIX
A. LISTING OF THE ONORM ALGORITHM
/* Based on the derf and derfc functions of
   Takuya OOURA (email: ooura@mmm.t.u-tokyo.ac.jp)
   http://www.kurims.kyoto-u.ac.jp/~ooura/gamerf.html */

__device__ inline double onorm(double x)
{
    double t, y, u, w;
    x *= 0.7071067811865475244008;    /* x / sqrt(2) */
    w = x < 0 ? -x : x;
    if (w < 1)
    {
        /* inner region: Ooura derf innermost polynomial (multiply-add only) */
        t = w * w;
        y = ((((((((((((5.958930743e-11 * t - 1.13739022964e-9)
            * t + 1.466005199839e-8) * t - 1.635035446196e-7)
            * t + 1.6461004480962e-6) * t - 1.492559551950604e-5)
            * t + 1.2055331122299265e-4) * t - 8.548326981129666e-4)
            * t + 0.00522397762482322257) * t - 0.0268661706450773342)
            * t + 0.11283791670954881569) * t - 0.37612638903183748117)
            * t + 1.12837916709551257377) * w;
        y = 0.5 + 0.5 * y;
    }
    else
    {
        /* outer region: Ooura derfc polynomial (one exp, one division);
           the sign of x is flipped so the final selection below is reversed */
        x = -x;
        t = 3.97886080735226 / (w + 3.97886080735226);
        u = t - 0.5;
        y = ((((((((((((((((((((((0.00127109764952614092 * u
            + 1.19314022838340944e-4) * u - 0.003963850973605135) * u
            - 8.70779635317295828e-4) * u + 0.00773672528313526668) * u
            + 0.00383335126264887303) * u - 0.0127223813782122755) * u
            - 0.0133823644533460069) * u + 0.0161315329733252248) * u
            + 0.0390976845588484035) * u + 0.00249367200053503304) * u
            - 0.0838864557023001992) * u - 0.119463959964325415) * u
            + 0.0166207924969367356) * u + 0.357524274449531043) * u
            + 0.805276408752910567) * u + 1.18902982909273333) * u
            + 1.37040217682338167) * u + 1.31314653831023098) * u
            + 1.07925515155856677) * u + 0.774368199119538609) * u
            + 0.490165080585318424) * u + 0.275374741597376782)
            * t * 0.5;
        y = y * exp(-x * x);
    }
    return x < 0 ? 1 - y : y;
}