Credit Risk Evaluation of Online Personal Loan Applicants: a Data Mining approach

Published on April 2019

h oc T: 267.616.1444 F: 1.905.761.1006 9 Flagstaff Place Philadelphia, PA 19115 USA

t T: 647.271.1932 F: 905.761.1006 150 borrows Street Thornhill, ON L4J 2W8 Canada

info@isolutions.us

www.isolutions.us

www.bISolutions.us www.bISolutions .us | Data. Knowledge. Action.

WHITE PAPER: Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach

September 2008
Pavel Brusilovskiy, Business Intelligence Solutions
David Johnson, Strategic Link Consulting

Business Intelligence Solutions – White Paper: Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach | www.isolutions.us | Data. Knowledge. Action.

ic  This white paper is the result of joint work between Business Intelligence Solutions (BIS) and Strategic Link Consulting (SLC). Business Intelligence Solutions (www.bisolutions.us (www.bisolutions.us)) is a well established statistical/data mining/GIS company that conducts business for banking, finance, insurance and other industries. Our specialization is complex unstructured business problems for data rich firms. Our multidisciplinary team includes professionals in applied statistics, data mining, optimization and simulation, GIS, and software application development. The team members are authors of more than 100 published papers on diverse applications of data mining and other quantitative fields. BIS has access to the best statistical, visualization, data mining and GIS software on the world market. The essence of our approach is to understand and analyze our client’s client’s business problem and corresponding data through the prism of dissimilar statistical/data mining models. As a result, we are always able to produce the best possible model and help our clients in the most effective and scientifically sound way way.. Strategic Link Consulting (www.strategiclinkconsulting.com (www.strategiclinkconsulting.com)) represents multiple online personal loan clients within the sub-prime lending industry. industry. Loan amounts vary based on the customer’s customer’s income as a primary determinant of their ability to pay. Returning customers are eligible for larger loans with more stringent income requirements. The interest rate is a non-negotiable flat rate based on the duration of the loan. Returning customers are offered larger loans with lower fees. Payment schedules are derived from customer pay frequency (weekly, (weekly, bi-weekly, bi-weekly, semi-monthly or monthly). Customers may pay in full on their due date or refinance by paying either a portion of the principle or only the fee as allowed by applicable laws. 
Customers qualify for a loan after completing a waterfall of underwriting phases which consists of internal fraud and duplication checks, identity verification and external credit checks (not Trans Union, Equifax or Experian). These steps produce a score, similar to a FICO score, which determines if a customer is approved or denied based on adverse data components derived from their external data sources. The funding/origination of a loan is based on a verbal verification process that includes several manual steps including contacting the customer directly. directly.

Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting


objc, g,  Pbm sm  As a rule, a lender must decide whether to grant credit to a new applicant. The methodology and techniques that provide the answer to this question is called credit scoring. This white paper is dedicated to the development of credit scoring models for online personal loans.  Taking into account the non-linearity of the relationship between overall customer risk and predictors, the  Taking primary objective is to develop a non-parametric and non-linear credit scoring model within data mining paradigm that will predict overall customer risk with maximum possible accuracy. This objective implies several goals: 1. Create a regression regression type credit scoring model model that predicts overall customer risk on a 100 point scale, using the binary assessment of customer risk  good  ( customer/   bad  customer). 2. Identify the importance importance of the predictors, predictors, and the drivers of being a good customer in order to separate good behavior from bad. 3. Develop the basis for a customer segmentation segmentation model that uses overall customer risk assessment to predict high (H), medium (M) and low (L) risk customers. 4. Show the fruitfulness fruitfulness of the synergy synergy of credit scoring modeling and Geographical Information Systems (GIS).  The outcome of the regression scoring model can be treated as the probability of being a good customer.  The segmentation rule depends on two positive thresholds h1 and h2, h2< h1<1. If for a given customer the probability of being a good customer is greater than  h1, where h1 is a large enough threshold (e.g., 0.75), then the customer belongs to the low risk segment. If, however, the probability of being a good customer is less than h1 but greater than h2 (e.g., h2=0.5), then the customer belongs to the medium risk segment. Finally, if the probability that the customer is a good customer is less than h2, he belongs to the high risk segment. 
The thresholds h1 and h2 should be provided by SLC, or their optimal values can be determined by BIS by minimizing the corresponding cost matrix.

Risk scoring is a tool that is widely used to evaluate the level of credit risk associated with a customer. While it does not identify “good” (no negative behavior) or “bad” (negative behavior expected) applicants on an individual basis, it provides the statistical odds, or probability, that an applicant with any given score will be “good” or “bad” (6, p.5). Scorecards are viewed as a tool for better decision making.

There are two major types of scorecards: traditional and non-traditional. The first one, in its simplest form, consists of a group of “attributes” that are statistically significant in separating good and bad customers. Each attribute is associated with some score, and the total score for an applicant is the sum of the scores for each attribute present in the scorecard for that applicant.
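To make the additive structure of a traditional scorecard concrete, here is a small Python sketch. The attribute names and point values are invented purely for illustration and do not come from any SLC scorecard.

```python
# Hypothetical attributes and points, for illustration only.
SCORECARD = {
    "age_30_or_older": 20,
    "monthly_income_2000_or_more": 35,
    "bank_verification_completed": 25,
}


def total_score(applicant_attributes, scorecard=SCORECARD):
    """Sum the points of every scorecard attribute the applicant has."""
    return sum(
        points
        for attribute, points in scorecard.items()
        if attribute in applicant_attributes
    )
```

An applicant who has the first and third attributes would receive 20 + 25 = 45 points under this made-up scorecard.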


Traditional scorecards have several advantages (6, p.26-27):

• easy to interpret (there is no requirement for Risk Managers to know in depth statistics or data mining);
• easy to explain to a customer why an application was rejected;
• the scorecard development process is transparent (not a black box) and is widely understood;
• scorecard performance is easy to evaluate and monitor.

The disadvantage of traditional scorecards is their accuracy. As a rule, non-traditional scorecards (which can be represented as a data mining non-linear and non-parametric logistic regression) outperform traditional scorecards. Since each percent gained in credit assessment accuracy can lead to huge savings, this disadvantage is crucial for credit scoring applications. Modern technology allows us to easily apply a very complex data mining scoring model to new applicants, and to dramatically reduce the misclassification rate for Good/Bad customers.

This white paper is dedicated to non-traditional scorecard development within a data mining paradigm.


d sc  This study is based on the SLC sample of 5,000 customers, including 2,500 Good customers and 2,500 Bad customers, one record per customer. According to the rule of thumb (6, p. 28), one should have at least 2,000 bad and 2,000 good accounts within a defined time frame in order to get a chance to develop a good scorecard. Therefore, in principle, the given sample of accounts is suitable for scorecard development. Each record can be treated as a data point in a high-dimensional space (approximately 50 dimensions). In other words, each customer is characterized by 50 attributes (variables) that are differently scaled. The following variable types are present in the data:  age, average salary , credit score (industry specific credit • numeri numeric(i c(inte nterva rvalsc lscale aled)v d)vari ariabl abless essuch uchas as age bureau), etc; • catego categorical( rical(nomin nominal),w al),withasm ithasmallnum allnumberofca berofcategori tegoriessuch essuchas as periodicity  (reflects payroll frequency) with just 4 categories; customer’s s bank routing • catego categorical, rical,withala withalargenum rgenumberofc berofcategor ategories(e. ies(e.g., g.,employer name , customer’  number , e-mail domain, etc)  application date, employment date , due date, etc) • da date tev var aria iabl bles es( ( application

 The data also include a geographic variable (customer ZIP), and several customer identification variables such as customer ID, user ID, application number , etc. Unfortunately, the data does not include psychographic profiling variables.  There are several specific variables that we would like to mention: BV Completed  is a variable that answers whether the customer had a bank verification completed by the

loan processor. A value of 1 means the bank verification was completed. A missing value or 0 means it was not. Bank verification involves a 3 way call with the customer and their bank to confirm deposits, account status, etc. Score is an industry specific credit bureau score. Email Domain is a variable that reflects an ending part of the email address after the @ symbol.

 The variable Monthly means monthly income in dollars. Required Loan Amount is the principal amount of a loan at the time of origination. Credit Model  is a predictor that can take the following values:

• Newcustomer Newcustomerscore scorecard cards–ther s–thereareth earethreecr reecreditbu editbureau reauscore scorecard cardsthatex sthatexists,e ists,eachwit achwithmor hmore e stringent approval criteria. The baseline scorecard has only identity verification and an OFAC check while the tightest scorecard has a variety of criteria including inquiry restrictions, information about prior loan payment history, and fraud prevention rules. They are limited to standard loan amounts with standard fees, subject to meeting income requirements. • Retu Returningcu rningcustomer stomershavem shaveminimal inimalunder underwrit writingand ingandareeli areeligible gibleforpr forprogre ogressivel ssivelylarge ylargerloan rloan amounts with a fee below the standard fee for new customers.


Isoriginated is either 1 for originated loans or 0 for unoriginated loans. Withdrawn applications and denied applications will have values of 0. Loans that were funded and had a payment attempt will have a value of 1.

Loan Status is the status of the loan. Loan statuses are grouped as follows:

• D designates the class of Good Customers (a loan is paid off successfully with no return).
• P, R, B, and C designate the class of Bad Customers.

Other variable names are self-explanatory.

The available variables can be classified according to their role in the model development process. In statistical terms, variables can be dependent or independent, but in data mining, a dependent variable is called a target, and an independent variable is called an input (or predictor).

It makes sense to consider a two-segment analysis of risk. In two-segment analysis, the target is a binary variable Risk (Good, Bad), or Risk Indicator (1, 0), where 1 corresponds to a Good customer (Risk = Good) and 0 corresponds to a Bad customer (Risk = Bad).

As we mentioned before, each target is associated with a unique optimal regression type model. The outcome of each model can be treated as the corresponding probability of target = 1 which, in turn, can be interpreted as a credit score on a 100 point scale. In other words, the model under consideration serves to estimate the probability/credit score of being a Good customer.


expy d ay  d Ppc Exploratory Data Analysis (EDA) and data preprocessing are time consuming but necessary steps of any data analysis and modeling project, and data mining is no exception (see, for example, 9). All major data mining algorithms are computationally intensive, and data preprocessing can significantly improve the quality of the model.  The objectives of EDA include understanding the data better, evaluating the feasibility and accuracy of  overall customer risk assessment, estimating the predictability of Good/Bad customers, and identifying the best modeling methodology of credit scoring modeling, and in particular, customer segmentation with High,Medium,andLowrisk. SLC data preprocessing might include reduction of the number of categories, creation of new variables, treatment of missing values, etc. For example, in the categorical variable Application Source , the first four characters indicate the market source. When the Market Source variable was constructed, the frequency of each category was calculated (second column in boxes of Graph 1). It turned out that this variable has 45 distinct values, but only 18 categories are large. The rest of the categories were grouped into a new category, OTHR, and the frequency distribution of the modified variable is presented in the left box of  Graph 1.  This example demonstrates the necessity of these preliminary steps: it turns out that the constructed variable Market Source Grouped  is selected as an important predictor, predictor, whereas the original Market Source variable is not. Graph1a.VariableTransformation/Grouping:MarketSourcevariable

Original values (45 distinct values)

Value     Count    Value   Count    Value   Count
Missing   542      CSUP    508      LDPT    128
1000      4        CTAR    4        LDRV    99
2WAL      2        CVCL    14       LEAD    1
ACTM      2        CWST    3        LFLA    1748
ADVA      2        DCSM    44       MKGN    35
BSDE      56       ECOM    11       MNTZ    172
CLBP      2        EDEN    8        MTDR    80
COLM      1        EFOR    9        NETM    2
CRED      66       FRND    86       PART    583
CRSC      56       ILON    4        PDHO    4
CRTD      2        IMPL    11       SUNS    1
CRUE      70       LDCL    70       SWIS    344
CRUF      55       LDCM    4        TFSP    2
CRVF      36       LDFN    9        VEND    25
CSPD      1        LDPL    79       XMLL    15

Grouped values (20 distinct values)

Value     Count    Value   Count
BSDE      56       LDPL    79
CRED      66       LDPT    128
CRSC      56       LDRV    99
CRUE      70       LFLA    1748
CRUF      55       MISS    542
CRVF      36       MKGN    35
CSUP      508      MNTZ    172
DCSM      44       OTHR    223
FRND      86       PART    583
LDCL      70       SWIS    344
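The grouping step illustrated in Graph 1a can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the `min_count` cutoff is an assumed parameter, since the paper does not state the rule that was used to decide which categories count as large.

```python
from collections import Counter


def market_source(application_source):
    """Take the first four characters of Application Source as the market source."""
    return application_source[:4] if application_source else None


def group_rare_categories(values, min_count, other_label="OTHR", missing_label="MISS"):
    """Relabel missing values as missing_label and rare categories as other_label.

    Assumption: a category is 'rare' when it occurs fewer than min_count times.
    """
    filled = [missing_label if v is None or v == "" else v for v in values]
    counts = Counter(filled)
    return [
        v if v == missing_label or counts[v] >= min_count else other_label
        for v in filled
    ]
```

Applying `Counter` to the returned list gives a frequency table of the same shape as the grouped values in Graph 1a.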


Another problem with the data is the misspelling and/or double naming of one and the same category for some categorical variables. In particular, the variable Email Domain has a lot of spelling errors. For instance, there are 6 different spelling versions of yahoo.com:

Email Domain     Number of Customers
yaho.com         2
yahoo.com        2023
yhaoo.com        3
Yahoo.com        6
YAHOO.COM        402
YAOO.COM         1

and 7 different versions of the domain sbcglobal.net:

Email Domain     Number of Customers
sbcglobal.ne     1
sbcglobal.net    194
sbcgloblal.net   1
sbcgolbal.net    1
sbclobal.net     1
SBCGLOBA.NET     1
SBCGLOBAL.NET    33

In order to produce meaningful results, all misspellings should be corrected.

According to our intuition, the variable Score is the most important for correctly predicting the probability of being a good customer. The first thing that can be done is to discriminate between customers using just the Score predictor. Graph 1b shows that it is not easy to do manually.
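The Email Domain clean-up called for above could be approached along the following lines. This is a sketch under stated assumptions rather than the procedure actually used: it assumes a reference list of correct domains is available (`CANONICAL_DOMAINS` is hypothetical and contains only the two domains from the tables), lowercases each value, and then uses the standard library's `difflib.get_close_matches` to map near-misses to the closest reference domain. The similarity cutoff of 0.8 is also an assumed value that would need tuning on real data.

```python
import difflib

# Assumed reference list of correctly spelled domains (illustrative only).
CANONICAL_DOMAINS = ["yahoo.com", "sbcglobal.net"]


def normalize_domain(raw, canonical=CANONICAL_DOMAINS, cutoff=0.8):
    """Lowercase a domain and map close misspellings to a canonical spelling.

    Domains that are not similar enough to any canonical entry are
    returned lowercased but otherwise unchanged.
    """
    domain = raw.strip().lower()
    if domain in canonical:
        return domain
    matches = difflib.get_close_matches(domain, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else domain
```

Case differences alone account for most of the variants in the tables, so the lowercasing step does much of the work; fuzzy matching handles the remaining handful of typos.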


Graph 1b. Distribution of the variable Score for Bad and Good customers

[Histograms of Score (horizontal axis, 0 to 1000) against Frequency (vertical axis, 0 to 700), one panel per segment: Risk = Bad and Risk = Good]

Constructionofadditionalvariablescandramaticallyimprovetheaccuracyofriskprediction.Newtime duration variables orig_duration = Origination Date – Application Date emp_duration = Origination Date – Employment Date due_duration = Loan Due Date – Origination Date

serve as examples of new variable creation. For the sake of illustrating the importance of data preprocessing, we can mention here that the latter two of these three variables were important predictors selectedbytheTreeNetalgorithm. In order to better understand the relationship between several interval scaled (continuous) variables, quite often a special visualization tool (a matrix plot) is used. The matrix plot (Graph 2) was developed for the following four variables: Requested Loan Amount, Finance Charge, Score, and Applicant Age for both segments: Good and Bad customers.  There is no obvious difference in the relationship of any pair of variables between two segments of  customers (Good /Bad).
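The three duration variables defined above are simple date differences. A minimal Python sketch follows; the record field names are hypothetical, and measuring the durations in days is an assumption, since the paper does not state the unit.

```python
from datetime import date


def add_duration_variables(record):
    """Return a copy of the record with the three duration variables, in days."""
    out = dict(record)
    out["orig_duration"] = (record["origination_date"] - record["application_date"]).days
    out["emp_duration"] = (record["origination_date"] - record["employment_date"]).days
    out["due_duration"] = (record["loan_due_date"] - record["origination_date"]).days
    return out


# Made-up example record, for illustration only.
example = add_duration_variables({
    "application_date": date(2008, 3, 3),
    "origination_date": date(2008, 3, 5),
    "employment_date": date(2006, 3, 5),
    "loan_due_date": date(2008, 3, 19),
})
```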


Graph2.MatrixPlotsandHistograms

Risk = Bad

Risk = Good

The correlation structure of interval scaled variables can be different for different segments. In order to check this hypothesis, let us select all interval scaled variables and estimate the non-parametric correlation coefficient (Spearman correlation) for each pair of the following 8 variables: Required Loan Amount, Financial Charge, Average Salary, Score, Applicant Age, and the three duration variables defined above. The Spearman correlation is used when the pair of variables under consideration does not follow a bivariate normal distribution.
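For readers who want to see what the Spearman coefficient computes, here is a dependency-free Python sketch: each variable is replaced by its ranks (ties receive the average rank) and the ordinary Pearson correlation is taken on the ranks. The function names are hypothetical; in practice a statistical package would be used instead.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

To compare segments, the coefficient would be computed for each pair of variables twice, once on the Good records and once on the Bad records.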


Graph3.Non-parametriccorrelationanalysis,basedonSpearmancorrelation

Risk = Bad

Risk = Good

There are no obvious differences in correlation structure among the 8 numeric variables. In other words, the correlation structure among predictors is similar for the Good and Bad segments.

The complexity of SLC data can be characterized by:

• High dimensionality (about 50 predictors)
• Uncharacterizable non-linearities
• Presence of differently scaled predictors (numeric and categorical)
• Missing values for some predictors
• Large percentage of categorical predictors with extremely large numbers of categories and extremely non-uniform frequency distributions
• Non-normality of numeric predictors.

Therefore, complex sophisticated methods should be employed to separate good and bad accounts in the SLC data.


My Data and problem specificity limit the number of algorithms that can be used for SCL data analysis.  Any traditional parametric regression modeling approach (such as statistical logistic regression) and any traditionalnonparametricregression(suchasLowess,GeneralizedAdditiveModels,etc.)areinadequate for such problems. The main reason for this is the presence of a large number of categorical variables with huge numbers of categories. The inclusion of such categorical information in a multidimensional dataset imposes a serious challenge to the way researchers analyze data (10).  Any approach based on linear, integer or non-linear programming (see, for example, 7, Chapter 5), is also not the best approach for the same reasons. Within the data mining universe, only some algorithms can be applicable to SLC data. For example, data mining cluster analysis algorithms available in some of the best data mining software (SAS Enterprise MinerandSPSSClementine)arebasedonEuclideandistanceandcannotbeusedforthesamereasons as above. On the other hand, preliminary analyses and modeling that we had conducted have shown that the accuracyofthenonlinear,nonparametricregressiontypemodelsgeneratedbytheTreeNetandRandom Forest algorithms are acceptable. We should note that the use of each of the applicable methods implies that the original data are randomly separated into two parts: the first is for training (model development) and the second is for validation of  the model. Validation is the process of testing the developed model on unseen data. SLC describes possible findings in the analysis by the following example: “Customers that are 29 years old, live in Pennsylvania, and make less than $2,000 per month have an 88% chance of default.” This is a typical representation of a CART / CHAID type regression tree algorithm findings. 
The set of CART /  CHAID type rules can easily be applied to unseen data, and can be embedded into the SLC loan credit risk evaluation online system. Unfortunately, it is quite possible that the best credit scoring model will not be a CART / CHAID type of  regressiontreemodel.NonparametricandnonlinearTreeNetorRandomForesttyperegressionmodels as a rule outperform CART / CHAID type of models on data similar to SLC data. If this is the case, a simple representation of the best model as a set of simple rules as mentioned above is impossible. We can gain the accuracy of risk prediction, but lose the simplicity of model/finding representation. If this tradeoff between prediction accuracy and model representation simplicity is to be resolved in favor ofaccuracy,thentheprojectshouldincludethedevelopmentofa.Netcomponentimplementingthebest  TreeNetorRandomForestscoringmodelthatcouldberunindependentlyonSalfordSystemssoftware and integrated into the SLC loan credit risk evaluation online system. Inaddition,standarddataminingtoolssuchasSPSSClementine,SASEnterpriseMiner,andSalford Systems have between 20 and 100 model parameter options that need to be specified by the researcher.  The settings that would produce the best model could only be found through extensive systematic experimentation by a data mining expert and lead to the optimal model. For the SLC business problem, Copy Co pyri righ ghtt © Pav Pavel el bru brusi silo lovs vskiy kiy – busi busine ness ss Int Intel elli lige genc nce e Solu Soluti tion onss and and Davi David d Johns Johnson on – Str Strat ateg egic ic Lin Link k Cons Consul ulti ting ng

[ 12 ]

business Intelligence Solutions – White Paper Credit Risk Evaluation of Online Personal Loan Applicants: A Data mining Approach www.isolutions.us | Data. Knowledge. Action.

this would mean a significantly reduced number of misclassified customers (i.e., customers with a wrongly estimated credit risk level). However, the search for the optimal model is a combination of art and science, and again, requires experience and expertise in the data mining field.
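The random separation into training and validation parts mentioned in this section, together with the misclassification count used to judge a model on unseen data, can be sketched as follows. The function names, the 30% validation share and the fixed seed are illustrative assumptions; the paper does not state the proportions it used.

```python
import random


def train_validation_split(records, validation_fraction=0.3, seed=0):
    """Randomly split records into (training, validation) lists.

    Assumption: 30% of the records are held out for validation by default.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def misclassification_rate(actual, predicted):
    """Share of customers whose predicted class differs from the actual class."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)
```

The model is fitted on the training list only, and `misclassification_rate` is then evaluated on the validation list.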

scc g B (tn) ow Stochastic gradient boosting was invented in 1999 by Stanford University Professor Jerome Friedman (1, 2). Salford Systems - a California based data mining software development company (http://www. (http://www. salford-systems.com)hasimplementedandcommercializedthisinventionasaTreeNetproductin salford-systems.com )hasimplementedandcommercializedthisinventionasaTreeNetproductin 2002.TheTreeNetwasthefirststochasticgradientboostingtoolintheworlddataminingindustry.The intensiveresearchhasshownthatTreeNetmodelsareamongthemostaccurateofanyknownmodeling techniques.TreeNetisalsoknownasMultipleAdditiveRegressionTrees(MART).  TheTreeNetmodelisanon-parametricregressionandcanbedescribedasalinearcombinationofsmall trees (3, 4, 5): Predicted Target =

 AO + B1  x T 1 (X) + B2  x T 2 (X) + B3 x T 3 (X) + ... + B N  x T   N (X). Here the first term  AO is a model starting point; as a rule, it is the median of a target. The idea of the algorithm is the following. The residuals are calculated as the difference between AO and the reality.  Then the residuals are transformed in order to reduce the impact of outliers (Huber’ (Huber’ss adjustment for outliers). The transformed residuals are called pseudo-residuals. The first tree T 1 (X ) is fitted to the pseudo-residuals, and the coefficient B1 is determined. After that the new pseudo-residuals, the difference between predicted values of a target (employing the model AO + B1  x T 1 (X ) ) and reality, are calculated, and the second tree T 1 (X ) is fitted to the new pseudo-residuals. This process is repeated, and the final predicted value of the target is formed by adding the weighted contribution of each tree with the corresponding weights  B1, B2, ... B N .TheTreeNetalgorithmtypicallygenerateshundredsoreven thousands of small trees. This sequential error-correcting process converges to an accurate model that is highly resistant to outliers and misclassified data points.  TheTreeNetalgorithm • isrelative isrelativelyimpe lyimperviou rvioustoerr stoerrorsinth orsinthedepen edependentv dentvariab ariable(tar le(target),s get),suchasmi uchasmislabe slabeling ling • isstr isstrongly onglyresis resistantto tanttooverfi overfitting( tting(pred predicting ictingnoisein noiseinstead steadofpred ofpredicting ictingsignal signal) ) • gen genera eraliz lizesw eswell elltou tounse nseend endata ata. .
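The sequential error-correcting idea described above can be illustrated with a deliberately simplified Python sketch. It is not TreeNet and makes several loudly stated simplifications: each tree is a one-split stump, the plain residuals of a squared-error loss are used in place of Huber-adjusted pseudo-residuals, every tree receives the same weight (a fixed `learning_rate` plays the role of B_1, ..., B_N), and the "stochastic" part is a random subsample of rows drawn for each tree. All names and default values are hypothetical.

```python
import random
from statistics import mean, median


def fit_stump(rows, residuals):
    """Fit a one-split tree: choose the (feature, threshold) with least squared error."""
    best = None
    for f in range(len(rows[0])):
        # Exclude the largest value so the right branch is never empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [res for r, res in zip(rows, residuals) if r[f] <= t]
            right = [res for r, res in zip(rows, residuals) if r[f] > t]
            lm, rm = mean(left), mean(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    if best is None:  # no split possible: fall back to a constant tree
        constant = mean(residuals)
        return lambda r: constant
    _, f, t, lm, rm = best
    return lambda r: lm if r[f] <= t else rm


def fit_boosting(rows, target, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Return a prediction function A_0 + learning_rate * sum of stumps."""
    rng = random.Random(seed)
    a0 = median(target)  # starting point: median of the target
    predictions = [a0] * len(rows)
    trees = []
    sample_size = max(2, int(len(rows) * subsample))
    for _ in range(n_trees):
        idx = rng.sample(range(len(rows)), sample_size)  # stochastic subsample
        residuals = [target[i] - predictions[i] for i in idx]
        tree = fit_stump([rows[i] for i in idx], residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(r) for p, r in zip(predictions, rows)]
    return lambda r: a0 + learning_rate * sum(tree(r) for tree in trees)
```

Each row is a tuple of numeric predictor values; handling of categorical predictors and missing values, which matters a great deal for data like SLC's, is left out of this sketch.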


 TreeNetisthemostflexibleandpowerfuldataminingtool,capableofgeneratingextremelyaccurate models for both regression and classification and can work with varying sizes of data sets (from small to huge) while readily managing a large number of columns (http://www (http://www.salford-systems.com/tr .salford-systems.com/treenet.php eenet.php). ).  The algorithm can handle both continuous and categorical targets and predictors, and readily handle any number of irrelevant predictors. FromnowonmajordataminingsoftwaredeveloperssuchasMegaputerIntelligence http://www.megaputer.com/ andSAS(EnterpriseMinerVersion5.3)http://www.sas.com/  andSAS(EnterpriseMinerVersion5.3)http://www.sas.com/ includeTreeNet includeTreeNet type algorithms in the suite of available tools.  TreeNetmodelsareusuallycomplex,consistingofhundreds(oreventhousands)oftrees,andrequire special efforts to understand and interpret the results. The software generates a number of special reports with visualization to extract the meaning of the model, such as a ranking of predictors according to their importance on a 100 point scale, and graphs of the relationship between inputs and target. InordertounderstandgraphsofreportsforbinarytargetsthatTreeNetgenerates,weneedtoremind ourselves of the concepts of Odds and Log Odds.

l o, o   Pbby   e  The odds of an event (for example, first payment default) is defined as the ratio of the probability that an event occurs to the probability that it fails to occur. Thus, Odds(Event) = Pr(Event) / [1 - Pr(Event)]  The log odds are just the natural logarithm of the odds: ln(Odds). People quite often use the concept of odds to express the likelihood of an event. When you hear someone say that the odds are 3-to-1, it means that the probability of an event occurring is three times greater than the probability of the event not occurring. The shorter way of saying the same: the odds equal 3 (which implies that the odds are 3-to-1). In other words, the odds are 3 means that the probability of the event is .75 and the probability of non-event is .25, i.e., 3-to-1. If the odds are 1-to-3, we could also say that the odds are .3333. The probability of the event is .25 and the probability of non-event is .75.  Another example: saying the odds are 3-to-2 is the equivalent of saying that the odds are 1.5-to-1 or just 1.5, for short. The probability of the event is .6. For an inverse situation the odds are 2-to-3, we could say the odds are .6667 and that the probability is .4. When the odds are 1-to-1, or just 1 for short, the probability of the event is .5.  According to the definition, both odds and log odds are the monotonically increasing function of an event probability (See Graph 4 and Graph 5).


1. If the probability of an event is 0.5, then the odds are equal to 1 and the log odds are equal to 0.
2. If the probability of an event is 0, then the odds are equal to 0 and the log odds are equal to minus infinity.
3. If the probability of an event is 1, then the odds are plus infinity and the log odds are plus infinity as well.

Graph 4. Relationship between Odds and Probability of an event

[Graph: Odds (vertical axis, 0 to 50) plotted against Probability (horizontal axis, 0.0 to 1.0).]


Graph 5. Relationship between Log Odds and Probability of an event

[Graph: Log Odds (vertical axis, -4 to 4) plotted against Probability (horizontal axis, 0.0 to 1.0).]


TreeNet Risk Assessment Models

For the purpose of our analysis we randomly split the data sample into two subsamples. The first subsample, the LEARN data, comprises 60% of the sample and is used only for model estimation. The second subsample, the TEST data, is used to estimate model quality.

The TreeNet algorithm has about 20 different options that can be controlled by a researcher. As a rule, the default options do not produce the best model. Determining the best options, and thus the optimal model, is time consuming and requires experience and expertise.

Models that are quite different (see the First and Second models below) can have similar accuracy; in that case the interpretability criterion should be used to select the best model.
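A random LEARN/TEST partition of this kind can be sketched as follows. This is an illustration only: the function name, the seed, and the stand-in list of applicant IDs are assumptions, and the exact sampling procedure used by the authors is not described in the paper (the misclassification tables report 3,005 LEARN and 1,995 TEST cases, so their split was only approximately 60/40, whereas this sketch cuts exactly).

```python
import random

def split_learn_test(records, learn_fraction=0.6, seed=2008):
    """Randomly partition records into LEARN and TEST subsamples."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(records)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_learn = round(len(shuffled) * learn_fraction)
    return shuffled[:n_learn], shuffled[n_learn:]

# Hypothetical stand-in for the applicant records (IDs only).
applicants = list(range(5000))
learn, test = split_learn_test(applicants)
```

Fixing the seed is a design choice that makes model comparisons repeatable: every candidate model is then estimated and evaluated on the same two subsamples.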

f tn r am m  The target is a binary variable Risk with two possible values: Good  and Bad . The Good  value of the target was selected as a focus event. All predictors are listed in the first column of Table 1. The second column reflects an importance score on a 100 point scale with the highest score of 100 corresponding to the most important predictor. If the score equals 0, the predictor is unimportant at all for the target. The third column, Variance Importance, just visualizes the second column, Score.  This particular model is based on just 8 predictors, but has a risk prediction error of about 14% on learningdata,andariskpredictionerrorofabout19%onvalidationdata.IftheTreeNetalgorithmdidnot select the Score variable, it means that within this model the variable Score is not important. It does not mean that the credit score is superfluous or irrelevant in customer credit risk assessment. It just means that the useful information provided by the variable Score is covered by 8 important predictors, selected byTreeNet(seeTable1).


 Table1.TreeNetmodelvariableimportance Variable

Score

Variable Importance

Bank_Names$

100.00

|||||||||||||||||||||||||||||||||||||||

Merch_Store_ID

86.74

||||||||||||||||||||||||||||||||||

Email_Domains

46.16

||||||||||||||||||

Market_Source_GRPD$

26.06

||||||||||

BV_Completed

23.14

|||||||||

Fin_Charge

9.22

|||

Due_Duration

6.35

||

Emp_Duration

5.37

|

Type_of_Payroll$

0.00

Merchant_Nmbr$

0.00

Credit_Model$

0.00

Periodicity$

0.00

Appramt$

0.00

Req_Loan_Amt

0.00

Appl_Status$

0.00

Avg_Salary

0.00

Courtesy_Days

0.00

 Aba_No

0.00

Score

0.00

Monthly

0.00

Age

0.00

Orig_Duration

0.00

Cust_Acct_Type$

0.00

Isoriginated

0.00


 Table2.TreeNetmodelmisclassificationrate TreeNet Misclassification for Learn Data Class

N Cases

N Mis-Classe

Pct Error

Cost

Bad

1.496

207

13.04

207.00

Good

1,509

198

13.12

198.00

TreeNet Misclassification for Test Data Class Bad Good

N Cases

N Mis-Classe

Pct Error

Cost

1,004

200

19.92

200.00

991

165

16.65

165.00
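The Pct Error column is the number of misclassified cases divided by the number of cases in the class. As a minimal sketch, the test-data figures can be reproduced in Python as shown below; the variable and function names are illustrative, and the counts are copied from the table above.

```python
def pct_error(n_cases, n_misclassified):
    """Misclassification rate for one class, in percent."""
    return 100.0 * n_misclassified / n_cases

# Test-data counts from Table 2: class -> (N cases, N misclassified).
test_counts = {"Bad": (1004, 200), "Good": (991, 165)}

per_class = {
    cls: round(pct_error(n, m), 2) for cls, (n, m) in test_counts.items()
}

# Pooled rate over both classes, weighting each case equally.
total_cases = sum(n for n, _ in test_counts.values())
total_errors = sum(m for _, m in test_counts.values())
overall = round(pct_error(total_cases, total_errors), 2)
```

Here `per_class` evaluates to 19.92 for Bad and 16.65 for Good, in agreement with the test-data rows of the table.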

Graph 6. Impact of the Market Source Grouped predictor on the probability of being a good customer (Risk = Good), controlling for all other predictors.

[Bar chart: One Predictor Dependence for Risk$ = Good. Vertical axis: Partial Dependence (-0.03 to 0.06). Horizontal axis: Market Source GRPD$, with categories CRED, CRSC, CRUE, CRUF, CRVF, CSLP, DCSM, FRND, LDOL, LDPL, LDPT, LDRV, LFLA, MISS, MKGN, MNTZ, OTHR, PART, SMS.]

The Y-axis shows the log odds of the event Risk = Good. Therefore, 0 corresponds to the situation in which the odds are 1-to-1, that is, the probability of the event equals the probability of the non-event. In other words, the X-axis is the baseline that reflects an equal chance of being a good or a bad customer.


The impact of Market Source Grouped is significant and varies across its values. All values of the Market Source Grouped variable with bars above the X-axis increase the probability of being a good customer, and all values with bars below the X-axis decrease it. The value LDPT has the highest positive impact on the probability of being a good customer, and the value CRSC has the highest negative impact.

Table 3. Frequency of Loan Status by Market Source Grouped

Frequency   CRUE   CRUF   LDPT   MISS   Total
C              8     15     31     76     130
D             60     40     92    411     603
P              2      0      5     55      62
Total         70     55    128    542     795

The left column of Table 3 lists the values of the Loan Status predictor, and the top row lists the values of the Market Source Grouped predictor. Table 3 shows the frequency of customers with the following values of the Market Source Grouped variable: CRUE, CRUF, LDPT, and MISS. These values correspond to the tallest positive bars on Graph 6 (the values with the highest positive impact on the probability of being a good customer). Since the value D of the Loan Status predictor designates a Good customer, and the values C and P correspond to a Bad customer (see the Data Structure section), we can infer that there is good agreement between the model (Graph 6) and the data (Table 3).

There is, however, a significant difference between the information presented in Table 3 and in Graph 6. If we forget for a minute about the existence of all other predictors and consider just two of them (Loan Status and Market Source Grouped) using the available data, we arrive at the conclusion that the majority of customers with Market Source Grouped values of CRUE, CRUF, LDPT, and MISS are Good customers. Here we considered the joint frequency distribution of only these two predictors and disregarded the impact of all other predictors. In other words, there is no control for other predictors at all; this is data-induced information.

The information represented by Graph 6, on the contrary, was produced by the developed TreeNet model; it is model-induced information. The relationship between the target (probability of being a good customer) and Market Source Grouped was mapped controlling for all other predictors.
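The data-induced reading of Table 3 amounts to computing, for each Market Source Grouped value, the share of customers whose Loan Status is D. A minimal sketch in Python follows; the dictionary layout and function name are illustrative, while the counts are those reported in Table 3.

```python
# Loan Status counts by Market Source Grouped, copied from Table 3.
# D designates a Good customer; C and P designate a Bad customer.
table3 = {
    "CRUE": {"C": 8, "D": 60, "P": 2},
    "CRUF": {"C": 15, "D": 40, "P": 0},
    "LDPT": {"C": 31, "D": 92, "P": 5},
    "MISS": {"C": 76, "D": 411, "P": 55},
}

def good_share(counts):
    """Fraction of customers whose Loan Status is D (Good)."""
    return counts["D"] / sum(counts.values())

shares = {source: good_share(counts) for source, counts in table3.items()}

# True when Good customers are the majority for every listed value.
majority_good = all(share > 0.5 for share in shares.values())
```

The shares come out at roughly 0.86, 0.73, 0.72, and 0.76 for CRUE, CRUF, LDPT, and MISS, so `majority_good` is true. As the text stresses, this calculation ignores every other predictor.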


Graph7.TreeNetModeling:ImpactofBV Completed  and Market Source Grouped  predictors on Probability of Being a Good Customer (controlling for all other predictors).  Variable Dependence for Risk$; Slice Market Source GRPD$ = BSDE (1) Risk$ = Good 0.06 0.05

  e   c   n   e    d   n   e   p   e    D    l   a    i    t   r   a    P

0.04 0.03 0.02 0.01

Market Source = BSDE

0.00 0

1

BV Completed

Two Variable Dependence for Risk$; Slice Market Source GRPD$ = CRSC (3) Risk$ = Good BV Completed   e   c   n   e    d   n   e   p   e    D    l   a    i    t   r   a    P

0

1

0.02 0.01 0.00

Market Source = CRSC

-0.01 -0.02 -0.03 -0.04

Graph 7 is an example of a non-linear interaction between the BV Completed and Market Source Grouped predictors: for different values of one predictor, the impact of the other on the probability of being a good customer has different directions. For Market Source Grouped = BSDE, both values of the BV Completed predictor have a positive impact on the probability of being a good customer. For Market Source Grouped = CRSC, on the other hand, the value 0 of the BV Completed predictor has a negative impact, while the value 1 has a positive impact.
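Plots of this kind are commonly computed as partial dependence: the model output is averaged over the data while the predictors of interest are held at fixed values. The sketch below illustrates that general idea in plain Python. It is not a description of TreeNet internals; `toy_log_odds`, the field names, and all numbers are hypothetical and were chosen only to mimic the sign pattern described for Graph 7.

```python
from statistics import mean

def toy_log_odds(row):
    """Hypothetical scoring function standing in for a fitted model."""
    base = 0.01 * (row["emp_years"] - 5)  # effect of an unrelated predictor
    shifts = {  # (market_source, bv_completed) -> shift in log odds
        ("BSDE", 0): 0.03, ("BSDE", 1): 0.05,
        ("CRSC", 0): -0.03, ("CRSC", 1): 0.02,
    }
    return base + shifts.get((row["market_source"], row["bv_completed"]), 0.0)

def partial_dependence(predict, rows, fixed):
    """Average prediction when the predictors in `fixed` are forced to set values."""
    return mean(predict({**row, **fixed}) for row in rows)

# Hypothetical applicants; emp_years plays the role of "all other predictors".
rows = [
    {"emp_years": 3, "market_source": "BSDE", "bv_completed": 0},
    {"emp_years": 5, "market_source": "CRSC", "bv_completed": 1},
    {"emp_years": 7, "market_source": "CRSC", "bv_completed": 0},
]

grid = {
    (source, bv): partial_dependence(
        toy_log_odds, rows, {"market_source": source, "bv_completed": bv}
    )
    for source in ("BSDE", "CRSC")
    for bv in (0, 1)
}
```

In this toy setting both BSDE cells of `grid` are positive, while for CRSC the cell with BV Completed = 0 is negative and the cell with BV Completed = 1 is positive. An effect that changes direction with the other predictor is what distinguishes an interaction from a purely additive effect.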


sc tn r am m  AsintheFirstTreeNetmodelconstruction,60%ofthedataarerandomlyselectedtobeusedformodel development (learning), and the remaining 40% of data used for model validation (holdout observations, or test data).  This particular model is based on 17 predictors, and has a risk prediction error of about 9% on learning data, and a risk prediction error of about 20% on validation data.  Table4.TreeNetmodelvariableimportance Variable

Score

Variable Importance

Bank_Names$

100.00

Email_Domains

47.27

Credit_Model$

28.41

Market_Source_GRPD$

24.94

BV_Completed

10.06

Merchant_Nmbr$

8.57

||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||| |||||||||||| ||||||||||| ||| |||

Age

8.34

|||

Score

8.21

|||

Orig_Duration

7.33

||

Emp_Duration

7.33

||

Appramt$

6.90

||

Monthly

6.69

||

Due_Duration

6.47

||

Courtesy_Days

6.24

||

Cust_Zip

5.36

|

Avg_Salary

4.84

|

Fin_Charge

4.09

|

Merch_Store_ID

3.46

Req_Loan_Amt

2.89

Periodicity$

1.77

State_Code$

1.36

Type_of_Payroll$

1.25

 Aba_No

0.00

Isoriginated

0.00

Cust_Acct_Type$

0.00

Appl_Status$

0.00


 Table5.TreeNetmodelmisclassificationrate TreeNet Misclassification for Learn Data Class

N Cases

N Mis-Classe

Pct Error

Cost

Bad

1.496

143

9.56

143.00

Good

1,509

109

7.22

109.00

TreeNet Misclassification for Test Data Class Bad Good

N Cases

N Mis-Classe

Pct Error

Cost

1,004

205

20.42

200.00

991

217

21.90

217.00

Graph8.TreeNetModeling:ImpactofEmail Domain predictor on Probability of Being a Good Customer, Customer, controllingforallotherpredictors(SecondModel) One Predictor Dependence for Risk$ = Good Email Domains


The impact of Email Domain is extremely significant, but it has different directions for different values. We can distinguish several segments of Email Domain values with different impacts on the probability of being a good customer:

1. Extremely positive impact
2. Modest positive impact
3. Practically no impact
4. Modest negative impact
5. Extremely negative impact

Graph9.TreeNetModeling:ImpactofCredit Model  predictor on Probability of Being a Good Customer, Customer, controllingforallotherpredictors(SecondModel).  Variable Dependence for Risk$; Slice Market Source GRPD$ = BSDE (1) Risk$ = Good 0.06 0.05 0.04 0.03   e   c   n   e    d   n   e   p   e    D    l   a    i    t   r   a    P

0.02 0.01 0.00 -0.01 -0.02 -0.03 -0.04 -0.05

0001

0002

0003

7777

8888

Credit Models

Only the value 0001 of Credit Model is associated with a strong negative impact on the probability of being a Good customer. The value 0003 is associated with the strongest positive impact on that probability.


Table 6. Frequency of Loan Status by Credit Model

Frequency   0001   0002   0003   7777   8080   Total
C            295    116     13    232    193     849
D            331    500    325    712    632    2500
P           1552     99      0      0      0    1651
Total       2178    715    338    944    825    5000
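One way to compare the counts in Table 6 with the direction of the bars in Graph 9 is to compute the empirical log odds of a Good customer for each Credit Model value, taking D as Good and C and P as Bad. The Python sketch below does this; the names are illustrative. These raw log odds do not control for other predictors and are not on the same scale as the partial dependence values, so only their sign and ordering are meaningful for the comparison.

```python
import math

# Loan Status counts by Credit Model, copied from Table 6.
# D designates a Good customer; C and P designate a Bad customer.
table6 = {
    "0001": {"C": 295, "D": 331, "P": 1552},
    "0002": {"C": 116, "D": 500, "P": 99},
    "0003": {"C": 13, "D": 325, "P": 0},
    "7777": {"C": 232, "D": 712, "P": 0},
    "8080": {"C": 193, "D": 632, "P": 0},
}

def empirical_log_odds(counts):
    """ln(Good / Bad) computed directly from the observed frequencies."""
    good = counts["D"]
    bad = counts["C"] + counts["P"]
    return math.log(good / bad)

log_odds_by_model = {m: empirical_log_odds(c) for m, c in table6.items()}

# Credit Model values for which Bad customers outnumber Good ones.
negative = [m for m, value in log_odds_by_model.items() if value < 0]

# Credit Model value with the highest empirical log odds of Good.
strongest = max(log_odds_by_model, key=log_odds_by_model.get)
```

With these counts, 0001 is the only value with negative log odds and 0003 has the largest positive log odds, which is the pattern the text attributes to Graph 9.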

The left column of Table 6 lists the values of the Loan Status predictor (D designates the class of Good customers, and C and P designate the class of Bad customers), and the top row lists the values of the Credit Model predictor (see the Data Structure section for the meaning of the Credit Model values). The data support the direction and strength (size) of the impact on the probability of being a Good customer induced by the model (Graph 9).

Graph 10. TreeNet Modeling: Impact of the Merchant Number predictor on the Probability of Being a Good Customer, controlling for all other predictors (Second Model)

[Bar chart: One Predictor Dependence for Risk$ = Good. Vertical axis: Partial Dependence (-0.05 to 0.01). Horizontal axis: Merchant Numbers, with categories 57201, 57202, 57203, 57204, 57206, 57501.]


Table 7. Frequency of Loan Status by Merchant Number

Frequency   57201   57206   Total
C              43      99     142
D               0     116     116
P               9     162     171
Total          52     377     429

The left column of Table 7 lists the values of the Loan Status predictor (D corresponds to a Good customer, and C and P correspond to a Bad customer), and the top row lists the values of the Merchant Number predictor. Table 7 shows the frequency of customers with the following values of the Merchant Number variable: 57201 and 57206. The majority of customers with these values of Merchant Number belong to the segment of Bad customers. Again, the model-induced knowledge (Graph 10) and the data are in good agreement.


Graph11.TreeNetModeling:ImpactofScore predictor on Probability of Being a Good Customer, Customer, controllingforallotherpredictors(SecondModel) One Predictor Dependence for Risk$ = Good 0.012 0.010 0.008   e   c   n   e    d   n   e   p   e    D    l   a    i    t   r   a    P

0.006 0.004 0.002 0.000 200

300

400

500

600

700

800

900

1000

-0.002 -0.004 -0.006

Score

The credit bureau score (Score predictor) has a binary impact on the probability of being a good customer: if the score is 600 or higher, the probability jumps up; if an applicant's score is less than 600, the probability jumps down, but this negative jump is not very large. In other words, for scores below 600 the impact on the probability of being a good customer is very modest.


Graph12.TreeNetModeling:ImpactofEmployment Duration and Score on Probability of Being a Good Customer,controllingforallotherpredictors(SecondModel) One Predictor Dependence for Risk$ = Good 0.012

0.010

0.008

  e   c   n   e    d   n   e   p   e    D    l   a    i    t   r   a    P

0.006

0.004

0.002

0.000 1000

2000

3000

4000

5000

6000

7000

8000

9000

-0.002

-0.004

-0.006

EMP Duration

If Employment Duration is less than 1,500 days, there is no strong impact on the probability of being a Good customer (the part of the Graph 12 curve for Employment Duration below 1,500 days can be treated as noise). The real impact of Employment Duration starts at 1,500 days and is linear up to 5,000 days. Beyond 5,000 days, the effect on the probability of being a Good customer shows diminishing returns.


Graph 13. TreeNet Modeling: Impact of Age on the Probability of Being a Good Customer, controlling for all other predictors (Second Model).

[Graph: One Predictor Dependence for Risk$ = Good. Vertical axis: Partial Dependence (-0.004 to 0.002). Horizontal axis: Age (20 to 70).]

The highest probability of being a Good customer occurs for applicants between 38 and 42 years of age. If the age is less than 32, the impact is negative, and the younger the applicant, the lower the probability of being a good customer. On the other hand, the strength of the positive impact on the probability of being a Good customer declines as age increases.


Graph14.TreeNetModeling:ImpactoftheinteractionofBV Completed  and Credit Model  predictors on ProbabilityofBeingaGoodCustomer,controllingforallotherpredictors(SecondModel) Two Predictor Dependence for Risk$ = Good

The vertical axis maps the log odds of the event Risk = Good. The two other axes correspond to the values of the BV Completed and Credit Model predictors. The combination of BV Completed = 1 and Credit Model = 7777 has the strongest positive impact on the probability of being a Good customer, whereas the combination of BV Completed = 0 and Credit Model = 0001 has the strongest negative impact on that probability.


gis appc In order to improve the quality of the data mining predictive models, it is useful to enrich SLC’s data with additional region-level demographic, socioeconomic and housing variables that can be obtained from the US Bureau of the Census. These variables include • • • •

median medi anho hous useh ehol oldi dinc ncom ome e education medi me dian ang grros oss srren entt medi me dian anho hous usev eval alue ue,e ,etc tc..

 The variables can be obtained at different geographic levels, namely at the ZIP Code and the Census block levels. Because the SLC data include ZIP code as one of the variables, it will be possible to merge the Zip level Census data to the SLC data directly. However, if customer address data are available, it will be advisable to obtain the Census data for smaller geographic regions (namely, (namely, Census blocks). Because Census blocks are in general much smaller than Zip codes, the Census estimates for these areas will be much more precise and much more applicable than for their ZIP code counterparts. Using the Geographic Information Systems software, customer addresses can be geocoded (i.e. the latitude and longitude of the addresses can be determined, and the addresses can be mapped). Then, it will be possible to spatially match the addresses to their respective Census blocks (and Census block data). Demographic, socioeconomic, and housing data can then be obtained at the Census Block level.  Although geocoding is a time intensive procedure, enriching the SLC data with the Census block level data will make the accuracy of the credit score even higher. Employing dissimilar data mining tools, it is easy to determine which Census variables are crucial for customer risk assessment. When corresponding data become available, maps produced by the GIS will enable us to visually identify zip codes with many bad (high) risk customers and zip codes with many good (low) risk customers (Graph 15 and Graph 16).
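The ZIP-level enrichment described above is essentially a join on ZIP Code, and the map in Graph 15 is based on a per-ZIP percentage of Bad customers. The Python sketch below illustrates both steps under stated assumptions: all records, field names, and Census values are hypothetical, and no actual SLC or Census Bureau file layout is implied.

```python
from collections import defaultdict

# Hypothetical customer records; ZIP Codes are strings to keep leading zeros.
customers = [
    {"id": 1, "zip": "19115", "risk": "Bad"},
    {"id": 2, "zip": "19115", "risk": "Good"},
    {"id": 3, "zip": "08540", "risk": "Good"},
    {"id": 4, "zip": "99999", "risk": "Bad"},  # ZIP with no Census record
]

# Hypothetical ZIP-level Census attributes keyed by ZIP Code.
census_by_zip = {
    "19115": {"median_household_income": 50000, "median_gross_rent": 900},
    "08540": {"median_household_income": 110000, "median_gross_rent": 1500},
}

def enrich(records, census):
    """Left-join Census attributes onto customer records by ZIP Code."""
    return [{**rec, **census.get(rec["zip"], {})} for rec in records]

def percent_bad_by_zip(records):
    """Share of Bad-risk customers in each ZIP Code, in percent."""
    totals, bad = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["zip"]] += 1
        bad[rec["zip"]] += rec["risk"] == "Bad"  # True counts as 1
    return {z: 100.0 * bad[z] / totals[z] for z in totals}

enriched = enrich(customers, census_by_zip)
pct_bad = percent_bad_by_zip(customers)
```

The join is a left join, so customers in ZIP Codes without a Census record are kept and simply lack the extra fields. Matching at the Census block level would additionally require geocoding and a spatial join, which this sketch does not attempt.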


Graph 15. Percent of Bad Customers in Each ZIP Code

[Map. Legend: % of clients for whom Risk = bad: 0; 1 - 37; 38 - 57; 58 - 83; 84 - 100; No Data Available]


Graph16.NumberofBadCustomersinEachZIPCode

Legend: % of clients for whom Risk = bad 0-1 2-3 4-6 7 - 11 12 - 18 No Data Available

Cc KnowledgegeneratedbytheTreeNetmodelsisingoodcompliancewiththedataandreadily interpretable.TheaccuracyofTreeNetmodelsissuperior interpretable.TheaccuracyofT reeNetmodelsissuperior.T .TreeNetisanappropriatetoolfornonreeNetisanappropriatetoolfornontraditionalscorecarddevelopment,usingSLCdata.TreeNetmodelinducedknowledgeisagreatasset for traditional scorecard development.


rc 1. J. Friedman (1999), GreedyFunctionApproximation:AGradientBoostingMachin GreedyFunctionApproximation:AGradientBoostingMachine e http://www.salford-systems.com/doc/G http://www .salford-systems.com/doc/GreedyFuncApp reedyFuncApproxSS.pdf  roxSS.pdf  2. J. Friedman (1999), Stochastic Gradient Boosting http://www.salford-systems.com/doc/S http://www .salford-systems.com/doc/StochasticBoostingSS.pdf  tochasticBoostingSS.pdf  3. TreeNetFrequentlyAskedQuestions http://www.salford-systems.com/doc/TreeNetFAQ.pd f  4. DanSteinberg(2006),OverviewofT DanSteinberg(2006),OverviewofTreeNetT reeNetTechnology echnology.StochasticGradientBoosting .StochasticGradientBoosting http://perseo.dcaa.unam.mx/sistemas/doctos/TN_overview.pdf  5. Boosting Trees Trees for Regression and Classification, StatSoft Electronic Electronic Text Text Book http://www.statsoft.com/textbook/stbootres.html 6. Na Naee eemS mSid iddi diqi qi (2005), Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring (Wiley and SAS Business Series), Wiley, 208 p. 7. Thomas,L.C.,Edelman,D.B.,Crook Thomas,L.C.,Edelman,D.B.,Crook,J.N,(2002),CreditScoringanditsApplications,SIAM,250p. ,J.N,(2002),CreditScoringanditsApplications,SIAM,250p. 8. Matigno Matignon n,R.(2007).DataMiningUsingSASEnterpriseMiner.WileyPublishing. 9. Myatt,G.J.(2006).MakingSenseofData: APracticalGuidetoExploratoryDataAnalysisandData Mining.WileyPublishing. 10. Seo, J. and Gordish-Dressman, H. (2007). Exploratory Data Analysis With Categorical  Variables: An Improved Rank-by-Feature Framework and a Case Study . International Journal of Human-Computer Interaction. Available online at http://www.informaworld.com/smpp/  title~content=t775653655
