Predictive Modeling Using Decision Trees


Chapter 2 Predictive Modeling Using Decision Trees
2.1  Introduction to Enterprise Miner
2.2  Modeling Issues and Data Difficulties
2.3  Introduction to Decision Trees
2.4  Building and Interpreting Decision Trees


2.1 Introduction to Enterprise Miner

Objectives
• Open Enterprise Miner.
• Explore the workspace components of Enterprise Miner.
• Set up a project in Enterprise Miner.
• Conduct initial data exploration using Enterprise Miner.


Introduction to Enterprise Miner

Opening Enterprise Miner
1. Start a SAS session. Double-click on the SAS icon on your desktop or select Start → Programs → The SAS System → The SAS System for Windows V8.
2. To start Enterprise Miner, type miner in the command box or select Solutions → Analysis → Enterprise Miner.

Setting Up the Initial Project and Diagram
1. Select File → New → Project….
2. Modify the location of the project folder if desired by selecting Browse….
3. Type the name of the project (for example, My Project).

4. Check the box for Client/server project if needed. Do not check this box unless instructed to do so by the instructor.



You must have access to a server running the same version of Enterprise Miner. This allows you to access databases on a remote host or distribute the data-intensive processing to a more powerful remote host. If you create a client/server project, you will be prompted to provide a server profile and to choose a location for files created by Enterprise Miner.


5. Select Create. The project opens with an initial untitled diagram.
6. Click on the diagram title and type a new title if desired (for example, Project1).


Identifying the Workspace Components
1. Observe that the project window opens with the Diagrams tab activated. Select the Tools tab located to the right of the Diagrams tab in the lower-left portion of the project window. This tab enables you to see all of the tools (or nodes) that are available in Enterprise Miner.

Many of the commonly used tools are shown on the toolbar at the top of the window. If you want additional tools on this toolbar, you can drag them from the window above onto the toolbar. In addition, you can rearrange the tools on the toolbar by dragging each tool to the desired location on the bar.
2. Select the Reports tab located to the right of the Tools tab. This tab reveals any reports that have been generated for this project. This is a new project, so no reports are currently available.
The open space on the right is your diagram workspace. This is where you graphically build, order, and sequence the nodes you use to mine your data and generate reports.


The Scenario
• Determine who should be approved for a home equity loan.
• The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan.
• The input variables are variables such as the amount of the loan, amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.


The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).
The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.


   Name     Model Role  Measurement Level  Description
   BAD      Target      Binary             1=defaulted on loan, 0=paid back loan
   REASON   Input       Binary             HomeImp=home improvement, DebtCon=debt consolidation
   JOB      Input       Nominal            Six occupational categories
   LOAN     Input       Interval           Amount of loan request
   MORTDUE  Input       Interval           Amount due on existing mortgage
   VALUE    Input       Interval           Value of current property
   DEBTINC  Input       Interval           Debt-to-income ratio
   YOJ      Input       Interval           Years at present job
   DEROG    Input       Interval           Number of major derogatory reports
   CLNO     Input       Interval           Number of trade lines
   DELINQ   Input       Interval           Number of delinquent trade lines
   CLAGE    Input       Interval           Age of oldest trade line in months
   NINQ     Input       Interval           Number of recent credit inquiries

The credit scoring model computes a probability of a given loan applicant defaulting on loan repayment. A threshold is selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.
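To make the thresholding concrete, here is a minimal SAS sketch (illustrative only; the data set SCORED, the variable P_BAD, and the 0.5 cutoff are assumptions, not part of the course data):

   data decisions;
      set scored;                 /* assumed: one row per applicant, P(default) in p_bad */
      length decision $ 7;
      if p_bad > 0.5 then decision = 'reject';
      else decision = 'approve';
   run;

In practice the cutoff would be chosen from business costs and validated performance, as discussed later in the chapter.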


Project Setup and Initial Data Exploration

Using SAS Libraries
To identify a SAS data library, you assign it a library reference name, or libref. When you open Enterprise Miner, several libraries are automatically assigned and can be seen in the Explorer window.
1. Double-click on the Libraries icon in the Explorer window.

To define a new library:
2. Right-click in the Explorer window and select New.


3. In the New Library window, type a name for the new library. For example, type CRSSAMP.
4. Type in the path name or select Browse to choose the folder to be connected with the new library name. For example, the chosen folder might be located at C:\workshop\sas\dmem.
5. If you want this library name to be connected with this folder every time you open SAS, select Enable at startup.

6. Select OK. The new library is now assigned and can be seen in the Explorer window.
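The point-and-click steps above are equivalent to submitting a LIBNAME statement in the Program Editor; a minimal sketch using the folder from step 4 (placing the statement in autoexec.sas plays the role of Enable at startup):

   libname crssamp 'C:\workshop\sas\dmem';   /* assign the libref CRSSAMP */

   proc datasets lib=crssamp;                /* list the data sets the library contains */
   quit;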


7. To view the data sets that are included in the new library, double-click on the icon for Crssamp.

Building the Initial Flow
1. Presuming that the diagram Project1 in the project named My Project is open, add an Input Data Source node by dragging the node from the toolbar or from the Tools tab to the diagram workspace.
2. Add a Multiplot node to the workspace to the right of the Input Data Source node. Your diagram should appear as shown below.

Observe that the Multiplot node is selected (as indicated by the dotted line around it), but the Input Data Source node is not selected. If you click in any open space on the workspace, all nodes become deselected.
In addition to dragging a node onto the workspace, there are two other ways to add a node to the flow. You can right-click in the workspace where you want the node to be placed and select Add node from the pop-up menu, or you can double-click where you want the node to be placed. In either case, a list of nodes appears, enabling you to select the desired node.
The shape of the cursor changes depending on where it is positioned. The behavior of the mouse commands depends on the shape as well as the selection state of the node over which the cursor is positioned. Right-click in an open area to see the menu. The last three menu items (Connect items, Move items, Move and Connect) enable you to modify the ways in which the cursor can be used. Move and Connect is selected by default, and it is highly recommended that you do not change this setting. If your


cursor is not performing a desired task, check this menu to make sure that Move and Connect is selected. This selection allows you to move the nodes around the workspace as well as connect them. Observe that when you put your cursor in the middle of a node, the cursor appears as a hand.
To move the nodes around the workspace:
1. Position the cursor in the middle of the node until the hand appears.
2. Press the left mouse button and drag the node to the desired location.
3. Release the left mouse button.
To connect the two nodes in the workspace:
1. Ensure that the Input Data Source node is deselected. It is much easier to drag a line when the node is deselected. If the beginning node is selected, click in an open area of the workspace to deselect it.
2. Position the cursor on the edge of the icon representing the Input Data Source node (until the crosshair appears).
3. Press the left mouse button and immediately begin to drag in the direction of the Multiplot node. If you do not begin dragging immediately after pressing the left mouse button, you select only the node. Dragging a selected node generally results in moving the node; that is, no line forms.
4. Release the mouse button after reaching the edge of the icon that represents the ending node.
5. Click away from the line and the finished arrow forms as shown below. (Screenshots: initial appearance; final appearance.)


Identifying the Input Data
This example uses the HMEQ data set in the CRSSAMP library.
1. To specify the input data, double-click on the Input Data Source node or right-click on this node and select Open…. The Data tab is active. Your window should appear as follows:

2. Click on Select… to select the data set. Alternatively, you can enter the name of the data set.
3. The SASUSER library is selected by default. To view data sets in the CRSSAMP library, click on the arrow and select CRSSAMP from the list of defined libraries.


4. Select HMEQ from the list of data sets in the CRSSAMP library and then select OK. The dialog shown below opens.

Observe that this data set has 5,960 observations (rows) and 13 variables (columns). CRSSAMP.HMEQ is listed as the source data. You could have typed this name into the field instead of selecting it through the dialog. Note that the lower-right corner indicates a metadata sample of size 2,000.
All analysis packages must determine how to use variables in the analysis. Enterprise Miner utilizes metadata in order to make a preliminary assessment of how to use each variable. By default, it takes a random sample of 2,000 observations from the data set of interest and uses this information to assign a model role and a measurement level to each variable. To take a larger sample, you can select the Change… button in the lower-right corner of the dialog. However, that is not shown here.
1. Click on the Variables tab to see all of the variables and their respective assignments.
2. Click on the first column heading, labeled Name, to sort the variables by their name. You can see all of the variables if you enlarge the window. The following table shows a portion of the information for each of the 13 variables.


Observe that two of the columns are grayed out. These columns represent information from the SAS data set that cannot be changed in this node. Type is either character (char) or numeric (num), and it affects how a variable can be used. The value for Type and the number of levels in the metadata sample of 2,000 is used to identify the model role and measurement level.
The first variable is BAD, which is the target variable. Although BAD is a numeric variable in the data set, Enterprise Miner identifies it as a binary variable because it has only two distinct nonmissing levels in the metadata sample. The model role for all binary variables is set to input by default. You need to change the model role for BAD to target before performing the analysis.
The next five variables (CLAGE through DEROG) have the measurement level interval because they are numeric variables in the SAS data set and have more than 10 distinct levels in the metadata sample. The model role for all interval variables is set to input by default.
The variables JOB and REASON are both character variables in the data set, but they have different measurement levels. REASON is binary because it has only two distinct nonmissing levels in the metadata sample. The model role for JOB, however, is nominal because it is a character variable with more than two levels. For the purpose of this analysis, treat the remaining variables as interval variables.



At times, variables such as DEROG and DELINQ will be assigned the model role of ordinal. A variable is listed as ordinal when it is a numeric variable with more than two but no more than ten distinct nonmissing levels in the metadata sample. This often occurs with counting variables, such as a variable for the number of children. Because this assignment depends on the metadata sample, the measurement level of DEROG or DELINQ for your analysis might be set to ordinal. All ordinal variables are set to have the input model role; however, you treat these variables as interval inputs for the purpose of this analysis.
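You can mimic this metadata scan in code. The sketch below (an illustration of the rule, not Enterprise Miner's internal logic) draws a 2,000-observation random sample and counts distinct nonmissing levels, the quantity on which the binary/ordinal/interval assignment is based; the seed is arbitrary:

   proc surveyselect data=crssamp.hmeq out=meta
                     method=srs sampsize=2000 seed=12345;
   run;

   proc sql;
      /* numeric: 2 levels = binary, 3-10 = ordinal, more than 10 = interval */
      select count(distinct bad)   as bad_levels,
             count(distinct derog) as derog_levels,
             count(distinct value) as value_levels
      from meta;
   quit;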


Identifying Target Variables
BAD is the response variable for this analysis. Change the model role for BAD to target. To modify the model role information, proceed as follows:
1. Position the tip of your cursor over the row for BAD in the Model Role column and right-click.
2. Select Set Model Role → target from the pop-up menu.

Inspecting Distributions
You can inspect the distribution of values in the metadata sample for each of the variables. To view the distribution of BAD:
1. Position the tip of your cursor over the variable BAD in the Name column.
2. Right-click and observe that you can sort by name, find a name, or view the distribution of BAD.
3. Select View Distribution of BAD to see the distribution of values for BAD in the metadata sample.

To obtain additional information, select the View Info tool from the toolbar at the top of the window and click on one of the bars. Enterprise Miner displays the level and the proportion of observations represented by the bar.
These plots provide an initial overview of the data. For this example, approximately 20% of the observations were loans where the client defaulted. Because the plots are based on the metadata sample, they may vary slightly due to the differences in the sampled observations, but the bar for BAD=1 should represent approximately 20% of the data.
Close the Variable Histogram window when you are finished inspecting the plot. You can evaluate the distribution of other variables as desired.

Modifying Variable Information
Ensure that the remaining variables have the correct model role and measurement level information. If necessary, change the measurement level for DEROG and DELINQ to interval. To modify the measurement level information:


1. Position the tip of your cursor over the row for DEROG in the Measurement column and right-click.
2. Select Set Measurement → interval from the pop-up menu.
3. Repeat steps 1 and 2 for DELINQ.
Alternatively, you can update the measurement level information for both variables at the same time by highlighting the rows for DEROG and DELINQ simultaneously before following steps 1 and 2 above.

Investigating Descriptive Statistics
The metadata is used to compute descriptive statistics. Select the Interval Variables tab.

,n)esti"ate the minimum )alue, maximum )alue, mean, standard de)iation, percenta"e o missin" obser)ations, ske+ness, and kurtosis or inter)al )ariables. 5ased on business kno+led"e o the data, inspectin" the minimum and maximum )alues indicates no unusual )alues. /bser)e that DE5T,;' has a hi"h percenta"e o missin" )alues $217%. Select the 'lass Varia,les tab.

,n)esti"ate the number o le)els, percenta"e o missin" )alues, and the sort order o each )ariable. /bser)e that the sort order or 5AD is descendin", +hereas the sort order or all the others is ascendin". This occurs because you ha)e a binary tar"et


It is common to code a binary target with a 1 when the event occurs and a 0 otherwise. Sorting in descending order makes level 1 the first level, which is the target event for a binary variable. It is useful to sort other similarly coded binary variables in descending order for interpreting parameter estimates in a regression model.
Close the Input Data Source node, saving changes when prompted.

Additional Data Exploration
Other tools available in Enterprise Miner enable you to explore your data further. One such tool is the Multiplot node. The Multiplot node creates a series of histograms and bar charts that enable you to examine the relationships between the input variables and the binary target variable.
1. Right-click on the Multiplot node and select Run.
2. When prompted, select Yes to view the results. By using the Page Down button on your keyboard, you can view the histograms generated for this data.

From this histogram, you can see that many of the defaulted loans were made to homeowners with either a high debt-to-income ratio or an unknown debt-to-income ratio.
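The same first look can be taken outside Enterprise Miner with base SAS summaries; a sketch:

   proc freq data=crssamp.hmeq;
      tables bad reason job;            /* class variable distributions */
   run;

   proc means data=crssamp.hmeq n nmiss mean std min max skew kurt;
      var loan mortdue value debtinc;   /* interval variable summaries */
   run;

The NMISS column reproduces observations such as the roughly 21% missing rate for DEBTINC.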



When you open a project diagram in Enterprise Miner, a lock is placed on the diagram to avoid the possibility of more than one person trying to change the diagram at the same time.


If Enterprise Miner or SAS terminates abnormally, the lock files are not deleted and the lock remains on the diagram. If this occurs, you must delete the lock file to regain access to the diagram. To delete a lock file:
1. Right-click on the project name in the Diagrams tab of the workspace and select Explore….
2. In the toolbar of the Explorer window that opens, open the file search.
3. In the 'Search for files or folders named' field, type *.lck.
4. Start the search.

5. Once the lock file has been located, right-click on the filename and select Delete. This deletes the lock file and makes the project accessible again.


2.2 Modeling Issues and Data Difficulties

Objectives
• Discuss data difficulties inherent in data mining.
• Examine common pitfalls in model building.


Time Line
Slide figure: project time lines labeled Projected, Actual, Dreaded, and Needed, in which data acquisition and data preparation consume most of the allotted time, leaving little for data analysis.

It is often remarked that data preparation takes 90% of the effort for a given project. The truth is that the modeling process could benefit from more effort than is usually given to it, but after a grueling data preparation phase there is often not enough time left to spend on refining the prediction models.
The first step in data preparation is data acquisition, where the relevant data is identified, accessed, and retrieved from various sources; converted; and then consolidated. In many cases, the data acquisition step takes so long that there is little time left for other preparation tasks such as cleaning.
A data warehouse speeds up the data acquisition step. A data warehouse is a consolidation and integration of databases designed for information access. The source data usually comes from a transaction-update system stored in operational databases.


Data Arrangement

Long-narrow:
   Acct  Type
   2133  MTG
   2133  SVG
   2133  CK
   2653  CK
   2653  SVG
   3544  MTG
   3544  CK
   3544  MMF
   3544  CD
   3544  LOC

Short-wide:
   Acct  CK  SVG  MMF  CD  LOC  MTG
   2133   1    1    0   0    0    1
   2653   1    1    0   0    0    0
   3544   1    0    1   1    1    1

The data often must be manipulated into a structure suitable for a particular analysis-by-software combination. For example, should this banking data be arranged with multiple rows for each account-product combination or with a single row for each account and multiple columns for each product?
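In SAS, moving between the two arrangements is a job for PROC TRANSPOSE. A minimal sketch, assuming the long-narrow table is named LONG with variables ACCT and TYPE (names taken from the slide):

   data long_flag;
      set long;
      ind = 1;                        /* indicator that the account holds the product */
   run;

   proc sort data=long_flag;
      by acct;
   run;

   proc transpose data=long_flag out=wide(drop=_name_);
      by acct;
      id type;                        /* product codes become column names */
      var ind;
   run;

Products an account does not hold come out as missing rather than 0 and would be recoded in a later DATA step.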

Derived Inputs
   Claim Date  Accident Time  Delay  Season  Dark
   11nov96     102396/12:38      19  fall       0
   22dec95     012395/01:42     333  winter     1
   26apr95     042395/03:05       3  spring     1
   02jul94     070294/06:25       0  summer     0
   08mar96     123095/18:33      69  winter     0
   15dec96     061296/18:12     186  summer     0
   09nov94     110594/22:14       4  fall       1

The variables relevant to the analysis rarely come prefabricated with opportunistic data. They must be created. For example, the date that an auto accident took place and the date the insurance claim was filed might not be useful predictors of fraud. Derived variables such as the time between the two events might be more useful.
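Such derived inputs are one-line computations once the fields are stored as SAS date and datetime values; a sketch with assumed table and variable names:

   data claims2;
      set claims;
      delay = claim_date - datepart(accident_time);      /* days between the events */
      dark  = (timepart(accident_time) lt '06:00't) or
              (timepart(accident_time) gt '18:00't);     /* crude darkness flag */
   run;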


Roll Up
   HH    Acct  Sales
   4461  2133    160
   4461  2244     42
   4461  2773    212
   4461  2653    250
   4461  2801    122
   4911  3544    786
   5630  2496    458
   5630  2635    328
   6225  4244     27
   6225  4165    759

Rolled up, one record per household:
   HH    Acct  Sales
   4461  2133      ?
   4911  3544      ?
   5630  2496      ?
   6225  4244      ?

Marketing strategies often dictate rolling up accounts from a single household into a single record (case). This process usually involves creating new summary data. How should the sales figures for multiple accounts in a household be summarized? Using the sum, the mean, the variance, or all three?
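A grouped summarization answers the mechanical part of the question; the sketch below keeps all three candidate summaries so the choice can be made later (table and variable names are assumed):

   proc means data=accounts noprint nway;
      class hh;
      var sales;
      output out=hh_rollup sum=sales_sum mean=sales_mean var=sales_var;
   run;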

Rolling Up Longitudinal Data
   Frequent Flier  Month  Flying Mileage  VIP Member
   10621           Jan               650  No
   10621           Feb                 0  No
   10621           Mar                 0  No
   10621           Apr               250  No
   33855           Jan               350  No
   33855           Feb               300  No
   33855           Mar              1200  Yes
   33855           Apr               850  Yes

In some situations it may be necessary to roll up longitudinal data into a single record for each individual. For example, suppose an airline wants to build a prediction model to target current frequent fliers for a membership offer in the "Very Important Passenger" club. One record per passenger is needed for supervised classification. How should the flying mileage be consolidated if it is to be used as a predictor of club membership?
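One hedged possibility is to keep several competing consolidations and let the modeling step decide which is predictive (PROC SQL sketch; table and variable names are assumed):

   proc sql;
      create table flier_summary as
      select flier,
             sum(mileage)  as total_miles,
             mean(mileage) as avg_monthly_miles,
             max(mileage)  as peak_month_miles
      from mileage_panel
      group by flier;
   quit;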


Hard Target Search
Slide figure: a large mass of unlabeled transactions with only a tiny, unidentified subset of fraud.

The lack of a target variable is a common example of opportunistic data not having the capacity to meet the objectives. For instance, a utility company may have terabytes of customer usage data and a desire to detect fraud, but it does not know which cases are fraudulent. The data is abundant, but none of it is supervised. Another example would be healthcare data where the outcome of interest is progress of some condition across time, but only a tiny fraction of the patients were evaluated at more than one time point. In direct marketing, if customer history and demographics are available but there is no information on response to a particular solicitation of interest, a test mailing is often used to obtain supervised data.
When the data does not have the capacity to solve the problem, the problem needs to be reformulated. For example, there are unsupervised approaches to detecting anomalous data that might be useful for investigating possible fraud.
Initial data examination and analysis does not always limit the scope of the analysis. Getting acquainted with the data and examining summary statistics often inspires more sophisticated questions than were originally posed.


Oversampling
Slide figure: choice-based sampling draws most or all of the rare fraud cases but only a fraction of the plentiful OK cases.

Instead of the lack of a target variable, at times there are very rare target classes (credit card fraud, response to direct mail, and so on). A stratified sampling strategy useful in those situations is choice-based sampling (also known as case-control sampling). In choice-based sampling (Scott and Wild 1986), the data are stratified on the target and a sample is taken from each stratum so that the rare target class will be more represented in the training set. The model is then built on this biased training set. The effects of the input variables on the target are often estimated with more precision with the choice-based sample even when a smaller overall sample size is taken, compared to a random sample. The results usually must be adjusted to correct for the oversampling.
In assessing how much data is available for data mining, the rarity of the target event must be considered. If there are 12 million transactions, but only 500 are fraudulent, how much data is there? Some would argue that the effective sample size for predictive modeling is much closer to 500 than to 12 million.
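A choice-based sample can be drawn with PROC SURVEYSELECT; the sketch below keeps 500 fraud cases and 500 randomly chosen legitimate transactions (data set and variable names are assumed; the stratum sample sizes follow the sort order of FRAUD):

   proc sort data=transactions;
      by fraud;
   run;

   proc surveyselect data=transactions out=train
                     method=srs sampsize=(500 500) seed=27513;
      strata fraud;             /* a separate simple random sample per stratum */
   run;

A model fit to TRAIN would then need its posterior probabilities adjusted back to the true priors, as noted above.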


Undercoverage
Slide figure: applicants divide into Accepted Good, Accepted Bad, and Rejected (No Follow-up); the next generation of applicants is scored by a model that never observed outcomes for the rejected group.

The data used to build the model often does not represent the true target population. For example, in credit scoring, information is collected on all applicants. Some are rejected based on the current criterion. The eventual outcome (good/bad) for the rejected applicants is not known. If a prediction model is built using only the accepted applicants, the results may be distorted when used to score future applicants. Undercoverage of the population continues when a new model is built using data from accepted applicants of the current model. Credit-scoring models have proven useful despite this limitation.
Reject inference refers to attempts to include the rejected applicants in the analysis. There are several ad hoc approaches, all of which are of questionable value (Hand 1997). The best approach (from a data analysis standpoint) is to acquire outcome data on the rejected applicants by either extending credit to some of them or by purchasing follow-up information on the ones who were given credit by other companies.


Errors, Outliers, and Missings
Slide figure: a raw listing of banking fields (checking and savings indicators, ADB, NSF, direct deposit, balance) seeded with suspicious entries such as impossible balances and inconsistent combinations of values.
Are there any suspicious values in the above data?
Inadequate data scrutiny is a common oversight. Errors, outliers, and missing values must be detected, investigated, and corrected (if possible). The basic data scrutiny tools are raw listings, if/then subsetting functions, exploratory graphics, and descriptive statistics such as frequency counts and minimum and maximum values. Detection of such errors as impossible values, impossible combinations of values, inconsistent coding, coding mistakes, and repeated records requires persistence, creativity, and domain knowledge.
Outliers are anomalous data values. They may or may not be errors (likewise, errors may or may not be outliers). Furthermore, outliers may or may not be influential on the analysis.
Missing values can be caused by a variety of reasons. They often represent unknown but knowable information. Structural missing data represent values that logically could not have a value. Missing values are often coded in different ways and sometimes miscoded as zeros. The reasons for the coding and the consistency of the coding must be investigated.
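If/then subsetting is the simplest of these scrutiny tools; a sketch that flags impossible values and combinations (the variable names echo the slide and are purely illustrative):

   data suspects;
      set bank;
      length reason $ 24;
      if balance < 0 then reason = 'negative balance';
      else if svg = 1 and balance = . then reason = 'savings with no balance';
      else if nsf > 0 and ckin = 0 then reason = 'NSF without checking';
      else delete;              /* keep only the suspicious records */
   run;

   proc print data=suspects;
   run;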


Missing Value Imputation
Slide figure: a cases-by-inputs grid with scattered missing cells; although only a few cells are missing, few rows are complete.

Two analysis strategies are:
1. Complete-case analysis. Use only the cases that have complete records in the analysis. If the "missingness" is related to the inputs or to the target, then ignoring missing values can bias the results. In data mining, the chief disadvantage with this strategy is practical. Even a smattering of missing values in a high dimensional data set can cause a disastrous reduction in data. In the above example, only 9 of the 144 values (6.25%) are missing, but a complete-case analysis would only use 4 cases, a third of the data set.
2. Imputation. Fill in the missing values with some reasonable value. Run the analysis on the full (filled-in) data. The simplest types of imputation methods fill in the missing values with the mean (mode for categorical variables) of the complete cases. This method can be refined by using the mean within homogeneous groups of the data. The missing values of categorical variables could be treated as a separate category. For example, type of residence might be coded as own home, buying home, rents home, rents apartment, lives with parents, mobile home, and unknown. This method would be preferable if the missingness is itself a predictor of the target.
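Mean imputation can be sketched with PROC STDIZE in SAS/STAT, whose REPONLY option replaces only the missing values and leaves nonmissing ones untouched; treating a missing category as its own level is a DATA step recode (assumes JOB is long enough to hold the new value):

   proc stdize data=crssamp.hmeq out=hmeq_imputed
               method=mean reponly;
      var loan mortdue value debtinc;      /* interval inputs to fill in */
   run;

   data hmeq_imputed;
      set hmeq_imputed;
      if job = ' ' then job = 'Unknown';   /* missingness as a separate category */
   run;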


The Curse of Dimensionality
Slide figure: eight points fill a one-dimensional interval, become separated in two dimensions, and sparser still in three.
The dimension of a problem refers to the number of input variables (actually, degrees of freedom). Data mining problems are often massive in both the number of cases and the dimension. The curse of dimensionality refers to the exponential increase in data required to densely populate space as the dimension increases. For example, the eight points fill the one-dimensional space but become more separated as the dimension increases. In 100-dimensional space, they would be like distant galaxies. The curse of dimensionality limits our practical ability to fit a flexible model to noisy data (real data) when there are a large number of input variables. A densely populated input space is required to fit highly complex models. In assessing how much data is available for data mining, the dimension of the problem must be considered.


Dimension Reduction
Slide figure: two panels. Redundancy: Input1 and Input2 carry nearly the same information. Irrelevancy: E(Target) is flat across an input.

Reducing the number of inputs is the obvious way to thwart the curse of dimensionality. Unfortunately, reducing the dimension is also an easy way to disregard important information.
The two principal reasons for eliminating a variable are redundancy and irrelevancy. A redundant input does not give any new information that has not already been explained. Unsupervised methods such as principal components, factor analysis, and variable clustering are useful for finding lower dimensional spaces of nonredundant information. An irrelevant input is not useful in explaining variation in the target. Interactions and partial associations make irrelevancy more difficult to detect than redundancy. It is often useful to first eliminate redundant dimensions and then tackle irrelevancy.
Modern multivariate methods such as neural networks and decision trees have built-in mechanisms for dimension reduction.
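Variable clustering, one of the unsupervised methods just mentioned, is available as PROC VARCLUS; a sketch on the HMEQ interval inputs (the MAXEIGEN cutoff is an arbitrary choice):

   proc varclus data=crssamp.hmeq maxeigen=0.7 short;
      var loan mortdue value debtinc yoj derog delinq clage ninq clno;
   run;

One representative variable per cluster can then be carried forward, discarding the redundant remainder.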


Fool's Gold
"My model fits the training data perfectly... I've struck it rich!"

Testing the procedure on the data that gave it birth is almost certain to overestimate performance, for the optimizing process that chose it from among many possible procedures will have made the greatest use of any and all idiosyncrasies of those particular data.
– Mosteller and Tukey (1977)


Data Splitting


In data mining, the standard strategy for honest assessment of generalization is data splitting. A portion, the training data set, is used for fitting the model. The rest is held out for empirical validation. The validation data set is used for monitoring and tuning the model to improve its generalization. The tuning process usually involves selecting among models of different types and complexities. The tuning process optimizes the selected model on the validation data.
Consequently, a further holdout sample is needed for a final, unbiased assessment. The test data set has only one use: to give a final honest estimate of generalization. Consequently, cases in the test set must be treated just as new data would be treated. They cannot be involved whatsoever in the determination of the fitted prediction model.
In some applications, there may be no need for a final honest assessment of generalization. A model can be optimized for expected performance by tuning it on the validation set. It may be enough to know that the prediction model will likely give the best generalization possible without actually being able to say what it is. In this situation, no test set is needed.
With small or moderate data sets, data splitting is inefficient; the reduced sample size can severely degrade the fit of the model. Computer-intensive methods such as cross-validation and the bootstrap have been developed so that all the data can be used for both fitting and honest assessment. However, data mining usually has the luxury of massive data sets.
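Outside the Data Partition node, the same kind of split can be sketched with PROC SURVEYSELECT; OUTALL keeps every observation with a selection flag (a 67/33 train/validation split, seed arbitrary):

   proc surveyselect data=crssamp.hmeq out=split
                     method=srs samprate=0.67 outall seed=27513;
   run;

   data train validate;
      set split;
      if selected then output train;   /* SELECTED is created by the OUTALL option */
      else output validate;
   run;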


Model Complexity
Slide figure: two fits to the same data, one too flexible and one not flexible enough.

Fitting a model to data requires searching through the space of possible models. Constructing a model with good generalization requires choosing the right complexity. Selecting model complexity involves a trade-off between bias and variance.
An insufficiently complex model might not be flexible enough. This leads to underfitting: systematically missing the signal (high bias). A naive modeler might assume that the most complex model should always outperform the others, but this is not the case. An overly complex model might be too flexible. This will lead to overfitting: accommodating nuances of the random noise in the particular sample (high variance). A model with just enough flexibility will give the best generalization.
The strategy for choosing model complexity in data mining is to select the model that performs best on the validation data set. Using performance on the training data set usually leads to selecting too complex a model. (The classic example of this is selecting linear regression models based on R².)


Overfitting
Slide figure: a flexible classification boundary shown on the training set and on the test set.

A very flexible model was used on the above classification problem where the goal was to discriminate between the blue and red classes. The classifier fit the training data well, making only 19 errors among the 200 cases (90.5% accuracy). On a fresh set of data, however, the classifier did not do as well, making 49 errors among 200 cases (75.5% accuracy). The flexible model snaked through the training data, accommodating the noise as well as the signal.

Better Fitting
Slide figure: a more parsimonious boundary shown on the training set and on the test set.

A more parsimonious model was fit to the training data. The apparent accuracy was not quite as impressive as the flexible model (34 errors, 83% accuracy), but it gave better performance on the test set (43 errors, 78.5% accuracy).


Exploring the Data Partition Node

Inspecting Default Settings in the Data Partition Node
1. Add a Data Partition node to the diagram.
2. Connect the Data Partition node to the CRSSAMP.HMEQ node.

3. Open the Data Partition node either by double-clicking on the node, or by right-clicking and selecting Open….

You choose the method for partitioning in the upper-left section of the tab.
• By default, Enterprise Miner takes a simple random sample of the input data and divides it into training, validation, and test data sets.
• To perform stratified sampling, select the Stratified radio button and then use the options in the Stratified tab to set up your strata.


• To perform user-defined sampling, select the User Defined button and then use the options on the User Defined tab to identify the variable in the data set that identifies the partitions.
You can specify a random seed for initializing the sampling process in the lower-left section of the tab. Randomization within computer programs is often started by some type of seed. If you use the same data set with the same seed in different flows, you get the same partition. Observe that re-sorting the data will result in a different ordering of data and, therefore, a different partition, which will potentially yield different results.
The right side of the tab enables you to specify the percentage of the data to allocate to training, validation, and test data. Partition the HMEQ data for modeling. Based on the data available, create training and validation data sets and omit the test data.
4. Set Train, Validation, and Test to 67, 33, and 0, respectively.
5. Close the Data Partition node, and select Yes to save changes when prompted.


2.3 Introduction to Decision Trees

Objectives
• Explore the general concept of decision trees.
• Understand the different decision tree algorithms.
• Discuss the benefits and drawbacks of decision tree models.


Fitted Decision Tree
Slide figure: a fitted decision tree with splits on DEBTINC, DELINQ, and NINQ. A new case (DEBTINC = 20, NINQ = 2, DELINQ = 0, Income = 42K) is dropped down the tree to the leaf containing its predicted probability of BAD.

Banking marketing scenario:
Target = default on a home-equity line of credit (BAD)
Inputs = number of delinquent trade lines (DELINQ), number of credit inquiries (NINQ), debt-to-income ratio (DEBTINC), and possibly many other inputs
Interpretation of the fitted decision tree is straightforward. The internal nodes contain rules that involve one of the input variables. Start at the root node (top) and follow the rules until a terminal node (leaf) is reached. The leaves contain the estimate of the expected value of the target; in this case, the posterior probability of BAD. The probability can then be used to allocate cases to classes. In this case, green denotes BAD and red denotes otherwise.
When the target is categorical, the decision tree is called a classification tree. When the target is continuous, it is called a regression tree.


Divide and Conquer
Slide figure: the root node (n = 5,000, 10% BAD) is split on Debt-to-Income Ratio < 45; the branch satisfying the rule (n = 3,350) has a 5% BAD rate, and the other branch (n = 1,650) has a 21% BAD rate.

The tree is fitted to the data by recursive partitioning. Partitioning refers to segmenting the data into subgroups that are as homogeneous as possible with respect to the target. In this case, the binary split (Debt-to-Income Ratio < 45) was chosen. The 5,000 cases were split into two groups, one with a 5% BAD rate and the other with a 21% BAD rate.
The method is recursive because each subgroup results from splitting a subgroup from a previous split. Thus, the 3,350 cases in the left child node and the 1,650 cases in the right child node are split again in similar fashion.


The Cultivation of Trees
• Split Search: Which splits are to be considered?
• Splitting Criterion: Which split is best?
• Stopping Rule: When should the splitting stop?
• Pruning Rule: Should some branches be lopped off?

Possible Splits to Consider
Slide figure: the number of possible splits plotted against the number of input levels, rising slowly for an ordinal input and explosively (into the hundreds of thousands) for a nominal input.

The number of possible splits to consider is enormous in all but the simplest cases. No split search algorithm exhaustively examines all possible partitions. Instead, various restrictions are imposed to limit the possible splits to consider. The most common restriction is to look at only binary splits. Other restrictions involve binning continuous inputs, stepwise search algorithms, and sampling.
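The counts behind this are standard combinatorics: an ordinal or interval input with L distinct levels admits only the L − 1 binary splits that respect the ordering, while a nominal input with L levels admits 2^(L−1) − 1 distinct binary groupings of its levels. At L = 20, that is 19 possible splits versus 524,287, which is why split searches restrict nominal inputs so aggressively.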


Splitting Criteria

Debt-to-Income Ratio < 45:
             Left   Right   Total
   Not Bad   3196    1304    4500
   Bad        154     346     500

A competing three-way split:
             Left   Center   Right   Total
   Not Bad   2521     1188     791    4500
   Bad        115      162     223     500

A perfect split:
             Left   Right   Total
   Not Bad   4500       0    4500
   Bad          0     500     500

How is the best split determined? In some situations, the worth of a split is obvious. If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless. In contrast, if a split results in pure child nodes, the split is undisputedly best. For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy. All three measure the difference in class distributions across the child nodes. The three methods usually give similar results.
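As a worked illustration on the binary split above: the parent node's bad rate is 500/5,000 = 0.10, so its Gini impurity is 2(0.1)(0.9) = 0.180. The left child (154/3,350 ≈ 0.046) has impurity 2(0.046)(0.954) ≈ 0.088, and the right child (346/1,650 ≈ 0.21) has impurity 2(0.21)(0.79) ≈ 0.33. The weighted child impurity is (3,350/5,000)(0.088) + (1,650/5,000)(0.33) ≈ 0.168, so this split reduces the Gini impurity by roughly 0.012; candidate splits are ranked on reductions of this kind.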


The Right-Sized Tree
Slide figure: stunting (stopping growth early) versus pruning (growing a large tree and cutting it back).

For decision trees, model complexity is the number of leaves. A tree can be continually split until all leaves are pure or contain only one case. This tree would give a perfect fit to the training data but would probably give poor predictions on new data. At the other extreme, the tree could have only one leaf (the root node). Every case would have the same predicted value (no-data rule).
There are two approaches to determining the right-sized tree:
1. Using forward-stopping rules to stunt the growth of a tree (prepruning). A universally accepted prepruning rule is to stop growing if the node is pure. Two other popular rules are to stop if the number of cases in a node falls below a specified limit or to stop when the split is not statistically significant at a specified level.
2. Growing a large tree and pruning back branches (postpruning). Postpruning creates a sequence of trees of increasing complexity. An assessment criterion is needed for deciding the best (sub)tree. The assessment criteria are usually based on performance on holdout samples (validation data or with cross-validation). Cost or profit considerations can be incorporated into the assessment.
Prepruning is less computationally demanding but runs the risk of missing future splits that occur below weak splits.
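Enterprise Miner drives this from the Tree node, but the grow-then-prune strategy can be made concrete with the later SAS/STAT procedure HPSPLIT (not part of Enterprise Miner 4; shown only as a hedged sketch of the two-stage idea):

   proc hpsplit data=crssamp.hmeq seed=27513;
      class bad reason job;
      model bad = reason job loan mortdue value debtinc yoj
                  derog delinq clage ninq clno;
      grow entropy;                       /* grow a large tree */
      prune costcomplexity;               /* prune back the weakest branches */
      partition fraction(validate=0.33);  /* holdout for assessing subtrees */
   run;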


A Field Guide to Tree Algorithms
AID → THAID → CHAID; CART; ID3 → C4.5 → C5.0

Hundreds of decision tree algorithms have been proposed in the statistical, machine learning, and pattern recognition literature. The most commercially popular are CART, CHAID, and C4.5 (C5.0).
There are many variations of the CART (classification and regression trees) algorithm (Breiman et al. 1984). The standard CART approach is restricted to binary splits and uses postpruning. All possible binary splits are considered. If the data is very large, within-node sampling can be used. The standard splitting criterion is based on the Gini index for classification trees and variance reduction for regression trees. Other criteria for multiclass problems (the twoing criterion) and regression trees (least absolute deviation) are also used. A maximal tree is grown and pruned back using v-fold cross-validation. Validation data can be used if there is sufficient data.
CHAID (chi-squared automatic interaction detection) is a modification of the AID algorithm that was originally developed in 1963 (Morgan and Sonquist 1963; Kass 1980). CHAID uses multiway splits and prepruning for growing classification trees. It finds the best multiway split using a stepwise agglomerative algorithm. The split search algorithm is designed for categorical inputs, so continuous inputs must be discretized. The splitting and stopping criteria are based on statistical significance (chi-squared test).
The ID3 family of classification trees was developed in the machine learning literature (Quinlan 1993). C4.5 only considers L-way splits for L-level categorical inputs and binary splits for continuous inputs. The splitting criteria are based on information (entropy) gain. Postpruning is done using pessimistic adjustments to the training set error rate.


Benefits of Trees
• Interpretability: tree-structured presentation
• Mixed measurement scales: nominal, ordinal, interval
• Regression trees
• Robustness
• Missing values
The tree diagram is useful for assessing which variables are important and how they interact with each other. The results can often be written as simple rules such as: If (DEBTINC ≥ 45) or (DEBTINC < 45 and 1 ≤ DELINQ ≤ 2) or (DEBTINC < 45 and ADB > 2 and NINQ > 1), then BAD=yes; otherwise, no.
Splits based on numeric input variables depend only on the rank order of the values. Like many nonparametric methods based on ranks, trees are robust to outliers in the input space.
Recursive partitioning has special ways of treating missing values. One approach is to treat missings as a separate level of the input variable. The missings could be grouped with other values in a node or have their own node. Another approach is to use surrogate splits; if a particular case has a missing value for the chosen split, you can use a nonmissing input variable that gives a similar split instead.
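Because every leaf is a rule, a fitted tree can be deployed as plain DATA step logic; a sketch of the rule above (illustrative only, with the slide's variable names, not the exact tree fitted later in this chapter):

   data scored;
      set applicants;                      /* assumed input table */
      if debtinc >= 45 then bad_pred = 1;
      else if 1 <= delinq <= 2 then bad_pred = 1;
      else if adb > 2 and ninq > 1 then bad_pred = 1;
      else bad_pred = 0;
   run;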

Benefits of Trees
• Automatically detects interactions (AID)
• Automatically accommodates nonlinearity
• Automatically selects input variables
Slide figure: a tree's fitted surface as a multivariate step function over the inputs.


Drawbacks of Trees
• Roughness
• Linear, main effects
• Instability

The fitted model is composed of discontinuous flat surfaces. The predicted values do not vary smoothly across the input space like other models. This roughness is the trade-off for interpretability.
A step function fitted to a straight line needs many small steps. Stratifying on an input variable that does not interact with the other inputs needlessly complicates the structure of the model. Consequently, linear additive inputs can produce complicated trees that miss the simple structure.
Trees are unstable because small perturbations in the training data can sometimes have large effects on the topology of the tree. The effect of altering a split is compounded as it cascades down the tree and as the sample size decreases.


2.4 Building and Interpreting Decision Trees

Objectives
• Explore the types of decision tree models available in Enterprise Miner.
• Build a decision tree model.
• Examine the model results and interpret these results.
• Choose a decision threshold theoretically and empirically.


Building and Interpreting Decision Trees

To complete the first phase of your first diagram, add a Tree node and an Assessment node to the workspace and connect the nodes as shown below:

Examine the default settings for the decision tree.
1. Double-click on the Tree node to open it.
2. Examine the Variables tab to ensure all variables have the appropriate status, model role, and measurement level.



If the model role or measurement level were not correct, it could not be corrected in this node. You would return to the Input Data Source node to make the corrections.


3. Select the Basic tab.

Many of the options discussed earlier for building a decision tree are controlled in this tab. The splitting criteria available depend on the measurement level of the target variable. For binary or nominal target variables, the default splitting criterion is the chi-square test with a significance level of 0.2. Alternately, you could choose to use entropy reduction or Gini reduction as the splitting criterion. For an ordinal target variable, only entropy reduction or Gini reduction is available. For an interval target variable, you have a choice of two splitting criteria: the default F test or variance reduction.
The other options available in this tab affect the growth and size of the tree. By default, only binary splits are permitted, the maximum depth of the tree is 6 levels, and the minimum number of observations in a leaf is 1. However, there is also a setting for the required number of observations in a node in order to split the node. The default is the total number of observations available in the training data set divided by 100.



There are additional options available in the Advanced tab. All of the options are discussed in greater detail in the Decision Tree Modeling course.


4. Close the Tree node.
5. Run the diagram from the Tree node. Right-click on the Tree node and select Run.
6. When prompted, select Yes to view the results. When you view the results of the Tree node, the All tab is active and displays a summary of several of the subtabs.

From the graph in the lower-right corner, you can see that a tree with 18 leaves was originally grown based on the training data set and pruned back to a tree with 8 leaves based on the validation data set. The table in the lower-left corner shows that the 8-leaf model has an accuracy of 89.02% on the validation data set.


7. View the tree by selecting View → Tree from the menu bar. A portion of the tree appears below.

Although the selected tree was supposed to have eight leaves, not all eight leaves are visible. By default, the decision tree viewer displays three levels deep. To modify the levels that are visible, proceed as follows:
1. Select View → Tree Options….
2. Type 6 in the Tree depth down field.
3. Select OK.
4. Verify that all eight leaves are visible.
The colors in the tree ring diagram and the decision tree itself indicate node purity by default. If the node contains all ones or all zeros, the node is colored red. If the node contains an equal mix of ones and zeros, it is colored yellow.


You can change the coloring scheme as follows:
1. Select Tools → Define Colors.

2. Select the Proportion of a target value radio button.

3. Select 0 in the Select a target value table. Selecting zero as the target value makes the leaves with all zeros green and those with no zeros (that is, all ones) red. In other words, leaves that include only individuals who will default on their loan will be red.
4. Select OK.
Inspect the tree diagram to identify the terminal nodes with a high percentage of bad loans (colored red) and those with a high percentage of good loans (colored green).


You can also change the statistics displayed in the tree nodes.
1. Select View → Statistics….

2. To turn off the count per class, right-click in the Select column in the Count per class row. Select Set Select → No from the pop-up menus.
3. Turn off the N in node, Predicted Value, Training Data, and Node ID rows in the same way. This enables you to see more of the tree on your screen.
4. Select OK.

Note that the initial split on DEBTINC has produced two branches. Do the following to determine which branch contains the missing values:


1. Position the tip of the cursor above the variable name DEBTINC directly below the root node in the tree diagram.
2. Right-click and select View competing splits…. The Competing Splits window opens. The table lists the top five inputs considered for splitting as ranked by a measure of worth.

3. Select the row for the variable DEBTINC.
4. Select Browse rule. The Modify Interval Variable Splitting Rule window opens.

The table presents the selected ranges for each of the branches as well as the branch number that contains the missing values. In this case, the branch that contains the values greater than 45.1848 also contains the missing values.
5. Close the Modify Interval Variable Splitting Rule window, the Competing Splits window, and the tree diagram.




You can also see splitting information using the Tree Ring tab in the Results-Tree window. Using the View Info tool, you can click on the partitions in the tree ring plot to see the variable and cutoff value used for each split. The sizes of the resulting nodes are proportional to the size of the segments in the tree ring plot. You can see the split statistics by selecting View → Probe tree ring statistics. You can view a path to any node by selecting it and then selecting View → Path.

You can also determine the variables that were important in growing the tree in the Score tab.
1. Select the Score tab.
2. Select the Variable Selection subtab.

This subtab gives the relative importance of variables used in growing the tree. It also can be used to export new variable roles, which is discussed later in the course.


3. Close the Results window and save the changes when prompted.



New Tree Viewer
A new tree viewer will be available in a future version of Enterprise Miner. To obtain access to this new viewer,
1. In the command bar, type the statement %let emv4tree=1.

2. Press the Return key.
3. Return to the Enterprise Miner window.
4. Right-click on the Tree node and select New view….

Using Tree Options
You can make adjustments to the default tree algorithm that cause your tree to grow differently. These changes do not necessarily improve the classification performance of the tree, but they may improve its interpretability.
The Tree node splits a node into two nodes by default (called binary splits). In theory, trees using multiway splits are no more flexible or powerful than trees using binary splits. The primary goal is to increase interpretability of the final result.
Consider a competing tree that allows up to 4-way splits.
1. Click on the label for the tree in the diagram, and change the label to Default Tree.
2. Add another Tree node to the workspace.
3. Connect the Data Partition node to the Tree node.
4. Connect the Tree node to the Assessment node.


5. Open the new Tree node.
6. Select the Basic tab.
7. Enter 4 for the Maximum number of branches from a node field. This option will allow binary, 3-way, and 4-way splits to be considered.

8. Close the Tree node, saving changes when prompted.
9. Enter DT4way as the model name when prompted. This will remind you that you specified up to 4-way splits.
10. Select OK.
11. Run the flow from this Tree node and view the results.
The number of leaves in the selected tree has increased from 8 to 33. It is a matter of preference as to whether this tree is more comprehensible than the binary split tree. The increased number of leaves suggests to some a lower degree of comprehensibility. The accuracy on the validation set is only 0.25% higher than that of the default model in spite of greatly increased complexity.


If you inspect the tree diagram, there are many nodes containing only a few applicants. You can employ additional cultivation options to limit this phenomenon.
12. Close the Results window.

Limiting Tree Growth
Various stopping or stunting rules (also known as prepruning) can be used to limit the growth of a decision tree. For example, it may be deemed beneficial not to split a node with fewer than 50 cases and to require that each node have at least 25 cases.
Modify the most recently created Tree node and employ these stunting rules to keep the tree from generating so many small terminal nodes.
1. Open the Tree node.
2. Select the Basic tab.
3. Type 25 for the minimum number of observations in a leaf and then press the Enter key.
4. Type 50 for the number of observations required for a split search and then press the Enter key.



The Decision Tree node requires that (observations required for a split search) ≥ 2 × (minimum number of observations in a leaf). In this example, the observations required for a split search must be at least 2 × 25 = 50. A node with fewer than 50 observations cannot be split into two nodes that each have at least 25 observations. If you specify numbers that violate this requirement, you will not be able to close the window.

5. Close and save your changes to the Tree node.



If the Tree node does not prompt you to save changes when you close it, the settings have not been changed. Reopen the node and modify the settings again.


6. Rerun the Tree node and view the results as before.

The optimal tree now has 8 leaves. The validation accuracy has dropped slightly to 88.56%.
7. Select View → Tree to see the tree diagram.

Note that the initial split on DEBTINC has produced four branches.
8. Close the tree diagram and results when you are finished viewing them.

Comparing Models
The Assessment node is useful for comparing models.
1. To run the diagram from the Assessment node, right-click on the Assessment node and select Run.
2. When prompted, select Yes to view the results.


3. In the Assessment Tool window, click and drag to select both of the models.
4. Select Tools → Lift Chart.

A Cumulative %Response chart is shown by default. By default, this chart arranges people into deciles based on their predicted probability of response, and then plots the actual percentage of respondents. To see actual values, click on the View Info tool and then click on one of the lines for the models. Clicking on the Tree-2 line near the upper-left corner of the plot indicates a %Response of 82.06, but what does that mean?
To interpret the Cumulative %Response chart, consider how the chart is constructed.
• For this example, a responder is defined as someone who defaulted on a loan (BAD=1). For each person, the fitted model (in this case, a decision tree) predicts the probability that the person will default.


Sort the observations by the predicted probability of response from the highest probability of response to the lowest probability of response.
• Group the people into ordered bins, each containing approximately 10% of the data in this case.
• Using the target variable BAD, count the percentage of actual responders in each bin.
If the model is useful, the proportion of responders (defaulters) will be relatively high in bins where the predicted probability of response is high. The cumulative response curve shown above shows the percentage of respondents in the top 10%, top 20%, top 30%, and so on. In the top 10%, over 80% of the people were defaulters. In the top 20%, the proportion of defaulters has dropped to just over 72% of the people. The horizontal line represents the baseline rate (approximately 20%) for comparison purposes, which is an estimate of the percentage of defaulters that you would expect if you were to take a random sample.
The plot above represents cumulative percentages, but you can also see the proportion of responders in each bin by selecting the radio button next to Non-Cumulative on the left side of the graph. Select the radio button next to Non-Cumulative and inspect the plot. (Slide figures: Cumulative %Response and Non-Cumulative %Response charts.)

The Non-Cumulative chart shows that once you get beyond the 20th percentile for predicted probability, the default rate is lower than what you would expect from a random sample.

Select the Cumulative button and then select Lift Value. Lift charts plot the same information on a different scale. Recall that the population response rate is about 20%. A lift chart can be obtained by dividing the response rate in each percentile by the population response rate. The lift chart, therefore, plots relative improvement over baseline. A code sketch of these calculations follows.
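The same quantities can be computed with ordinary SAS code. The following is a minimal sketch, not Enterprise Miner's own implementation; it assumes the scored validation data are available in a data set with the hypothetical name SCORED, containing the actual target BAD and the predicted probability P_BAD1 written by the Tree node.

/* Bin the scored observations into deciles of predicted risk.     */
/* With DESCENDING, group 0 holds the 10% with the highest P_BAD1. */
proc rank data=scored out=ranked groups=10 descending;
   var p_bad1;
   ranks decile;
run;

/* Baseline: the overall default rate (roughly 20% here). */
proc means data=ranked noprint;
   var bad;
   output out=base mean=baseline;
run;

/* %Response per decile: the observed default rate in each bin. */
proc means data=ranked noprint nway;
   class decile;
   var bad;
   output out=rates mean=pct_response;
run;

/* Lift = decile response rate / baseline rate (e.g., 0.82/0.20 = 4.1). */
data lift;
   if _n_ = 1 then set base(keep=baseline);
   set rates;
   lift = pct_response / baseline;
run;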


[Charts: Cumulative %Response and Cumulative Lift Value]

Recall that the percentage of defaulters in the top 10% was 82.06%. Dividing 82.06% by the baseline rate of about 20% gives a number slightly higher than 4, which indicates that you would expect over 4 times as many defaulters in this top group as you would get from a simple random sample of the same size.


Instead of asking the question "What percentage of observations in a bin were responders?", you could ask the question "What percentage of the total number of responders are in a bin?" This can be evaluated using the Captured Response curve. To inspect this curve, select the radio button next to %Captured Response. Use the View Info tool to evaluate how the model performs. A code sketch of this computation follows.
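Continuing the hypothetical sketch above (reusing the RANKED data set and its DECILE variable), the captured response divides the defaulters in each bin by the total number of defaulters; this illustrates the chart's logic and is not Enterprise Miner's own code.

/* %Captured Response per decile: the share of all defaulters in each */
/* bin; accumulating over the top k bins traces the cumulative curve. */
proc sql;
   create table captured as
   select decile,
          sum(bad) as defaulters,
          sum(bad) / (select sum(bad) from ranked) as pct_captured
   from ranked
   group by decile
   order by decile;
quit;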

Observe that if the percentage of applications chosen for rejection were approximately
• 20%, you would have identified about 70% of the people who would have defaulted (a lift of about 3.5×)
• 40%, you would have identified over 80% of the people who would have defaulted (a lift of over 2×).

Close the Lift Chart and Assessment Tool windows.

Interactive Training

Decision tree splits are selected on the basis of an analytic criterion. Sometimes it is necessary or desirable to select splits instead on the basis of a practical business criterion. For example, the best split for a particular node may be on an input that is difficult or expensive to obtain. If a competing split on an alternative input has similar worth and is cheaper and easier to obtain, it makes sense to use the alternative input for the split at that node. Likewise, splits may be selected that are statistically optimal but in conflict with an existing business practice. For example, the credit department may treat applications where the debt-to-income ratio is not available differently from those where this information is available. You can incorporate this type of business rule into your decision tree using interactive training in the Tree node.
It might then be interesting to compare the statistical results of the original tree with those of the changed tree. To accomplish this, first make a copy of the Default Tree node.

1. Select the Default Tree node with the right mouse button and then select Copy.
2. Move your cursor to an empty place above the Default Tree node, right-click, and select Paste. Rename this node Interactive Tree.
3. Connect the Interactive Tree node to the Data Partition node and the Assessment node as shown.

4. Right-click on the Interactive Tree node and select Interactive…. The Interactive Training window opens.

5. Select View → Tree from the menu bar.


The default decision tree is displayed.

Your goal is to modify the initial split so that one branch contains all the applications with missing debt-to-income data and the other branch contains the rest of the applications. From this initial split, you will use the decision tree's analytic method to grow the remainder of the tree.

1. Select the Create Rule icon on the toolbar.

2. Select the root node of the tree. The Create Rule window opens, listing potential splitting variables and a measure of the worth of each input.

3. Select the row corresponding to DEBTINC.
4. Select Modify Rule. The Modify Interval Variable Splitting Rule window opens.


5. Select the row for range 2.
6. Select Remove range. The split is now defined to put all nonmissing values of DEBTINC into node 1 and all missing values of DEBTINC into node 2.
7. Select OK to close the Modify Interval Variable Splitting Rule window.
8. Select OK in the Create Rule window. The Create Rule window closes, and the tree diagram is updated as shown.

The left node contains the nonmissing values of DEBTINC, and the right node contains only missing values of DEBTINC.

9. Close the tree diagram and the Interactive Training window.


10. Select Yes to save the tree as input for subsequent training.

11. Run the modified Interactive Tree node and view the results. The selected tree has 11 nodes. Its validation accuracy is 88.71%.

To compare the tree models:

1. Close the Results window.
2. To rename the new model, right-click on the Interactive Tree node and select Model Manager….
3. Change the name from Untitled to Interactive.


4. Close the Model Manager window. Right-click on the Assessment node and select Results….
5. Enter Default as the name for the default tree model (currently Untitled).

6. Click and drag to select all three rows that correspond to the tree models.
7. Select Tools → Lift Chart.
8. Select Format → Model Name.

Note: You may have to maximize the window or resize the legend in order to see the entire legend.


The performance of the three tree models is not appreciably different. Close the lift chart when you are finished inspecting the results.


Consequences of a Decision

                Decision 1         Decision 0
   Actual 1     True Positive      False Negative
   Actual 0     False Positive     True Negative


To choose the appropriate threshold for classifying observations as positive or negative, the cost of misclassification must be considered. In the home equity line of credit example, you are modeling the probability of a default, which is coded as a 1. Therefore, Enterprise Miner sets up the profit matrix as shown above.

Example

Recall the home equity line of credit scoring example. Presume that every two dollars loaned eventually returns three dollars if the loan is paid off in full.


Assume that every two dollars loaned returns three dollars if the borrower does not default. Rejecting a good loan for two dollars forgoes the expected dollar of profit. Accepting a bad loan for two dollars forgoes the two-dollar loan itself (assuming that the default occurs early in the repayment period).


Consequences of a Decision

                Decision 1                     Decision 0
   Actual 1     True Positive                  False Negative (cost = $2)
   Actual 0     False Positive (cost = $1)     True Negative


The costs of misclassification are shown in the table.

Bayes Rule

θ = 1 / (1 + cost of false negative / cost of false positive)


One way to determine the appropriate threshold is a theoretical approach that uses the plug-in Bayes rule. Using simple decision theory, the optimal threshold is given by θ above. Using the cost structure defined for the home equity example, the optimal threshold is 1/(1 + $2/$1) = 1/3. That is, reject all applications whose predicted probability of default exceeds 0.33.
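For reference, here is a short derivation of this threshold from expected costs. This is a sketch in which the symbols C_FP (cost of a false positive) and C_FN (cost of a false negative) are introduced for this document and are not Enterprise Miner notation:

% For an applicant with predicted default probability p:
%   E[cost | reject] = (1 - p) C_FP   (you may be rejecting a good loan)
%   E[cost | accept] = p C_FN         (you may be accepting a bad loan)
% Rejecting is the better decision when
\[
(1-p)\,C_{FP} < p\,C_{FN}
\iff
p > \frac{C_{FP}}{C_{FP}+C_{FN}} = \frac{1}{1+C_{FN}/C_{FP}} = \theta .
\]
% With C_FN = $2 and C_FP = $1, this gives theta = 1/(1 + 2/1) = 1/3.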


Consequences of a Decision

                Decision 1                      Decision 0
   Actual 1     True Positive (profit = $2)     False Negative
   Actual 0     False Positive (profit = -$1)   True Negative


You can obtain the same result using the Assessment node in Enterprise Miner by using the profit matrix to specify the profit associated with the level of the response being modeled (in this case, a loan default, or a 1). As a bonus, you can estimate the fraction of loan applications you must reject when using the selected threshold.


Choosing a Decision Threshold

First, consider the decision threshold determined theoretically.

1. Return to the Project1 diagram, open the Default Tree node, and select the Score tab.
2. Check the box next to Training, Validation, and Test. This adds predicted values to the data sets.

3. Close the Tree node, saving changes when prompted.


4. Add an Insight node after the Default Tree node.

5. Open the Insight node.
6. In the Data tab, select the Select… button to see a list of predecessor data sets.
7. Choose the validation data set from the Default Tree node.

8. Select OK.
9. In the Data tab of the Insight Settings window, select the radio button next to Entire Data Set so that Insight will use the entire validation data set.


10. Close the node, saving changes when prompted.
11. Run the Insight node and view the results when prompted.

One of the new variables in the data set is P_BAD1, which is the predicted probability of a target value of 1 (a loan default). To sort the data set based on this variable:

12. Click on the triangle in the top left of the data table and select Sort….

13. In the Sort window, select P_BAD1 → Y.


14. Highlight P_BAD1 in the Y column and select Asc/Des to sort in descending order.
15. Select OK.
16. Scroll through the data table and note that 380 of the observations have a predicted probability of default greater than 1/3.
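Steps 12 through 16 can also be reproduced outside of the Insight window. This is a minimal sketch in ordinary SAS code, again assuming the scored validation data set carries the hypothetical name SCORED:

/* Sort by predicted default probability, highest risk first. */
proc sort data=scored out=sorted;
   by descending p_bad1;
run;

/* Count applications above the 1/3 threshold; this should report */
/* about n_reject=380 of total=1967, or roughly 19%.              */
data _null_;
   set sorted end=last;
   total + 1;
   if p_bad1 > 1/3 then n_reject + 1;
   if last then do;
      pct = n_reject / total;
      put n_reject= total= pct= percent8.1;
   end;
run;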

Therefore, based on the theoretical approach, 380 out of 1967 applications, or approximately 19%, should be rejected.

You can obtain the same result using the Assessment node.

1. Close the Insight data table.
2. Right-click on the Assessment node and select Results….
3. Select the default model in the Assessment node.


4. Select Tools → Lift Chart from the menu bar.
5. In the bottom-left corner of the Lift Chart window, select Edit… to define a target profile.
6. In the Editing Assessment Profile for BAD window, right-click in the open area where the vectors and matrices are listed and select Add.


7. Highlight the new Profit matrix and enter the profit values established earlier: $2 for decision 1 when the target is 1, -$1 for decision 1 when the target is 0, and 0 for either outcome under decision 0.

Note: For credit screening, a target value of 1 implies a default and, thus, a loss. A target value of 0 implies a repaid loan and, thus, a profit. The fixed cost of processing each loan application is insubstantial and is taken to be zero.
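This profit matrix leads to the same 1/3 threshold as the cost-based Bayes rule; here is a one-line check using the matrix values above:

% Expected profit of rejecting an application with predicted default
% probability p: $2 per true positive, -$1 per false positive, and
% neither profit nor cost under decision 0.
\[
E[\text{profit} \mid \text{reject}] = 2p - 1\,(1-p) = 3p - 1 > 0
\iff p > \tfrac{1}{3}.
\]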

8. Right-click on the Profit matrix and select Set to Use. The profit matrix is now active, as indicated by the asterisk next to it.

9. Close the Profit Matrix Definition window, saving changes when prompted.
10. Select Apply.

11. Select the Profit radio button.
12. Select the Non-Cumulative radio button.


The plot shows the actual profit for each percentile of loan applications as ranked by the decision tree model. Percentiles up to and including the 20th percentile show a profit for rejecting the applicants. Therefore, it makes sense to reject the top 20% of loan applications. This agrees with the results obtained theoretically.
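As a rough illustration only (Enterprise Miner's own computation differs, as the note below explains), the raw profit of rejecting each decile could be sketched from the hypothetical RANKED data set built earlier; unlike the chart in the software, this raw figure can go negative in the lower deciles:

/* Profit of rejecting each decile: $2 per rejected defaulter minus */
/* $1 per rejected good loan, per the profit matrix defined above.  */
proc sql;
   create table decile_profit as
   select decile,
          sum(2*bad - (1 - bad)) as profit
   from ranked
   group by decile
   order by decile;
quit;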

Note: In Enterprise Miner, the Non-Cumulative profit chart never dips below zero. This is because a cutoff value is chosen, and there is no cost below this level because no action is taken. As a result, the cumulative profit chart can be misleading.
