Stern’s MBA 1 students expect to make the big bucks after graduation!
Data File: salary.doc
The Stern School of Business is a very reputable school. Students generally attend
it not only to broaden their knowledge of business but also to increase their
salary. I am interested in finding out what leads MBA 1 students to believe that they are going
to earn a certain salary after graduation. In other words, what factors affect the expected
salary of students? As I could not find enough meaningful data on the topic, I decided to
conduct my own survey. I was actually very surprised to see how cooperative students
were in filling out my survey when I gave them candy in return!
Here are some descriptive statistics for all 40 MBA 1 students surveyed. Variables include
expected salary after graduation (Expected), age of the person (age), the number of years
of work experience in the industry chosen for after graduation (num. of), whether the person
plans to work in finance (plan fin), in consulting (plan con), in marketing (plan mar) or in other
industries (plan oth), whether the person is unsure about the industry (unsure), whether the person
plans to work for a Fortune 500 company (500 comp), the number of hours he/she plans
to work per week (Hrs. of), and his/her past salary (past sal).
Descriptive Statistics
Variable N Mean Median Tr Mean StDev SE Mean
Expected 40 84250 80000 83333 17213 2722
age 40 26.825 26.000 26.694 2.490 0.394
num. of 40 1.725 0.000 1.500 2.449 0.387
plan fin 40 0.6750 1.0000 0.6944 0.4743 0.0750
plan con 40 0.2500 0.0000 0.2222 0.4385 0.0693
plan mar 40 0.0750 0.0000 0.0278 0.2667 0.0422
plan oth 40 0.00000 0.00000 0.00000 0.00000 0.00000
unsure 40 0.00000 0.00000 0.00000 0.00000 0.00000
500 comp 40 0.5000 0.5000 0.5000 0.5064 0.0801
Hrs. of 40 70.62 70.00 70.42 12.57 1.99
past sal 40 54250 48500 51667 20664 3267
Variable Min Max Q1 Q3
Expected 60000 125000 70000 97500
age 23.000 33.000 25.000 29.000
num. of 0.000 8.000 0.000 3.000
plan fin 0.0000 1.0000 0.0000 1.0000
plan con 0.0000 1.0000 0.0000 0.7500
plan mar 0.0000 1.0000 0.0000 0.0000
plan oth 0.00000 0.00000 0.00000 0.00000
unsure 0.00000 0.00000 0.00000 0.00000
500 comp 0.0000 1.0000 0.0000 1.0000
Hrs. of 50.00 95.00 60.00 80.00
past sal 30000 125000 40000 60000
At first glance, a few things stand out in the data. None of the 40
students claims to be unsure about which industry to enter after graduation, and
none of them plans to work in an industry other than finance, consulting or
marketing. Moreover, the average expected salary of $84,250 is much higher than the
actual 1996 salary of graduating MBA students, $70,000 (footnote). This
discrepancy could mean that MBA 1 students are rather optimistic.
Salaries are often right-tailed. Let’s check the distributions of both expected and
past salaries, and look for potential outliers in each:
[Figures: plots of expected salary and past salary]
There are apparently 3 outliers among the past salary observations.
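As a rough cross-check on the outlier count, the usual boxplot rule (flag anything more than 1.5 interquartile ranges beyond the quartiles) can be applied to the quartiles in the descriptive statistics. This is only a sketch in Python, which is not part of the original Minitab analysis; the 1.5 multiplier is the conventional boxplot default, assumed here.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) boxplot fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Min, Q1, Q3 and Max copied from the descriptive statistics table.
summary = {
    "Expected": {"min": 60000, "q1": 70000, "q3": 97500, "max": 125000},
    "past sal": {"min": 30000, "q1": 40000, "q3": 60000, "max": 125000},
}

for name, s in summary.items():
    low, high = iqr_fences(s["q1"], s["q3"])
    has_outliers = s["min"] < low or s["max"] > high
    print(name, low, high, has_outliers)
```

Under this rule the upper fence for past salary is $90,000, well below the maximum of $125,000, while the expected salaries all fall inside their fences, which is consistent with outliers showing up only in the past salaries.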
Now, let’s look at the distribution of the different industries students plan to work in.
[Figures: expected salary plotted against plan fin., plan cons. and plan mark. (0 = no, 1 = yes)]
Only 3 of the 40 students (a mean of 0.075) plan to work in marketing, so this
variable is unlikely to carry much weight. Students who plan to work in finance are
coded 1. It is interesting to see that students who plan to work in finance and those
who do not have the same median expected salary ($80,000). It also looks like the
plan finance and plan consulting indicators are negatively correlated.
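The size of that negative correlation can be estimated from the means alone, assuming each student picked exactly one industry (the three means, 0.675 + 0.25 + 0.075, add up to 1, which suggests this). For two 0/1 indicators that are never both 1, the covariance is simply minus the product of their means. A minimal Python sketch:

```python
import math

def exclusive_dummy_corr(p1, p2):
    """Correlation between two 0/1 indicators that are never both equal to 1."""
    # E[x*y] = 0 for mutually exclusive indicators, so Cov(x, y) = -p1 * p2.
    return -(p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))

# Means of plan fin and plan con from the descriptive statistics.
r = exclusive_dummy_corr(0.675, 0.25)
print(round(r, 2))
```

A correlation of roughly -0.8 between the two indicators is in line with the fairly high VIF values that Minitab reports for them when both are in the model.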
Regression Analysis
* plan mark. is highly correlated with other X variables
* plan mark. has been removed from the equation
* plan other has all values = 0
* plan other has been removed from the equation
* unsure has all values = 0
* unsure has been removed from the equation
The regression equation is
Expected salary = - 44885 + 3352 age + 647 num. of yrs. of exp.
- 2591 plan fin. + 4025 plan cons. + 3171 500 comp?
+ 432 Hrs. of work + 0.125 past salary
Predictor Coef StDev T P VIF
Constant -44885 18068 -2.48 0.018
age 3351.8 849.5 3.95 0.000 3.6
num. of 647.5 514.7 1.26 0.217 1.3
plan fin -2591 5082 -0.51 0.614 4.6
plan con 4025 5344 0.75 0.457 4.4
500 comp 3171 2887 1.10 0.280 1.7
Hrs. of 431.6 122.2 3.53 0.001 1.9
past sal 0.12506 0.09684 1.29 0.206 3.2
S = 6999 R-Sq = 86.4% R-Sq(adj) = 83.5%
Analysis of Variance
Source DF SS MS F P
Regression 7 9988126411 1426875202 29.13 0.000
Error 32 1567373589 48980425
Total 39 11555500000
Source DF Seq SS
age 1 8185071089
num. of 1 75586010
plan fin 1 531542752
plan con 1 158051549
500 comp 1 84132640
Hrs. of 1 872058157
past sal 1 81684215
Unusual Observations
Obs age Expected Fit StDev Fit Residual St Resid
16 25.0 70000 86541 3156 -16541 -2.65R
34 27.0 120000 101312 3493 18688 3.08R
R denotes an observation with a large standardized residual
Durbin-Watson statistic = 1.64
We can see that Minitab directly gets rid of 3 variables. These are the indicators for students
planning to work in marketing, students planning to work in other industries, and students who are
unsure which industry they will work in. I would also remove the plan consulting
indicator because it is highly negatively correlated with the plan finance indicator.
The overall regression is statistically significant. However, several variables have P-values
over .05.
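The summary figures printed with this first regression (S, R-Sq, R-Sq(adj) and F) all follow from the Analysis of Variance table. The short Python sketch below recomputes them from the sums of squares reported above; it is only a cross-check, not part of the Minitab session.

```python
import math

n, p = 40, 7                 # observations and predictors in the first model
ss_reg = 9988126411          # Regression SS from the ANOVA table
ss_err = 1567373589          # Error SS from the ANOVA table
ss_tot = ss_reg + ss_err     # Total SS

ms_reg = ss_reg / p                    # mean square for regression
ms_err = ss_err / (n - p - 1)          # mean square error
f_stat = ms_reg / ms_err               # overall F statistic
s = math.sqrt(ms_err)                  # standard error of the regression
r_sq = ss_reg / ss_tot                 # R-squared
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - p - 1)  # adjusted R-squared

print(round(f_stat, 2), round(s), round(100 * r_sq, 1), round(100 * r_sq_adj, 1))
```

The adjusted R-squared is lower than the raw R-squared because it penalizes the seven predictors; this is why it is the more useful figure when comparing this model with the smaller ones that follow.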
[Figures: Residuals Versus the Fitted Values; Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals; Histogram of the Residuals (response is Expected)]
The distribution of the residuals looks normal. However, we can notice a couple of outliers.
Now, let’s try a regression with logged salaries for past and expected while keeping the
same variables.
Regression Analysis
* plan mark. is highly correlated with other X variables
* plan mark. has been removed from the equation
* plan other has all values = 0
* plan other has been removed from the equation
* unsure has all values = 0
* unsure has been removed from the equation
The regression equation is
log expected = 4.26 + 0.0178 age + 0.00289 num. of yrs. of exp.
- 0.0137 plan fin. + 0.0190 plan cons. + 0.0125 500 comp?
+ 0.00202 Hrs. of work + 0.000001 past salary
Predictor Coef StDev T P VIF
Constant 4.26054 0.08967 47.52 0.000
age 0.017759 0.004216 4.21 0.000 3.6
num. of 0.002887 0.002554 1.13 0.267 1.3
plan fin -0.01366 0.02522 -0.54 0.592 4.6
plan con 0.01899 0.02652 0.72 0.479 4.4
500 comp 0.01251 0.01433 0.87 0.389 1.7
Hrs. of 0.0020215 0.0006063 3.33 0.002 1.9
past sal 0.00000057 0.00000048 1.18 0.246 3.2
S = 0.03473 R-Sq = 86.2% R-Sq(adj) = 83.2%
Analysis of Variance
Source DF SS MS F P
Regression 7 0.241162 0.034452 28.56 0.000
Error 32 0.038600 0.001206
Total 39 0.279762
Source DF Seq SS
age 1 0.201723
num. of 1 0.001471
plan fin 1 0.012164
plan con 1 0.003731
500 comp 1 0.001379
Hrs. of 1 0.019007
past sal 1 0.001687
Unusual Observations
Obs age log expe Fit StDev Fit Residual St Resid
16 25.0 4.84510 4.92668 0.01566 -0.08158 -2.63R
34 27.0 5.07918 4.99767 0.01733 0.08151 2.71R
R denotes an observation with a large standardized residual
Durbin-Watson statistic = 2.01
Logging the salaries does not change the regression model significantly. Thus, I will keep
the original (unlogged) salaries.
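For reference, the coefficients of the logged model can be read as approximate percentage effects. The logs appear to be base 10 (observation 16 has an expected salary of 70000 and a logged value of 4.84510), so a coefficient b corresponds to a multiplicative change of 10^b per unit of the predictor. A small Python sketch of that conversion, under that base-10 assumption:

```python
import math

# Check the base: log10(70000) should match the 4.84510 shown for observation 16.
log_check = math.log10(70000)

def pct_change_per_unit(coef_log10):
    """Percent change in the response per one-unit increase in a predictor."""
    return (10 ** coef_log10 - 1) * 100

age_effect = pct_change_per_unit(0.017759)     # coefficient on age
hours_effect = pct_change_per_unit(0.0020215)  # coefficient on Hrs. of

print(round(log_check, 5), round(age_effect, 1), round(hours_effect, 2))
```

Read this way, each additional year of age goes with an expected salary about 4% higher, and each additional weekly hour with one about 0.5% higher, which tells the same story as the unlogged model.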
Let’s see how the regression looks without the plan consulting variable.
Regression Analysis
The regression equation is
Expected salary = - 46346 + 3477 age + 574 num. of yrs. of exp.
- 5884 plan fin. + 2553 500 comp? + 466 Hrs. of work
+ 0.113 past salary
Predictor Coef StDev T P VIF
Constant -46346 17846 -2.60 0.014
age 3476.8 827.6 4.20 0.000 3.4
num. of 573.8 501.9 1.14 0.261 1.2
plan fin -5884 2573 -2.29 0.029 1.2
500 comp 2553 2750 0.93 0.360 1.6
Hrs. of 465.9 112.6 4.14 0.000 1.6
past sal 0.11303 0.09489 1.19 0.242 3.1
S = 6953 R-Sq = 86.2% R-Sq(adj) = 83.7%
Analysis of Variance
Source DF SS MS F P
Regression 6 9960345863 1660057644 34.34 0.000
Error 33 1595154137 48338004
Total 39 11555500000
Source DF Seq SS
age 1 8185071089
num. of 1 75586010
plan fin 1 531542752
500 comp 1 22521046
Hrs. of 1 1077031091
past sal 1 68593876
Unusual Observations
Obs age Expected Fit StDev Fit Residual St Resid
10 33.0 110000 111467 5049 -1467 -0.31 X
16 25.0 70000 86410 3131 -16410 -2.64R
34 27.0 120000 101124 3461 18876 3.13R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 1.67
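The effect of dropping plan consulting shows up clearly in the VIF of plan finance, which falls from 4.6 to 1.2. Since VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors, the VIF values can be turned back into those R² values. A minimal Python sketch of the conversion:

```python
def r_squared_from_vif(vif):
    """R-squared of a predictor regressed on the other predictors, given its VIF."""
    return 1 - 1 / vif

with_plan_con = r_squared_from_vif(4.6)     # plan fin, seven-predictor model
without_plan_con = r_squared_from_vif(1.2)  # plan fin, plan consulting removed

print(round(with_plan_con, 2), round(without_plan_con, 2))
```

In other words, about 78% of the variation in plan finance was explained by the other predictors while plan consulting was in the model, against about 17% afterwards.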
[Figures: Histogram of the Residuals; Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals (response is Expected), with observations 16 and 34 labeled]
Note that the outliers are still here. Now, let’s run a best subsets regression to find out which variables are best to
choose for our model.
Best Subsets Regression
Response is Expected
Predictor columns, left to right: age, num. of, plan fin, 500 comp, Hrs. of, past sal
Vars R-Sq R-Sq(adj) C-p S (X = predictor included)
1 70.8 70.1 33.7 9417.8 X
1 57.8 56.7 64.8 11325 X
2 81.2 80.2 10.9 7660.7 X X
2 75.5 74.1 24.6 8752.2 X X
3 84.7 83.4 4.6 7012.4 X X X
3 83.2 81.8 8.1 7334.3 X X X
4 85.3 83.6 5.2 6971.0 X X X X
4 85.2 83.5 5.5 7000.3 X X X X
5 85.8 83.8 5.9 6938.4 X X X X X
5 85.6 83.5 6.3 6983.8 X X X X X
6 86.2 83.7 7.0 6952.6 X X X X X X
My choice is between the two possibilities in bold. One reason is that they have a small S
and a relatively high R-Sq. Another reason is that C-p should be approximately p+1 = 6.
Thus, I picked the one that has a C-p of 5.5 and an S of 7000.3. Here is the new regression:
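The C-p column can be reproduced from the regression outputs in this report. Mallows' C-p for a subset with p predictors is SSE_p / MSE_full - n + 2(p + 1), where MSE_full is the mean square error of the model containing all six candidate predictors. The Python sketch below plugs in the Error SS values printed for the three- and four-predictor models and the Error MS of the six-predictor model; it is a cross-check rather than part of the Minitab run.

```python
def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' C-p for a subset model with p predictors plus a constant."""
    return sse_subset / mse_full - n + 2 * (p + 1)

n = 40
mse_full = 48338004   # Error MS of the six-predictor model

cp_3 = mallows_cp(1770251214, mse_full, n, 3)  # age, plan fin, Hrs. of
cp_4 = mallows_cp(1715135581, mse_full, n, 4)  # age, plan fin, past sal, Hrs. of

print(round(cp_3, 1), round(cp_4, 1))
```

These come out at about 4.6 and 5.5, matching the rows with S = 7012.4 and S = 7000.3 in the best subsets table.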
Regression Analysis
The regression equation is
Expected salary = - 57732 + 4034 age - 6828 plan fin. + 0.0975 past salary
+ 468 Hrs. of work
Predictor Coef StDev T P VIF
Constant -57732 16415 -3.52 0.001
age 4034.1 753.2 5.36 0.000 2.8
plan fin -6828 2392 -2.85 0.007 1.0
past sal 0.09755 0.09198 1.06 0.296 2.9
Hrs. of 468.5 111.7 4.19 0.000 1.6
S = 7000 R-Sq = 85.2% R-Sq(adj) = 83.5%
Analysis of Variance
Source DF SS MS F P
Regression 4 9840364419 2460091105 50.20 0.000
Error 35 1715135581 49003874
Total 39 11555500000
Source DF Seq SS
age 1 8185071089
plan fin 1 536172676
past sal 1 257832450
Hrs. of 1 861288204
Unusual Observations
Obs age Expected Fit StDev Fit Residual St Resid
8 29.0 125000 111564 2946 13436 2.12R
10 33.0 110000 115013 4497 -5013 -0.93 X
16 25.0 70000 87329 2846 -17329 -2.71R
34 27.0 120000 101545 3129 18455 2.95R
39 30.0 100000 107987 4449 -7987 -1.48 X
40 31.0 110000 114851 4466 -4851 -0.90 X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 1.57
Past salary still has a P-value above .05, so I decided to take this variable out of the
regression.
Regression Analysis
The regression equation is
Expected salary = - 69455 + 4583 age - 6843 plan fin. + 501 Hrs. of work
Predictor Coef StDev T P
Constant -69455 12157 -5.71 0.000
age 4583.5 547.7 8.37 0.000
plan fin -6843 2396 -2.86 0.007
Hrs. of 500.9 107.7 4.65 0.000
S = 7012 R-Sq = 84.7% R-Sq(adj) = 83.4%
Analysis of Variance
Source DF SS MS F P
Regression 3 9785248786 3261749595 66.33 0.000
Error 36 1770251214 49173645
Total 39 11555500000
Source DF Seq SS
age 1 8185071089
plan fin 1 536172676
Hrs. of 1 1064005021
Durbin-Watson statistic = 1.64
The Durbin-Watson statistic suggests no autocorrelation, which confirms what the
residuals versus order plot shows.
Unusual Observations
Obs age Expected Fit StDev Fit Residual St Resid
8 29.0 125000 111047 2910 13953 2.19R
10 33.0 110000 116859 4154 -6859 -1.21 X
16 25.0 70000 87704 2829 -17704 -2.76R
34 27.0 120000 101880 3118 18120 2.88R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
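For readers unfamiliar with the Durbin-Watson statistic quoted with each regression: it is the sum of squared differences between successive residuals divided by the sum of squared residuals, so values near 2 indicate no first-order autocorrelation, values near 0 positive autocorrelation, and values near 4 negative autocorrelation. The Python sketch below uses made-up residuals purely for illustration, since the survey residuals themselves are not listed in this report.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals taken in observation order."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual sequences, not the survey data.
alternating = [1.0, -1.0, 1.0, -1.0]           # sign flips every observation
drifting = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # long runs of the same sign

print(durbin_watson(alternating))           # well above 2: negative autocorrelation
print(round(durbin_watson(drifting), 2))    # well below 2: positive autocorrelation
```

Since the surveys were not collected in any meaningful time order, a value reasonably close to 2 is what one would hope to see here.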
[Figures: Residuals Versus the Order of the Data; Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals; Histogram of the Residuals (response is Expected), with observations 16 and 34 labeled]
Now we have a statistically significant model in which every predictor has a P-value below .05. However, two outliers are still
visible in the residual plots. We can try to get rid of these 2 outliers (observations 34 and 16).
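The standardized residuals that flag observations 16 and 34 can be reproduced from the Unusual Observations table of the three-predictor model. A standardized residual is the raw residual divided by S * sqrt(1 - h), where the leverage h can be recovered as (StDev Fit / S)². The Python sketch below takes the residual as observed minus fitted, and uses S = 7012.4 from the best subsets table; the results should agree in magnitude with the St Resid column.

```python
import math

def leverage(stdev_fit, s):
    """Leverage of an observation, recovered from StDev Fit and S."""
    return (stdev_fit / s) ** 2

def standardized_residual(observed, fit, stdev_fit, s):
    """Residual (observed minus fit) scaled by its estimated standard deviation."""
    h = leverage(stdev_fit, s)
    return (observed - fit) / (s * math.sqrt(1 - h))

s = 7012.4  # S of the model with age, plan fin and Hrs. of

obs_16 = standardized_residual(70000, 87704, 2829, s)
obs_34 = standardized_residual(120000, 101880, 3118, s)

print(round(obs_16, 2), round(obs_34, 2))
```

Both values are well beyond 2 in absolute value, which is the threshold Minitab uses for its R flag: observation 16 expects far less than the model predicts, and observation 34 far more.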
Regression Analysis
The regression equation is
Expected salary = - 65739 + 4508 age - 6676 plan fin. + 475 Hrs. of work
Predictor Coef StDev T P VIF
Constant -65739 10088 -6.52 0.000
age 4508.2 474.3 9.51 0.000 1.6
plan fin -6676 2071 -3.22 0.003 1.0
Hrs. of 474.7 100.6 4.72 0.000 1.6
S = 5742 R-Sq = 88.9% R-Sq(adj) = 87.9%
Analysis of Variance
Source DF SS MS F P
Regression 3 8941383570 2980461190 90.41 0.000
Error 34 1120826956 32965499
Total 37 10062210526
Source DF Seq SS
age 1 7937259218
plan fin 1 270764351
Hrs. of 1 733360001
Unusual Observations
Obs age Expected Fit StDev Fit Residual St Resid
8 29.0 125000 110091 2846 14909 2.99R
10 33.0 110000 116257 3415 -6257 -1.36 X
13 24.0 70000 80430 2798 -10430 -2.08R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 1.65
[Figures: Residuals Versus the Order of the Data; Residuals Versus the Fitted Values; Normal Probability Plot of the Residuals (response is Expected), with observation 8 labeled]
Regression Analysis
The regression equation is
Expected salary = - 63050 + 4653 age - 4558 plan fin. + 351 Hrs. of work
Predictor Coef StDev T P VIF
Constant -63050 8827 -7.14 0.000
age 4652.8 415.5 11.20 0.000 1.6
plan fin -4558 1908 -2.39 0.023 1.1
Hrs. of 351.10 94.81 3.70 0.001 1.7
S = 5004 R-Sq = 90.1% R-Sq(adj) = 89.2%
Analysis of Variance
Source DF SS MS F P
Regression 3 7482915093 2494305031 99.63 0.000
Error 33 826165988 25035333
Total 36 8309081081
Source DF Seq SS
age 1 7066013767
plan fin 1 73551726
Hrs. of 1 343349601
Unusual Observations
Obs age Expected Fit StDev Fit Residual St Resid
9 33.0 110000 115070 2996 -5070 -1.27 X
34 27.0 70000 80840 1093 -10840 -2.22R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 2.00
[Figures: Residuals Versus the Fitted Values; Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals (response is Expected)]
Heteroscedasticity is non-constant variance. It appears that there is non-constant
variance in the plot of the residuals versus the fitted values. But to explore that aspect further, we
would have to do a Levene’s test. Hopefully, logging the response will take care of this.
Leverage values should be less than 2.5*(p+1)/n = 2.5*(4+1)/37 = .34.
One leverage value is about .35 (*).
Cook’s distance should be less than 1, which is true.
Let’s do a regression with logged expected salaries.
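The leverage check can be reproduced from the printed output. The cutoff uses the 2.5*(p+1)/n rule with p = 4 and n = 37 as stated in the text, and the leverage of the observation flagged with an X (observation 9) can be recovered as (StDev Fit / S)², using StDev Fit = 2996 and S = 5004 from the last regression. This is only a sketch in Python, not Minitab output.

```python
def leverage_cutoff(p, n, multiplier=2.5):
    """Rule-of-thumb leverage cutoff: multiplier * (p + 1) / n."""
    return multiplier * (p + 1) / n

def leverage_from_fit(stdev_fit, s):
    """Leverage recovered from the StDev Fit column and the regression S."""
    return (stdev_fit / s) ** 2

cutoff = leverage_cutoff(4, 37)          # p and n as used in the text
h_obs9 = leverage_from_fit(2996, 5004)   # observation 9, flagged X

print(round(cutoff, 3), round(h_obs9, 3))
```

The recovered leverage for observation 9 sits very close to the cutoff, which is consistent with Minitab flagging it as an influential X value.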
Regression Analysis
The regression equation is
logexp = 4.18 + 0.0231 age - 0.0256 plan fin. + 0.00186 Hrs. of work
Predictor Coef StDev T P VIF
Constant 4.18204 0.04868 85.92 0.000
age 0.023056 0.002291 10.06 0.000 1.6
plan fin -0.02560 0.01052 -2.43 0.021 1.1
Hrs. of 0.0018643 0.0005228 3.57 0.001 1.7
S = 0.02759 R-Sq = 88.3% R-Sq(adj) = 87.2%
Analysis of Variance
Source DF SS MS F P
Regression 3 0.188991 0.062997 82.74 0.000
Error 33 0.025126 0.000761
Total 36 0.214116
Source DF Seq SS
age 1 0.176882
plan fin 1 0.002427
Hrs. of 1 0.009681
Unusual Observations
Obs age logexp Fit StDev Fit Residual St Resid
8 25.0 4.90309 4.85166 0.01077 0.05143 2.02R
9 33.0 5.04139 5.07340 0.01652 -0.03201 -1.45 X
20 25.0 4.77815 4.83539 0.00838 -0.05724 -2.18R
34 27.0 4.84510 4.90015 0.00603 -0.05505 -2.04R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 2.34
[Figures: Residuals Versus the Fitted Values; Residuals Versus the Order of the Data; Normal Probability Plot of the Residuals (response is logexp)]
It appears that there is still non-constant variance. The only reasonable thing to do now is
weighted least squares.
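To illustrate what a weighted least squares fit would involve, here is a minimal Python sketch for a single predictor. Each observation gets a weight, typically the reciprocal of its assumed error variance, so that noisier observations count for less. The data and weights below are hypothetical and chosen only for illustration; they are not the survey data, and no weighting scheme was actually fitted in this report.

```python
def weighted_least_squares(x, y, w):
    """Fit y = a + b*x by minimizing sum(w * (y - a - b*x)**2); return (a, b)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical ages, expected salaries and assumed error standard deviations.
ages = [24, 25, 26, 27, 29, 31, 33]
salaries = [68000, 72000, 79000, 82000, 95000, 104000, 112000]
assumed_sd = [3000, 3000, 4000, 4000, 6000, 8000, 8000]
weights = [1 / sd ** 2 for sd in assumed_sd]

intercept, slope = weighted_least_squares(ages, salaries, weights)
print(round(intercept), round(slope))
```

With equal weights this reduces to ordinary least squares; the practical difficulty is choosing the weights, which would have to come from a model of how the residual variance changes across observations.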