# Salary

Published: January 2017

## Content

Stern’s MBA 1 students expect to make the big bucks after graduation!
Data File: salary.doc
The Stern School of Business is highly reputable. Students generally attend it not only
to broaden their knowledge of business but also to increase their salary. I am interested
in finding out what leads MBA 1 students to believe that they are going to earn a certain
salary after graduation. In other words, what factors affect students’ expected salary?
As I could not find enough meaningful data on the topic, I decided to conduct my own
survey. I was actually very surprised to see how cooperative students were in filling out
my survey when I gave them candy in return!
Here are some descriptive statistics for all 40 MBA 1 students surveyed. Variables include
expected salary after graduation (Expected), age of the person (age), the number of years
of working experience in the industry chosen for after graduation (num. of), whether the
person plans to work in finance (plan fin), in consulting (plan con), in marketing
(plan mar) or in other industries (plan oth), whether the person is unsure about the
industry (unsure), whether the person plans to work for a Fortune 500 company (500 comp),
the number of hours he/she plans to work a week (Hrs. of), and his/her past salary
(past sal).
Descriptive Statistics
Variable        N     Mean   Median  Tr Mean    StDev  SE Mean
Expected       40    84250    80000    83333    17213     2722
age            40   26.825   26.000   26.694    2.490    0.394
num. of        40    1.725    0.000    1.500    2.449    0.387
plan fin       40   0.6750   1.0000   0.6944   0.4743   0.0750
plan con       40   0.2500   0.0000   0.2222   0.4385   0.0693
plan mar       40   0.0750   0.0000   0.0278   0.2667   0.0422
plan oth       40  0.00000  0.00000  0.00000  0.00000  0.00000
unsure         40  0.00000  0.00000  0.00000  0.00000  0.00000
500 comp       40   0.5000   0.5000   0.5000   0.5064   0.0801
Hrs. of        40    70.62    70.00    70.42    12.57     1.99
past sal       40    54250    48500    51667    20664     3267
Variable      Min      Max       Q1       Q3
Expected    60000   125000    70000    97500
age        23.000   33.000   25.000   29.000
num. of     0.000    8.000    0.000    3.000
plan fin   0.0000   1.0000   0.0000   1.0000
plan con   0.0000   1.0000   0.0000   0.7500
plan mar   0.0000   1.0000   0.0000   0.0000
plan oth  0.00000  0.00000  0.00000  0.00000
unsure    0.00000  0.00000  0.00000  0.00000
500 comp   0.0000   1.0000   0.0000   1.0000
Hrs. of     50.00    95.00    60.00    80.00
past sal    30000   125000    40000    60000
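The summary columns above can be reproduced with a few lines of code. The sketch below is not the author’s Minitab session. It assumes that Tr Mean is a 5% trimmed mean (the lowest and highest 5% of values are dropped, rounded to a whole number of observations), and it rebuilds the plan mar column from its reported mean (0.075 × 40 = 3 students coded 1). The function name `describe` is hypothetical.

```python
import math
import statistics


def describe(values, trim=0.05):
    """Return N, mean, median, trimmed mean, StDev and SE Mean for a sample."""
    data = sorted(values)
    n = len(data)
    k = round(n * trim)  # observations dropped from each end (assumed rule)
    trimmed = data[k:n - k] if k else data
    stdev = statistics.stdev(data)  # sample standard deviation (n - 1)
    return {
        "N": n,
        "Mean": statistics.mean(data),
        "Median": statistics.median(data),
        "Tr Mean": statistics.mean(trimmed),
        "StDev": stdev,
        "SE Mean": stdev / math.sqrt(n),
    }


# plan mar rebuilt from the table: mean 0.075 with N = 40 means three 1s.
plan_mar = [1] * 3 + [0] * 37
summary = describe(plan_mar)
for name, value in summary.items():
    print(f"{name:8s} {value:.4f}")
```

Under these assumptions the output agrees with the plan mar row of the table to four decimals, which suggests the trimming rule is a reasonable guess.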

A first look at the data shows several things. None of the 40 students claims to be
unsure about which industry to enter after graduation, and none plans to work in an
industry other than finance, consulting or marketing. Moreover, the average expected
salary of \$84,250 is much higher than the actual 1996 salary of graduating MBA students,
\$70,000 (footnote). This discrepancy could mean that MBA 1 students are rather optimistic.
Salaries are often right tailed. Let’s check the distribution of both expected and
past salaries.

[Figure: histograms of expected salary and past salary (frequency on the vertical axis)]

Both salary distributions are right tailed. Thus, it might be helpful to take logs of the
salaries.
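As a small illustration of why logging helps, the sketch below compares the upper and lower tails of expected salary before and after a log transformation, using only the minimum, median and maximum from the descriptive statistics. Base 10 is an assumption here, since the text does not say which base was used; the helper `tail_ratio` is hypothetical.

```python
import math

# Min, median and max of expected salary from the descriptive statistics.
low, median, high = 60000, 80000, 125000


def tail_ratio(lo, mid, hi):
    """Length of the upper tail relative to the lower tail."""
    return (hi - mid) / (mid - lo)


raw_ratio = tail_ratio(low, median, high)
log_ratio = tail_ratio(*(math.log10(v) for v in (low, median, high)))

print(f"raw scale:   upper tail is {raw_ratio:.2f}x the lower tail")
print(f"log10 scale: upper tail is {log_ratio:.2f}x the lower tail")
```

A ratio closer to 1 on the log scale means the transformation pulls in the right tail, although it does not remove the asymmetry entirely.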

[Figure: histograms of log expected and log past (frequency on the vertical axis)]

Let’s check potential outliers in both expected and past salaries:

[Figure: boxplots of expected salary and past salary]

There appear to be 3 outliers among the past salary observations.
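A quick check of where outliers can occur is possible from the quartiles alone. The sketch below assumes the boxplots use the common 1.5 × IQR rule (the text does not say how the points were flagged) and takes Q1, Q3, the minimum and the maximum from the descriptive statistics; the function name `fences` is hypothetical.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) boxplot fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles, minimum and maximum taken from the descriptive statistics.
summary = {
    "Expected": {"q1": 70000, "q3": 97500, "min": 60000, "max": 125000},
    "past sal": {"q1": 40000, "q3": 60000, "min": 30000, "max": 125000},
}

for name, s in summary.items():
    lower, upper = fences(s["q1"], s["q3"])
    has_outliers = s["min"] < lower or s["max"] > upper
    print(f"{name}: fences = ({lower:.0f}, {upper:.0f}), outliers possible: {has_outliers}")
```

Under this rule only past salary has values beyond a fence, which is consistent with outliers showing up in the past salary boxplot and not in the expected salary one.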
Now, let’s look at the distribution of the different industries students plan to work in.

[Figure: boxplots of expected salary by plan fin., plan cons. and plan mark. (0 = no, 1 = yes)]

Only 3 of the 40 students plan to work in marketing (the mean of plan mar is 0.075), so
this variable is unlikely to carry much weight. Students who plan to work in finance are
coded as 1. It is interesting to see that students who plan to work in finance and those
who do not have the same median expected salary (\$80,000). The indicators for planning
to go into finance and planning to go into consulting appear to be negatively correlated.
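The strength of that negative correlation can be worked out from the reported means alone. The sketch below is an illustration, not the author’s calculation: it assumes each student picked exactly one of the three industries, which is consistent with the means 0.675, 0.25 and 0.075 adding up to 1, and it rebuilds the indicator columns from the implied counts. The helper `pearson` is written out by hand to keep the example self-contained.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Counts implied by the reported means: 0.675, 0.25 and 0.075 of 40 students.
n_fin, n_con, n_mar = 27, 10, 3

# Assumption: each student chose exactly one of the three industries.
plan_fin = [1] * n_fin + [0] * (n_con + n_mar)
plan_con = [0] * n_fin + [1] * n_con + [0] * n_mar
plan_mar = [0] * (n_fin + n_con) + [1] * n_mar

r = pearson(plan_fin, plan_con)
print(f"corr(plan fin, plan con) = {r:.3f}")

# The three dummies always add up to 1, i.e. to the constant term.
row_sums = {f + c + m for f, c, m in zip(plan_fin, plan_con, plan_mar)}
print(f"distinct row sums: {row_sums}")
```

Because the three indicators always sum to 1, a regression that includes a constant cannot estimate all three at once; one of them has to be left out.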

Regression Analysis
* plan mark. is highly correlated with other X variables
* plan mark. has been removed from the equation
* plan other has all values = 0
* plan other has been removed from the equation
* unsure has all values = 0
* unsure has been removed from the equation

The regression equation is
Expected salary = - 44885 + 3352 age + 647 num. of yrs. of exp.
- 2591 plan fin. + 4025 plan cons. + 3171 500 comp?
+ 432 Hrs. of work + 0.125 past salary
Predictor       Coef       StDev          T        P       VIF
Constant      -44885       18068      -2.48    0.018
age           3351.8       849.5       3.95    0.000       3.6
num. of        647.5       514.7       1.26    0.217       1.3
plan fin       -2591        5082      -0.51    0.614       4.6
plan con        4025        5344       0.75    0.457       4.4
500 comp        3171        2887       1.10    0.280       1.7
Hrs. of        431.6       122.2       3.53    0.001       1.9
past sal     0.12506     0.09684       1.29    0.206       3.2
S = 6999        R-Sq = 86.4%     R-Sq(adj) = 83.5%
Analysis of Variance
Source       DF          SS          MS         F        P
Regression    7  9988126411  1426875202     29.13    0.000
Error        32  1567373589    48980425
Total        39 11555500000
Source       DF      Seq SS
age           1  8185071089
num. of       1    75586010
plan fin      1   531542752
plan con      1   158051549
500 comp      1    84132640
Hrs. of       1   872058157
past sal      1    81684215
Unusual Observations
Obs       age   Expected        Fit  StDev Fit   Residual    St Resid
16      25.0      70000      86541       3156     -16541      -2.65R
34      27.0     120000     101312       3493      18688       3.08R
R denotes an observation with a large standardized residual
Durbin-Watson statistic = 1.64

We can see that Minitab automatically drops 3 variables: students planning to work in
marketing, students planning to work in other industries, and students who are unsure
which industry they will work in. I would also remove the plan consulting variable
because it is highly negatively correlated with the plan finance variable.
The overall regression is statistically significant (F = 29.13, P = 0.000). However,
several variables have P-values over .05.
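The summary line and the F statistic of the output above follow directly from its Analysis of Variance table. The short sketch below recomputes them from the sums of squares and degrees of freedom as a consistency check; the variable names are my own.

```python
import math

# Values from the Analysis of Variance table of the seven-predictor model.
ss_regression, df_regression = 9988126411, 7
ss_error, df_error = 1567373589, 32
ss_total, df_total = 11555500000, 39

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

r_sq = ss_regression / ss_total
r_sq_adj = 1 - ms_error / (ss_total / df_total)
f_stat = ms_regression / ms_error
s = math.sqrt(ms_error)

print(f"R-Sq      = {r_sq:.1%}")
print(f"R-Sq(adj) = {r_sq_adj:.1%}")
print(f"F         = {f_stat:.2f}")
print(f"S         = {s:.0f}")
```

The adjusted R-Sq penalizes the seven predictors by comparing mean squares instead of raw sums of squares, which is why it is lower than R-Sq.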
[Figure: residual plots (response is Expected): residuals versus the fitted values, residuals versus the order of the data, normal probability plot of the residuals, histogram of the residuals]


The distribution of the residuals looks normal. However, we can notice a couple of outliers.
Now, let’s try a regression with logged salaries for past and expected while keeping the
same variables.
Regression Analysis
* plan mark. is highly correlated with other X variables
* plan mark. has been removed from the equation
* plan other has all values = 0
* plan other has been removed from the equation
* unsure has all values = 0
* unsure has been removed from the equation

The regression equation is
log expected = 4.26 + 0.0178 age + 0.00289 num. of yrs. of exp.
- 0.0137 plan fin. + 0.0190 plan cons. + 0.0125 500 comp?
+ 0.00202 Hrs. of work + 0.000001 past salary
Predictor       Coef       StDev          T        P       VIF
Constant     4.26054     0.08967      47.52    0.000
age         0.017759    0.004216       4.21    0.000       3.6
num. of     0.002887    0.002554       1.13    0.267       1.3
plan fin    -0.01366     0.02522      -0.54    0.592       4.6
plan con     0.01899     0.02652       0.72    0.479       4.4
500 comp     0.01251     0.01433       0.87    0.389       1.7
Hrs. of    0.0020215   0.0006063       3.33    0.002       1.9
past sal  0.00000057  0.00000048       1.18    0.246       3.2
S = 0.03473     R-Sq = 86.2%     R-Sq(adj) = 83.2%
Analysis of Variance
Source       DF          SS          MS         F        P
Regression    7    0.241162    0.034452     28.56    0.000
Error        32    0.038600    0.001206
Total        39    0.279762
Source       DF      Seq SS
age           1    0.201723
num. of       1    0.001471
plan fin      1    0.012164
plan con      1    0.003731
500 comp      1    0.001379
Hrs. of       1    0.019007
past sal      1    0.001687
Unusual Observations
Obs       age   log expe        Fit  StDev Fit   Residual    St Resid
16      25.0    4.84510    4.92668    0.01566   -0.08158      -2.63R
34      27.0    5.07918    4.99767    0.01733    0.08151       2.71R
R denotes an observation with a large standardized residual
Durbin-Watson statistic = 2.01

Logging the salaries does not change the regression model significantly (R-Sq is 86.2%
versus 86.4%). Thus, I will keep the unlogged data.
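For readers who want to interpret the logged model, its coefficients can be turned into approximate percentage effects. The sketch below assumes the logs are base 10, which is consistent with an expected salary of 70000 appearing as 4.84510 in the output above; the function name `percent_change` is hypothetical.

```python
# Coefficients of the log-salary regression, copied from the Minitab output.
coefficients = {
    "age": 0.017759,
    "Hrs. of": 0.0020215,
}


def percent_change(coef, base=10):
    """Percent change in salary for a one-unit increase in the predictor."""
    return (base ** coef - 1) * 100


for name, coef in coefficients.items():
    print(f"{name}: {percent_change(coef):+.2f}% expected salary per unit")
```

Read this way, one extra year of age goes with roughly 4% higher expected salary and one extra planned working hour per week with roughly 0.5%, holding the other variables fixed.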
Let’s see how the regression looks without the plan consulting variable.

Regression Analysis
The regression equation is
Expected salary = - 46346 + 3477 age + 574 num. of yrs. of exp.
- 5884 plan fin. + 2553 500 comp? + 466 Hrs. of work
+ 0.113 past salary
Predictor       Coef       StDev          T        P       VIF
Constant      -46346       17846      -2.60    0.014
age           3476.8       827.6       4.20    0.000       3.4
num. of        573.8       501.9       1.14    0.261       1.2
plan fin       -5884        2573      -2.29    0.029       1.2
500 comp        2553        2750       0.93    0.360       1.6
Hrs. of        465.9       112.6       4.14    0.000       1.6
past sal     0.11303     0.09489       1.19    0.242       3.1
S = 6953        R-Sq = 86.2%     R-Sq(adj) = 83.7%

Analysis of Variance
Source       DF          SS          MS         F        P
Regression    6  9960345863  1660057644     34.34    0.000
Error        33  1595154137    48338004
Total        39 11555500000
Source       DF      Seq SS
age           1  8185071089
num. of       1    75586010
plan fin      1   531542752
500 comp      1    22521046
Hrs. of       1  1077031091
past sal      1    68593876
Unusual Observations
Obs       age   Expected        Fit  StDev Fit   Residual    St Resid
10      33.0     110000     111467       5049      -1467      -0.31 X
16      25.0      70000      86410       3131     -16410      -2.64R
34      27.0     120000     101124       3461      18876       3.13R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 1.67
[Figure: residual plots (response is Expected): histogram of the residuals, residuals versus the fitted values, normal probability plot of the residuals; observations 16 and 34 are labelled]


Note that the outliers are still here. Now, let’s run a best subsets regression to find out which variables
are best to choose for our model.

Best Subsets Regression

Response is Expected
Vars   R-Sq  R-Sq(adj)    C-p        S  age  num. of  plan fin  500 comp  Hrs. of  past sal
1      70.8       70.1   33.7   9417.8    X
1      57.8       56.7   64.8    11325                                                    X
2      81.2       80.2   10.9   7660.7    X                                     X
2      75.5       74.1   24.6   8752.2    X                  X
3      84.7       83.4    4.6   7012.4    X                  X                  X
3      83.2       81.8    8.1   7334.3    X                            X        X
4      85.3       83.6    5.2   6971.0    X                  X         X        X
4      85.2       83.5    5.5   7000.3    X                  X                  X         X
5      85.8       83.8    5.9   6938.4    X        X         X                  X         X
5      85.6       83.5    6.3   6983.8    X                  X         X        X         X
6      86.2       83.7    7.0   6952.6    X        X         X         X        X         X
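The C-p column of the table above can be recovered from its S column, which makes it easier to see what the statistic measures. The sketch below assumes the usual definition, C-p = SSE_p / MSE_full - (n - 2(p + 1)), with SSE_p rebuilt as S² × (n - p - 1) and MSE_full taken from the six-predictor row; the function name `mallows_cp` is my own.

```python
n = 40             # number of students
full_s = 6952.6    # S of the six-predictor row of the table

# (number of predictors, S, reported C-p) for each row of the table.
rows = [
    (1, 9417.8, 33.7), (1, 11325, 64.8),
    (2, 7660.7, 10.9), (2, 8752.2, 24.6),
    (3, 7012.4, 4.6), (3, 7334.3, 8.1),
    (4, 6971.0, 5.2), (4, 7000.3, 5.5),
    (5, 6938.4, 5.9), (5, 6983.8, 6.3),
    (6, 6952.6, 7.0),
]


def mallows_cp(p, s):
    """C-p from a subset's S, using SSE = S^2 * (n - p - 1)."""
    sse = s ** 2 * (n - p - 1)
    return sse / full_s ** 2 - (n - 2 * (p + 1))


computed = [(p, cp, mallows_cp(p, s)) for p, s, cp in rows]
for p, reported, value in computed:
    print(f"vars={p}  reported C-p={reported:5.1f}  computed={value:5.2f}  p+1={p + 1}")
```

Printing p + 1 next to each value shows which subsets have a C-p close to their target: models with too few variables sit far above it, while the full model equals it by construction.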

My choice is between two of the models above (shown in bold in the original output). One
reason is that they have a small S and a relatively high R-Sq. Another reason is that C-p
should be approximately p + 1, where p is the number of predictors (5 for a model with four
predictors). Thus, I picked the one that has a C-p of 5.5 and an S of 7000.3. Here is the new regression:
Regression Analysis
The regression equation is
Expected salary = - 57732 + 4034 age - 6828 plan fin. + 0.0975 past salary
+ 468 Hrs. of work
Predictor       Coef       StDev          T        P       VIF
Constant      -57732       16415      -3.52    0.001
age           4034.1       753.2       5.36    0.000       2.8
plan fin       -6828        2392      -2.85    0.007       1.0
past sal     0.09755     0.09198       1.06    0.296       2.9
Hrs. of        468.5       111.7       4.19    0.000       1.6
S = 7000        R-Sq = 85.2%     R-Sq(adj) = 83.5%
Analysis of Variance
Source       DF          SS          MS         F        P
Regression    4  9840364419  2460091105     50.20    0.000
Error        35  1715135581    49003874
Total        39 11555500000
Source       DF      Seq SS
age           1  8185071089
plan fin      1   536172676
past sal      1   257832450
Hrs. of       1   861288204
Unusual Observations
Obs       age   Expected        Fit  StDev Fit   Residual    St Resid
8      29.0     125000     111564       2946      13436       2.12R
10      33.0     110000     115013       4497      -5013      -0.93 X
16      25.0      70000      87329       2846     -17329      -2.71R
34      27.0     120000     101545       3129      18455       2.95R
39      30.0     100000     107987       4449      -7987      -1.48 X
40      31.0     110000     114851       4466      -4851      -0.90 X
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 1.57

Past salary still has a P-value above .05 (P = 0.296). So, I decided to take this variable
out of the regression.
Regression Analysis
The regression equation is
Expected salary = - 69455 + 4583 age - 6843 plan fin. + 501 Hrs. of work
Predictor       Coef       StDev          T        P
Constant      -69455       12157      -5.71    0.000
age           4583.5       547.7       8.37    0.000
plan fin       -6843        2396      -2.86    0.007
Hrs. of        500.9       107.7       4.65    0.000
S = 7012        R-Sq = 84.7%     R-Sq(adj) = 83.4%
Analysis of Variance
Source       DF          SS          MS         F        P
Regression    3  9785248786  3261749595     66.33    0.000
Error        36  1770251214    49173645
Total        39 11555500000
Source       DF      Seq SS
age           1  8185071089
plan fin      1   536172676
Hrs. of       1  1064005021
Durbin-Watson statistic = 1.64
A Durbin-Watson statistic of 1.64 suggests no autocorrelation, which is consistent with the
residuals versus order plot.
Unusual Observations
Obs       age   Expected        Fit  StDev Fit   Residual    St Resid
8      29.0     125000     111047       2910      13953       2.19R
10      33.0     110000     116859       4154      -6859      -1.21 X
16      25.0      70000      87704       2829     -17704      -2.76R
34      27.0     120000     101880       3118      18120       2.88R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
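One way to see how little past salary adds is a partial F-test between the four-variable and three-variable regressions above. The sketch below is an added check, not part of the original analysis. It uses the error sums of squares from the two Analysis of Variance tables; for a single dropped variable, the square root of this F should match the absolute T statistic reported for that variable.

```python
import math

# Error sums of squares from the two Analysis of Variance tables.
sse_with_past_sal, df_with = 1715135581, 35        # age, plan fin, past sal, Hrs. of
sse_without_past_sal, df_without = 1770251214, 36  # age, plan fin, Hrs. of

mse_with = sse_with_past_sal / df_with
extra_ss = sse_without_past_sal - sse_with_past_sal
partial_f = (extra_ss / (df_without - df_with)) / mse_with

print(f"partial F for past sal = {partial_f:.2f}")
print(f"sqrt(F) = {math.sqrt(partial_f):.2f} (compare with |T| = 1.06)")
```

An F this close to 1 means the extra sum of squares explained by past salary is about the size of ordinary residual noise, which supports leaving the variable out.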

[Figure: residual plots (response is Expected): residuals versus the order of the data, residuals versus the fitted values, normal probability plot of the residuals, histogram of the residuals; observations 16 and 34 are labelled]


Now we have a statistically significant model in which every predictor has a P-value below .05. However,
two outliers are still visible in the residual plots. We can try to get rid of these 2 outliers (observations 34 and 16).
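To make the three-variable equation fitted to all 40 students concrete, the sketch below evaluates it for a hypothetical student. The coefficients are copied from the Minitab output; the student (27 years old, planning 70 hours a week) and the function name `predict_expected_salary` are made up for illustration.

```python
# Coefficients of the three-variable model fitted to all 40 students.
INTERCEPT = -69455
COEF_AGE = 4583.5
COEF_PLAN_FIN = -6843
COEF_HOURS = 500.9


def predict_expected_salary(age, plan_fin, hours_per_week):
    """Fitted expected salary in dollars; plan_fin is 1 for finance, else 0."""
    return (INTERCEPT + COEF_AGE * age + COEF_PLAN_FIN * plan_fin
            + COEF_HOURS * hours_per_week)


# Hypothetical student: 27 years old, planning 70 hours a week.
finance = predict_expected_salary(27, 1, 70)
other = predict_expected_salary(27, 0, 70)
print(f"finance: {finance:,.0f}  not finance: {other:,.0f}")
```

The gap between the two predictions is exactly the plan fin coefficient: for the same age and hours, a student planning finance is fitted about 6,800 dollars lower.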

Regression Analysis
The regression equation is
Expected salary = - 65739 + 4508 age - 6676 plan fin. + 475 Hrs. of work
Predictor       Coef       StDev          T        P       VIF
Constant      -65739       10088      -6.52    0.000
age           4508.2       474.3       9.51    0.000       1.6
plan fin       -6676        2071      -3.22    0.003       1.0
Hrs. of        474.7       100.6       4.72    0.000       1.6
S = 5742        R-Sq = 88.9%     R-Sq(adj) = 87.9%
Analysis of Variance
Source       DF          SS          MS         F        P
Regression    3  8941383570  2980461190     90.41    0.000
Error        34  1120826956    32965499
Total        37 10062210526
Source       DF      Seq SS
age           1  7937259218
plan fin      1   270764351
Hrs. of       1   733360001
Unusual Observations
Obs       age   Expected        Fit  StDev Fit   Residual    St Resid
8      29.0     125000     110091       2846      14909       2.99R
10      33.0     110000     116257       3415      -6257      -1.36 X
13      24.0      70000      80430       2798     -10430      -2.08R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 1.65
[Figure: residual plots (response is Expected): residuals versus the order of the data, residuals versus the fitted values, normal probability plot of the residuals; observation 8 is labelled]


Regression Analysis
The regression equation is
Expected salary = - 63050 + 4653 age - 4558 plan fin. + 351 Hrs. of work
Predictor       Coef       StDev          T        P       VIF
Constant      -63050        8827      -7.14    0.000
age           4652.8       415.5      11.20    0.000       1.6
plan fin       -4558        1908      -2.39    0.023       1.1
Hrs. of       351.10       94.81       3.70    0.001       1.7
S = 5004        R-Sq = 90.1%     R-Sq(adj) = 89.2%
Analysis of Variance
Source       DF          SS          MS         F        P
Regression    3  7482915093  2494305031     99.63    0.000
Error        33   826165988    25035333
Total        36  8309081081
Source       DF      Seq SS
age           1  7066013767
plan fin      1    73551726
Hrs. of       1   343349601
Unusual Observations
Obs       age   Expected        Fit  StDev Fit   Residual    St Resid
9      33.0     110000     115070       2996      -5070      -1.27 X
34      27.0      70000      80840       1093     -10840      -2.22R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 2.00
[Figure: residual plots (response is Expected): residuals versus the fitted values, residuals versus the order of the data, normal probability plot of the residuals, histogram of the residuals]

Heteroscedasticity is non-constant variance. There appears to be non-constant variance in
the plot of residuals versus fitted values. To explore this further, we would have to run
a Levene’s test. Hopefully, logging the variables will take care of this. Let’s do a
regression with logged expected salaries.


Let’s check residuals, leverage points and Cook’s distance.
SRES5      HI5          COOK5
 0.59872   0.079655     0.007756
 0.27269   0.116828     0.002459
-0.01353   0.076361     0.000004
 0.97156   0.103741     0.027315
-1.26510   0.044574     0.018667
 1.15005   0.063633     0.022470
 1.27035   0.105834     0.047753
 1.99149   0.152316     0.178160
-1.26524   0.358517*    0.223670
-0.73194   0.143014     0.022351
-0.15638   0.058155     0.000378
-1.58437   0.284480     0.249509
-0.91533   0.063633     0.014234
 0.76627   0.063603     0.009971
 0.23355   0.128017     0.002002
-1.44970   0.231559     0.158323
-0.75144   0.054100     0.008074
-0.39198   0.079616     0.003323
 1.54343   0.100642     0.066644
-1.68307   0.092249     0.071967
 0.94590   0.060364     0.014370
-0.01353   0.076361     0.000004
 0.35320   0.063603     0.002118
-0.30542   0.095103     0.002451
 0.59163   0.084579     0.008085
 1.66542   0.197385     0.170528
 0.21357   0.105834     0.001350
-1.20202   0.170243     0.074111
 0.04569   0.064749     0.000036
 0.42040   0.043466     0.002008
 0.37278   0.145178     0.005900
 1.33865   0.097529     0.048414
-0.91533   0.063633     0.014234
-2.22013   0.047736     0.061771
-0.15638   0.058155     0.000378
-0.38165   0.091066     0.003648
 0.38048   0.134487     0.005624

Leverage values should be less than 2.5*(p+1)/n, here 2.5*(4+1)/37 = .34.
One leverage value, about .36 (marked *), is slightly above this cut-off.
Cook’s distance should be less than 1, which is true for every observation.
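The three diagnostic columns are linked: Cook’s distance can be computed from the standardized residual and the leverage. The sketch below uses the standard identity D = (r² / k) × h / (1 - h), where k is the number of estimated coefficients. Taking k = 4 (constant, age, plan fin, Hrs. of) is an assumption, chosen because it reproduces the reported COOK5 values for the rows tried here. The leverage cut-off is the one stated in the text.

```python
N_OBS = 37       # observations in this regression
N_COEF = 4       # assumed: constant + age + plan fin + Hrs. of

# (standardized residual, leverage, reported Cook's distance) for a few rows.
rows = [
    (0.59872, 0.079655, 0.007756),
    (1.99149, 0.152316, 0.178160),
    (-1.26524, 0.358517, 0.223670),
    (-1.58437, 0.284480, 0.249509),
]


def cooks_distance(sres, leverage, n_coef=N_COEF):
    """Cook's distance from a standardized residual and a leverage value."""
    return sres ** 2 / n_coef * leverage / (1 - leverage)


threshold = 2.5 * (4 + 1) / N_OBS   # the cut-off used in the text
for sres, hi, reported in rows:
    d = cooks_distance(sres, hi)
    flag = "high leverage" if hi > threshold else ""
    print(f"SRES={sres:8.5f} HI={hi:.6f} Cook={d:.6f} (reported {reported:.6f}) {flag}")
```

The identity shows why a point needs both a large residual and high leverage to be influential: the starred observation has the highest leverage, but its moderate residual keeps its Cook’s distance well below 1.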

Regression Analysis
The regression equation is
logexp = 4.18 + 0.0231 age - 0.0256 plan fin. + 0.00186 Hrs. of work
Predictor       Coef       StDev          T        P       VIF
Constant     4.18204     0.04868      85.92    0.000
age         0.023056    0.002291      10.06    0.000       1.6
plan fin    -0.02560     0.01052      -2.43    0.021       1.1
Hrs. of    0.0018643   0.0005228       3.57    0.001       1.7
S = 0.02759     R-Sq = 88.3%     R-Sq(adj) = 87.2%
Analysis of Variance
Source       DF          SS          MS         F        P
Regression    3    0.188991    0.062997     82.74    0.000
Error        33    0.025126    0.000761
Total        36    0.214116
Source       DF      Seq SS
age           1    0.176882
plan fin      1    0.002427
Hrs. of       1    0.009681
Unusual Observations
Obs       age     logexp        Fit  StDev Fit   Residual    St Resid
8      25.0    4.90309    4.85166    0.01077    0.05143       2.02R
9      33.0    5.04139    5.07340    0.01652   -0.03201      -1.45 X
20      25.0    4.77815    4.83539    0.00838   -0.05724      -2.18R
34      27.0    4.84510    4.90015    0.00603   -0.05505      -2.04R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
Durbin-Watson statistic = 2.34
[Figure: residuals versus the fitted values (response is logexp)]

It appears that there is still non-constant variance. The only reasonable thing to do now is
weighted least squares.
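The analysis stops before the weighted fit is carried out, so the following is only a minimal sketch of what weighted least squares does, reduced to one predictor. The numbers are made up for illustration and are not the survey data, and the choice of weights (inverse of age squared) is a hypothetical variance model, not one the text proposes.

```python
def weighted_least_squares(x, y, w):
    """Fit y = a + b*x by minimizing sum(w * (y - a - b*x)**2)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up numbers for illustration only; they are not the survey data.
age = [24, 25, 26, 28, 30, 33]
salary = [65000, 70000, 72000, 85000, 90000, 115000]

# Hypothetical weights: assume the error variance grows with age squared.
weights = [1 / a ** 2 for a in age]

ols = weighted_least_squares(age, salary, [1.0] * len(age))
wls = weighted_least_squares(age, salary, weights)
print(f"OLS: intercept={ols[0]:.0f}, slope={ols[1]:.0f}")
print(f"WLS: intercept={wls[0]:.0f}, slope={wls[1]:.0f}")
```

Observations believed to be noisier get smaller weights, so they pull the fitted line less; with equal weights the function reduces to ordinary least squares.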

[Figure: residual plots (response is logexp): residuals versus the order of the data, normal probability plot of the residuals, histogram of the residuals]
