1
00:00:02,760 --> 00:00:04,990
- [Instructor] Hello, and
welcome to the video lecture

2
00:00:04,990 --> 00:00:06,863
on dummy variables.

3
00:00:10,710 --> 00:00:13,920
So far, all of our variables,

4
00:00:13,920 --> 00:00:16,340
all the ones that we've worked with,

5
00:00:16,340 --> 00:00:20,230
have been numerical,
for the most part ratio,

6
00:00:20,230 --> 00:00:23,543
but can be expressed as a true number.

7
00:00:25,200 --> 00:00:27,810
But many important attributes

8
00:00:27,810 --> 00:00:31,500
that we use in social science

9
00:00:31,500 --> 00:00:35,700
and other kinds of investigations
are nominal or ordinal.

10
00:00:35,700 --> 00:00:40,320
Thinking about marketing research,

11
00:00:40,320 --> 00:00:43,430
you might wanna think about
firmographic, demographic,

12
00:00:43,430 --> 00:00:45,510
geographic, psychographic,

13
00:00:45,510 --> 00:00:50,510
all of these may need a
nominal or ordinal variable

14
00:00:53,410 --> 00:00:54,630
to express them.

15
00:00:54,630 --> 00:00:57,763
So such as do you live in Vermont?

16
00:00:59,410 --> 00:01:02,524
Or your gender

17
00:01:02,524 --> 00:01:04,320
or your race

18
00:01:04,320 --> 00:01:08,330
or your attitude about certain things.

19
00:01:08,330 --> 00:01:13,330
So it's important to have
the ability to express these

20
00:01:16,000 --> 00:01:20,083
and to be able to work with
these in regression analysis.

21
00:01:27,150 --> 00:01:32,150
The way that we address this
is to create one or more binary

22
00:01:33,810 --> 00:01:36,453
or so-called dummy variables.

23
00:01:37,570 --> 00:01:42,570
So these can be expressed
such as zero, one.

24
00:01:44,360 --> 00:01:47,470
One if you are that
thing, zero if you're not.

25
00:01:47,470 --> 00:01:49,733
So Vermont resident.

26
00:01:56,430 --> 00:01:58,570
And for me, it makes more sense

27
00:01:58,570 --> 00:02:03,570
to do it like this than to
call it residence or state.

28
00:02:04,160 --> 00:02:07,890
Because if you call it Vermont resident

29
00:02:07,890 --> 00:02:10,960
then you're sure what the one stands for,

30
00:02:10,960 --> 00:02:12,260
where if it's state

31
00:02:12,260 --> 00:02:14,723
it's like I don't know
what state is a one.

32
00:02:16,600 --> 00:02:19,130
Certified organic.

33
00:02:19,130 --> 00:02:23,038
One of my favorites, of
course, is full professor.

34
00:02:23,038 --> 00:02:27,380
And,

35
00:02:27,380 --> 00:02:29,360
gender may be expressed

36
00:02:29,360 --> 00:02:33,320
by coding somebody as
one if they're a female

37
00:02:33,320 --> 00:02:36,920
and zero if they're not.
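
A minimal sketch of this zero/one coding in Python. The raw answers and the variable names below are made up for illustration; only the idea of naming each dummy after the category coded as one comes from the lecture.

```python
# Hypothetical raw survey answers (illustrative values only).
states = ["VT", "NY", "VT", "NH", "ME"]
genders = ["female", "male", "non-binary", "female", "male"]

# Name each dummy after the category coded as 1,
# so it is always clear what a 1 stands for.
vermont_resident = [1 if s == "VT" else 0 for s in states]
female = [1 if g == "female" else 0 for g in genders]

print(vermont_resident)  # [1, 0, 1, 0, 0]
print(female)            # [1, 0, 0, 1, 0]
```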

38
00:02:36,920 --> 00:02:38,460
In some cases,

39
00:02:38,460 --> 00:02:40,960
a series of them may be needed.

40
00:02:40,960 --> 00:02:43,400
If there's more than one state,

41
00:02:43,400 --> 00:02:45,390
more than, Vermont or not,

42
00:02:45,390 --> 00:02:47,660
but you might wanna also say,

43
00:02:47,660 --> 00:02:52,660
do you live in New
Hampshire, Maine, et cetera?

44
00:02:52,810 --> 00:02:57,810
You may need to express
gender as more than two.

45
00:02:58,040 --> 00:02:59,870
So you could have male, female

46
00:02:59,870 --> 00:03:03,350
and a third such as non-binary

47
00:03:03,350 --> 00:03:06,483
or none of these.

48
00:03:12,920 --> 00:03:16,480
So focusing a bit on gender.

49
00:03:16,480 --> 00:03:19,023
This can be the single,

50
00:03:20,750 --> 00:03:24,130
using a single dummy in a regression

51
00:03:24,130 --> 00:03:28,633
such as how does being female impact wage?

52
00:03:32,828 --> 00:03:35,773
Do females own or earn more or less

53
00:03:36,790 --> 00:03:41,790
or the same as other genders?

54
00:03:42,010 --> 00:03:44,090
So, and you might wanna think about,

55
00:03:44,090 --> 00:03:49,090
do Vermont resident students
have higher or lower

56
00:03:50,940 --> 00:03:54,923
or the same GPA at UVM

57
00:03:55,770 --> 00:04:00,063
as students that are not from Vermont.

58
00:04:01,010 --> 00:04:03,770
So to ask the first question,

59
00:04:03,770 --> 00:04:08,653
we have a very simple model
here with two regressors,

60
00:04:09,600 --> 00:04:11,930
female and education.

61
00:04:11,930 --> 00:04:14,740
So how many years of education

62
00:04:16,890 --> 00:04:21,330
and the effect of being female on wage.

63
00:04:21,330 --> 00:04:25,640
You can think about
controlling for gender.

64
00:04:25,640 --> 00:04:27,130
What's the effect of education?

65
00:04:27,130 --> 00:04:29,740
And controlling for education,

66
00:04:29,740 --> 00:04:31,723
what is the effect of gender?

67
00:04:32,840 --> 00:04:36,550
And note that the Wooldridge
books use this funky D,

68
00:04:36,550 --> 00:04:40,500
this delta D for dummy variables.

69
00:04:40,500 --> 00:04:45,133
And I'm going to continue
with this convention.

70
00:04:48,380 --> 00:04:49,880
Here's our model wage.

71
00:04:49,880 --> 00:04:52,780
We have female and we have education.

72
00:04:52,780 --> 00:04:57,780
And the best way to think
about the value of this delta

73
00:04:59,630 --> 00:05:03,343
is to think of it as an intercept shift.

74
00:05:04,280 --> 00:05:08,130
So depending on its magnitude,

75
00:05:08,130 --> 00:05:13,130
if it's greater than
zero or less than zero,

76
00:05:13,220 --> 00:05:16,430
it's going to shift the
intercept up and down.

77
00:05:16,430 --> 00:05:18,483
So you can think about it as,

78
00:05:20,010 --> 00:05:23,940
let's assume, or let's hypothesize

79
00:05:23,940 --> 00:05:28,570
that delta naught is less than zero.

80
00:05:28,570 --> 00:05:30,010
That all else equal,

81
00:05:30,010 --> 00:05:34,723
being female has a
negative effect on wages.

82
00:05:36,400 --> 00:05:40,420
Which I think, historically,
has largely been true.

83
00:05:40,420 --> 00:05:44,063
So you can see it as,

84
00:05:45,740 --> 00:05:50,683
that delta naught is,

85
00:05:53,440 --> 00:05:56,650
the expected value of wage

86
00:05:56,650 --> 00:06:00,740
given that someone is female

87
00:06:00,740 --> 00:06:03,860
and given some amount of education,

88
00:06:03,860 --> 00:06:08,490
or it's the difference,

89
00:06:08,490 --> 00:06:12,120
this last point of the expected wage

90
00:06:12,120 --> 00:06:16,490
for a given amount of education

91
00:06:16,490 --> 00:06:19,090
between females and non-females.

92
00:06:19,090 --> 00:06:22,740
And when we talk about this in class

93
00:06:22,740 --> 00:06:25,420
I will draw a simple graph.

94
00:06:25,420 --> 00:06:26,740
But you can think of it

95
00:06:26,740 --> 00:06:31,200
as what it actually is, an intercept shift.

96
00:06:31,200 --> 00:06:34,700
So in this example,

97
00:06:37,130 --> 00:06:42,130
the Y intercept for
non-females is beta naught

98
00:06:43,070 --> 00:06:46,580
and the intercept for females

99
00:06:46,580 --> 00:06:48,820
is beta naught plus delta naught.

100
00:06:48,820 --> 00:06:52,553
So it really does just
shift the intercept.
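
Written out, the model described here might look like the following. The names educ and u (the error term) are notation assumed for this sketch, and the second line holds only under the usual assumption that u has zero mean given female and educ.

```latex
\begin{align*}
\text{wage} &= \beta_0 + \delta_0\,\text{female} + \beta_1\,\text{educ} + u \\
\delta_0 &= E[\text{wage} \mid \text{female}=1,\ \text{educ}]
          - E[\text{wage} \mid \text{female}=0,\ \text{educ}] \\
\text{intercept (non-females)} &= \beta_0, \qquad
\text{intercept (females)} = \beta_0 + \delta_0
\end{align*}
```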

101
00:06:55,780 --> 00:06:58,990
One thing that you have
to be very careful of,

102
00:06:58,990 --> 00:07:03,563
is to not fall into the
dummy variable trap.

103
00:07:04,690 --> 00:07:08,180
Now, of course, this is
a well-known phenomenon.

104
00:07:08,180 --> 00:07:13,067
First discovered a long time
ago in a galaxy far, far away

105
00:07:15,270 --> 00:07:17,090
by Admiral Ackbar.

106
00:07:17,090 --> 00:07:20,140
But the bottom line is,

107
00:07:20,140 --> 00:07:25,020
you always have to have
an omitted category.

108
00:07:25,020 --> 00:07:29,410
So let's say that you're
interested again in two groups,

109
00:07:29,410 --> 00:07:32,207
Vermont residents, one,

110
00:07:32,207 --> 00:07:36,290
and residents of other states.

111
00:07:36,290 --> 00:07:38,050
And you've named them

112
00:07:38,050 --> 00:07:38,923
and,

113
00:07:40,570 --> 00:07:45,570
everyone is either a one or zero.

114
00:07:45,720 --> 00:07:47,520
So if you live in Vermont,

115
00:07:47,520 --> 00:07:51,120
you're a one for Vermont res

116
00:07:51,120 --> 00:07:53,740
and a zero for other res.

117
00:07:53,740 --> 00:07:55,960
And if you live in New York,

118
00:07:55,960 --> 00:08:00,170
say, you're a zero for Vermont res

119
00:08:00,170 --> 00:08:03,670
and a one for other res.

120
00:08:03,670 --> 00:08:05,320
So why not use both?

121
00:08:05,320 --> 00:08:08,600
Why not put them both into the regression?

122
00:08:08,600 --> 00:08:10,763
What's the problem here?

123
00:08:11,810 --> 00:08:13,750
I'll let you think
about that for a moment.

124
00:08:13,750 --> 00:08:16,123
Feel free to pause and think about it.

125
00:08:17,150 --> 00:08:21,403
And then I will tell you.

126
00:08:22,730 --> 00:08:27,020
So the problem is they always add to one.

127
00:08:27,020 --> 00:08:29,480
They are perfectly collinear.

128
00:08:29,480 --> 00:08:33,690
So adding up Vermont for any individual,

129
00:08:33,690 --> 00:08:38,690
Vermont res plus other res
equals one for any individual.

130
00:08:39,860 --> 00:08:42,530
So you can only include one.

131
00:08:42,530 --> 00:08:45,900
And the other way of thinking about it is,

132
00:08:45,900 --> 00:08:49,060
if you include Vermont res,

133
00:08:49,060 --> 00:08:53,830
the variable of other res
adds no new information.

134
00:08:53,830 --> 00:08:57,610
And actually, the
regression will fall apart

135
00:08:57,610 --> 00:09:00,720
because we have perfect collinearity.

136
00:09:00,720 --> 00:09:04,610
So you always have to have
at least one omitted group.
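
One way to see the trap numerically, as a sketch with made-up data and NumPy: with an intercept plus both dummies, the design matrix has three columns but only rank two, so the coefficients cannot be separately identified.

```python
import numpy as np

# Hypothetical data: 1 = lives in Vermont, 0 = lives elsewhere.
vermont_res = np.array([1, 0, 1, 0, 0, 1])
other_res = 1 - vermont_res            # always the mirror image
intercept = np.ones_like(vermont_res)

# Design matrix with the intercept and BOTH dummies (the trap).
X_trap = np.column_stack([intercept, vermont_res, other_res])
# Design matrix with one category omitted (other_res is the base group).
X_ok = np.column_stack([intercept, vermont_res])

print(np.all(vermont_res + other_res == 1))  # True
print(np.linalg.matrix_rank(X_trap))         # 2, although X_trap has 3 columns
print(np.linalg.matrix_rank(X_ok))           # 2, full column rank
```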

137
00:09:04,610 --> 00:09:06,540
And the way that you code that

138
00:09:08,000 --> 00:09:11,840
will determine what is your omitted group.

139
00:09:11,840 --> 00:09:13,810
So in this case,

140
00:09:13,810 --> 00:09:18,110
this Vermont res would be,
what is the intercept shift,

141
00:09:18,110 --> 00:09:21,160
or what is the effect
of a Vermont resident

142
00:09:21,160 --> 00:09:24,840
on our Y holding all else equal?

143
00:09:24,840 --> 00:09:29,840
And the comparison is those
who live in another state.

144
00:09:31,180 --> 00:09:33,293
And hopefully this all makes sense.

145
00:09:37,880 --> 00:09:40,530
Again, we can see this graphically

146
00:09:40,530 --> 00:09:44,790
as being literally an intercept shift.

147
00:09:44,790 --> 00:09:46,370
Note that this, I know,

148
00:09:46,370 --> 00:09:49,120
this graph I just got it from the Web.

149
00:09:49,120 --> 00:09:54,120
And it assumes that everyone
is either female or male.

150
00:09:54,590 --> 00:09:58,180
And of course that may or may not be true.

151
00:09:58,180 --> 00:10:03,180
But just you can think
of, on the graph here,

152
00:10:04,570 --> 00:10:08,530
what it says male would
just be non-females.

153
00:10:08,530 --> 00:10:11,810
So here, non-females are the base group,

154
00:10:11,810 --> 00:10:14,260
the benchmark or the omitted group.

155
00:10:14,260 --> 00:10:19,260
So their y-intercept is just whatever,

156
00:10:20,490 --> 00:10:22,460
alpha naught here or beta naught,

157
00:10:22,460 --> 00:10:25,810
as we have been denoting it.

158
00:10:25,810 --> 00:10:29,270
And the intercept for
females is beta naught

159
00:10:29,270 --> 00:10:31,590
plus delta naught,

160
00:10:31,590 --> 00:10:36,060
or here in the graph, alpha
naught plus delta naught.

161
00:10:36,060 --> 00:10:39,410
And you can see in this case,

162
00:10:39,410 --> 00:10:44,380
the coefficient delta
naught is less than zero.

163
00:10:44,380 --> 00:10:49,380
That the effect of female
on wages is negative.

164
00:10:50,520 --> 00:10:54,373
So holding education constant,

165
00:10:57,270 --> 00:10:58,863
in this case,

166
00:10:59,920 --> 00:11:03,560
the effect of female is negative.

167
00:11:03,560 --> 00:11:07,590
So we would predict on average,

168
00:11:07,590 --> 00:11:10,920
given an amount of education,

169
00:11:10,920 --> 00:11:12,410
that a female's wage

170
00:11:12,410 --> 00:11:16,053
would be delta naught less
than a non-female's wage.

171
00:11:20,270 --> 00:11:23,660
You can of course add
many other regressors.

172
00:11:23,660 --> 00:11:27,790
You may, not only education,
but you might have experience,

173
00:11:27,790 --> 00:11:30,703
how long have you been in
the workforce in general,

174
00:11:32,626 --> 00:11:33,459
and you'll have

175
00:11:33,459 --> 00:11:37,270
how long have they been with
this particular company?

176
00:11:42,063 --> 00:11:43,680
You could add a bunch of others.

177
00:11:43,680 --> 00:11:47,143
But how would you test
for gender discrimination?

178
00:11:49,310 --> 00:11:52,060
Basically, as it says up here,

179
00:11:52,060 --> 00:11:56,970
it would be a simple
T-test on our delta naught.
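
A rough sketch of that t-test in Python with NumPy, on simulated data. The sample size, the coefficients, and the assumed true shift of -2 are all invented for illustration, and the standard errors are the classical ones that assume constant error variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
female = rng.integers(0, 2, size=n)     # simulated 0/1 dummy
educ = rng.integers(8, 21, size=n)      # simulated years of education
# Simulated wages with an assumed true intercept shift of -2 for females.
wage = 5.0 + 1.0 * educ - 2.0 * female + rng.normal(0.0, 1.0, size=n)

# OLS of wage on an intercept, the female dummy, and education.
X = np.column_stack([np.ones(n), female, educ])
coef, *_ = np.linalg.lstsq(X, wage, rcond=None)
resid = wage - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])   # residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)       # classical OLS covariance matrix
se = np.sqrt(np.diag(cov))

delta_hat = coef[1]
t_stat = delta_hat / se[1]                  # tests H0: delta_0 = 0
print(delta_hat, t_stat)
```

The t statistic would then be compared with a critical value from the t distribution with n minus 3 degrees of freedom, one-tailed or two-tailed depending on the hypothesis.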

180
00:11:56,970 --> 00:12:01,150
And I will let you ponder
what would be the advantages

181
00:12:01,150 --> 00:12:05,210
or disadvantages of a
one or a two-tailed test.

182
00:12:05,210 --> 00:12:07,060
So think back to last week,

183
00:12:07,060 --> 00:12:12,060
what would be the advantages
to a one or two-tail test,

184
00:12:12,640 --> 00:12:15,263
and we will discuss it in class.

185
00:12:21,860 --> 00:12:24,370
But basically, what this analysis does is,

186
00:12:24,370 --> 00:12:26,740
holding all else equal,

187
00:12:26,740 --> 00:12:30,623
education, experience, et cetera,

188
00:12:31,660 --> 00:12:34,560
do females earn lower wages?

189
00:12:34,560 --> 00:12:39,560
If you could think about other
examples that you might use.

190
00:12:39,780 --> 00:12:42,540
Do Vermont students do better in class?

191
00:12:42,540 --> 00:12:44,330
That's one we thought of.

192
00:12:44,330 --> 00:12:49,330
Does owning a computer improve your GPA.

193
00:12:51,210 --> 00:12:54,420
And I think you could think
of a lot of other examples

194
00:12:54,420 --> 00:12:58,190
where you would have a
dummy variable as some,

195
00:12:58,190 --> 00:13:01,883
just to test some kind of a policy.

196
00:13:03,270 --> 00:13:06,910
And in some cases it might make sense

197
00:13:06,910 --> 00:13:11,210
such as give half the students a computer

198
00:13:11,210 --> 00:13:13,930
and not give one to the other half.

199
00:13:13,930 --> 00:13:16,490
That you could see the omitted group,

200
00:13:16,490 --> 00:13:18,570
those that are a zero,

201
00:13:18,570 --> 00:13:23,480
didn't get a computer as a control group.

202
00:13:23,480 --> 00:13:28,480
And the ones that got a computer
would be the experimental

203
00:13:29,880 --> 00:13:31,423
or the treatment group.

204
00:13:36,900 --> 00:13:38,160
There may be times

205
00:13:38,160 --> 00:13:42,763
when you want to include
more than one dummy variable.

206
00:13:46,220 --> 00:13:47,550
For example,

207
00:13:47,550 --> 00:13:50,550
does marital status have an effect?

208
00:13:50,550 --> 00:13:55,550
So that would be holding
your gender constant

209
00:13:55,930 --> 00:13:58,953
and all the other factors, education,

210
00:14:00,850 --> 00:14:04,893
does your marital status impact your wage?

211
00:14:06,780 --> 00:14:08,950
One of the things that you could do

212
00:14:08,950 --> 00:14:11,510
is that the sort of simplest thing,

213
00:14:11,510 --> 00:14:14,930
would just be to include one that says,

214
00:14:14,930 --> 00:14:19,310
say, married, a dummy variable married.

215
00:14:19,310 --> 00:14:23,113
One if you're married,
zero if you're not married.

216
00:14:24,680 --> 00:14:25,650
The thing is,

217
00:14:25,650 --> 00:14:30,297
this assumes that the effect
of marriage on females

218
00:14:33,020 --> 00:14:35,763
and non-females is the same.

219
00:14:37,100 --> 00:14:42,100
And it's been my experience,
just say maybe anecdotally,

220
00:14:42,630 --> 00:14:44,620
that, that is not true.

221
00:14:44,620 --> 00:14:49,620
That the effect of marital status is not
the same for men and women

222
00:15:00,270 --> 00:15:01,750
or other genders.

223
00:15:01,750 --> 00:15:03,380
That in many cases,

224
00:15:03,380 --> 00:15:08,380
since it's still sort of
a traditional structure

225
00:15:08,430 --> 00:15:09,263
in many ways,

226
00:15:09,263 --> 00:15:14,150
that women play a larger
role in care and so forth,

227
00:15:16,030 --> 00:15:21,030
that the effect of
being female and married

228
00:15:23,950 --> 00:15:28,430
is different than the effect
of being male or non-female

229
00:15:28,430 --> 00:15:29,263
and married.

230
00:15:30,700 --> 00:15:33,450
So what we could do here

231
00:15:33,450 --> 00:15:38,450
is create a dummy with each of,

232
00:15:42,530 --> 00:15:44,000
so there's four of them

233
00:15:44,000 --> 00:15:49,000
and create a dummy variable
for three out of the four.

234
00:15:50,330 --> 00:15:53,820
So we would omit single non-female.

235
00:15:53,820 --> 00:15:56,520
So we would have married non-female,

236
00:15:56,520 --> 00:15:59,210
single female, married female.

237
00:15:59,210 --> 00:16:03,320
And then there would be three
groups that we would look at,

238
00:16:05,350 --> 00:16:09,750
and this would create
a different intercept

239
00:16:09,750 --> 00:16:12,350
for each of the three groups

240
00:16:12,350 --> 00:16:13,480
with beta naught

241
00:16:13,480 --> 00:16:17,053
being the intercept for
the single non-female.
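
As a sketch, the three group dummies could be built like this in Python. The input tuples and the dictionary keys are hypothetical names chosen for this example; single non-female is the omitted group, so it gets zeros everywhere.

```python
# Hypothetical respondents: (is_female, is_married), both coded 0/1.
people = [(0, 0), (0, 1), (1, 0), (1, 1)]

rows = []
for is_female, is_married in people:
    rows.append({
        "married_nonfemale": int(is_female == 0 and is_married == 1),
        "single_female": int(is_female == 1 and is_married == 0),
        "married_female": int(is_female == 1 and is_married == 1),
    })

# The first person (single non-female) is the base group: all three dummies are 0.
for person, row in zip(people, rows):
    print(person, row)
```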

242
00:16:23,360 --> 00:16:28,360
That tells you something about
how to deal with nominal,

243
00:16:28,700 --> 00:16:33,230
where you could say, are you
this thing or are you not?

244
00:16:33,230 --> 00:16:34,710
But what about ordinal?

245
00:16:34,710 --> 00:16:38,000
So a five point scale, how often do you?

246
00:16:38,000 --> 00:16:40,890
Or a five point scale
of how likely are you?

247
00:16:40,890 --> 00:16:43,560
Or how satisfied are you?

248
00:16:43,560 --> 00:16:45,973
Or how much do you agree with this?

249
00:16:47,950 --> 00:16:52,710
Thinking back to the market research,

250
00:16:52,710 --> 00:16:57,710
many psychographics are probably
best captured by a scale.

251
00:16:59,690 --> 00:17:04,690
So say that you ask a five
point scale on your survey,

252
00:17:06,070 --> 00:17:07,180
you have data

253
00:17:07,180 --> 00:17:12,180
where folks write down some
number one through five,

254
00:17:13,430 --> 00:17:16,900
such as here, one equals
never, five equals always,

255
00:17:16,900 --> 00:17:18,963
how do we deal with these data?

256
00:17:24,690 --> 00:17:26,100
The most obvious answer

257
00:17:26,100 --> 00:17:30,510
would be to create up to four dummies.

258
00:17:30,510 --> 00:17:35,510
And in this case, so you
would create four of them.

259
00:17:36,670 --> 00:17:40,730
Those who say occasionally,
sometimes, often or always.

260
00:17:40,730 --> 00:17:45,730
So they would get a one
if they answered that,

261
00:17:46,140 --> 00:17:48,790
and zero everything else.

262
00:17:48,790 --> 00:17:52,890
And then never is the omitted group.

263
00:17:52,890 --> 00:17:56,280
So the folks who said never,

264
00:17:56,280 --> 00:18:00,170
their y-intercept is just beta naught,

265
00:18:00,170 --> 00:18:05,170
and those who say always, say,

266
00:18:05,610 --> 00:18:07,680
their y-intercept is the beta naught

267
00:18:07,680 --> 00:18:12,680
plus the delta for the always do variable.
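
A small sketch of the four-dummy coding in Python. The mapping of the middle labels to the numbers 2, 3, and 4 is an assumption made for this example; the lecture only fixes one as never and five as always.

```python
# Assumed labels for the five-point scale; "never" (1) is the omitted base group.
labels = {2: "occasionally", 3: "sometimes", 4: "often", 5: "always"}
responses = [1, 5, 3, 2, 4, 1]   # hypothetical survey answers

# One dictionary of four 0/1 dummies per respondent.
dummies = [
    {name: int(answer == level) for level, name in labels.items()}
    for answer in responses
]

for answer, row in zip(responses, dummies):
    print(answer, row)
```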

268
00:18:20,650 --> 00:18:25,650
What I most often do is to
find a cutoff that makes sense.

269
00:18:27,100 --> 00:18:32,100
That is logical and create
groups of roughly equal size.

270
00:18:35,460 --> 00:18:40,460
So find a place where
about half are above this

271
00:18:40,730 --> 00:18:45,730
and half are below this
and call it high or low,

272
00:18:45,920 --> 00:18:49,660
and just create one dummy.

273
00:18:49,660 --> 00:18:54,660
And the advantage of this is
it saves degrees of freedom.

274
00:18:55,630 --> 00:18:57,843
So, you know, be,

275
00:19:01,090 --> 00:19:05,300
if about half say three or less,

276
00:19:05,300 --> 00:19:09,170
and then another half say four or five,

277
00:19:09,170 --> 00:19:10,870
then you could just do it like that

278
00:19:10,870 --> 00:19:14,600
and call it high frequency

279
00:19:14,600 --> 00:19:19,600
and then one would be those
that said four or five

280
00:19:19,640 --> 00:19:20,740
on the scale.

281
00:19:20,740 --> 00:19:24,700
Zero would be those who
answered one through three.
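
The single-cutoff version might be sketched like this, with invented responses. Checking the share of ones is a quick way to see whether the two groups come out roughly equal in size.

```python
responses = [1, 2, 3, 3, 4, 4, 5, 5]   # hypothetical five-point answers

# 1 = answered four or five, 0 = answered one through three.
high_frequency = [1 if r >= 4 else 0 for r in responses]

share_high = sum(high_frequency) / len(high_frequency)
print(high_frequency)   # [0, 0, 0, 0, 1, 1, 1, 1]
print(share_high)       # 0.5
```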

282
00:19:24,700 --> 00:19:29,700
But the key thing is to do it
so that it's both logical,

283
00:19:31,030 --> 00:19:32,400
So that you can kinda say,

284
00:19:32,400 --> 00:19:35,550
oh, well, these folks are
high and these folks are low,

285
00:19:35,550 --> 00:19:40,550
as well as having about
equal numbers in each group.

286
00:19:41,960 --> 00:19:46,960
The problem of having them
where say 95% are ones

287
00:19:48,610 --> 00:19:52,930
and only 5% of the sample are zeros,

288
00:19:52,930 --> 00:19:56,210
is you basically have
a long column of ones

289
00:19:56,210 --> 00:20:01,210
with hardly any zeros in that.

290
00:20:01,760 --> 00:20:04,020
There's not a lot of information there.

291
00:20:04,020 --> 00:20:05,400
Almost everyone's a one.

292
00:20:05,400 --> 00:20:10,400
It really doesn't break out
the groups in any way,

293
00:20:10,810 --> 00:20:13,310
plus a long string of ones

294
00:20:13,310 --> 00:20:17,390
or a long string of
zeros in the data column

295
00:20:17,390 --> 00:20:22,390
is going to be highly
collinear with your intercept.

296
00:20:23,260 --> 00:20:24,093
So that's why

297
00:20:24,093 --> 00:20:28,230
you'll always want to create
groups of about equal numbers.

298
00:20:29,410 --> 00:20:34,410
And it may also make sense
to create more than one dummy

299
00:20:34,630 --> 00:20:37,470
to break it up into high, medium, low,

300
00:20:37,470 --> 00:20:39,593
or something like that.

301
00:20:40,460 --> 00:20:43,230
But again, you want it to be logical

302
00:20:43,230 --> 00:20:47,093
and have about equal
numbers in each group.

303
00:20:50,210 --> 00:20:51,043
So far,

304
00:20:51,043 --> 00:20:55,980
we've seen how a dummy variable
can change the intercept

305
00:20:57,940 --> 00:21:01,550
and we can have a number of dummies

306
00:21:01,550 --> 00:21:03,870
and even interact them.

307
00:21:03,870 --> 00:21:08,870
And that will change the
intercept of the various groups

308
00:21:09,050 --> 00:21:12,630
depending on if you're a one or a zero

309
00:21:12,630 --> 00:21:17,630
for the various regressors that we have.

310
00:21:18,740 --> 00:21:20,560
But it's also possible

311
00:21:20,560 --> 00:21:25,560
to interact a dummy with
a continuous variable,

312
00:21:26,120 --> 00:21:30,980
in which case it will
change the slope of a line.

313
00:21:30,980 --> 00:21:33,430
So let's say continuing on,

314
00:21:33,430 --> 00:21:38,430
look at the effect of education
and being female on wage,

315
00:21:39,460 --> 00:21:41,460
here you see

316
00:21:41,460 --> 00:21:46,170
that we can interact them and have,

317
00:21:46,170 --> 00:21:47,890
we would, in SPSS,

318
00:21:47,890 --> 00:21:52,890
compute a variable female times education

319
00:21:56,280 --> 00:21:58,930
and then run this.

320
00:21:58,930 --> 00:22:00,700
And in this case,

321
00:22:00,700 --> 00:22:05,700
you can see that not only are
there different intercepts

322
00:22:09,640 --> 00:22:11,470
for each group,

323
00:22:11,470 --> 00:22:14,710
but also a different slope of the line.

324
00:22:14,710 --> 00:22:15,543
So you can see

325
00:22:15,543 --> 00:22:17,523
what the intercept here

326
00:22:26,010 --> 00:22:27,060
in the graph,

327
00:22:27,060 --> 00:22:32,060
it's clear that when this D equals one,

328
00:22:32,760 --> 00:22:33,990
that it has an intercept shift.

329
00:22:33,990 --> 00:22:37,653
And that is what we have seen so far.

330
00:22:41,480 --> 00:22:46,480
A closer look at this graph
tells you what is going on here.

331
00:22:48,370 --> 00:22:52,483
So let's say that,

332
00:22:55,300 --> 00:22:58,160
again, we're looking
at the effect on wages,

333
00:22:58,160 --> 00:23:01,080
but let's say that in this world

334
00:23:01,080 --> 00:23:04,160
that being female is an advantage.

335
00:23:04,160 --> 00:23:09,160
So this beta two here

336
00:23:10,750 --> 00:23:13,900
is the dummy variable for female.

337
00:23:13,900 --> 00:23:18,900
So you can see females have an
advantage on the y-intercept.

338
00:23:19,030 --> 00:23:23,983
And not only that, but when
you interact it with the X

339
00:23:28,820 --> 00:23:31,900
such as education or experience,

340
00:23:31,900 --> 00:23:35,600
these lines diverge, so that the slope,

341
00:23:35,600 --> 00:23:40,600
the effect of X on Y

342
00:23:42,280 --> 00:23:47,280
if you're in the D one equals
zero group is just beta one.

343
00:23:49,290 --> 00:23:51,033
And if it is,

344
00:23:54,720 --> 00:23:58,570
if you're in the D equals one group

345
00:23:58,570 --> 00:24:03,240
then the slope is beta
one plus beta three.

346
00:24:03,240 --> 00:24:08,240
So in this case, this group
not only starts off better,

347
00:24:10,040 --> 00:24:15,040
but the returns to
education are even greater.

348
00:24:16,050 --> 00:24:19,340
So you can see it's both
a change in the intercept

349
00:24:19,340 --> 00:24:21,920
and the change in the slope.
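
In symbols, the interaction model described here might be written as follows. The numbering of the betas follows the graph as narrated (beta two on the dummy, beta three on the interaction); y, x, D, and u are generic names assumed for this sketch.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x + \beta_2 D + \beta_3 (D \cdot x) + u \\
D = 0:&\quad E[y \mid x] = \beta_0 + \beta_1 x \\
D = 1:&\quad E[y \mid x] = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,x
\end{align*}
```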

350
00:24:21,920 --> 00:24:24,480
And I will run an example in class

351
00:24:24,480 --> 00:24:28,333
and put this on the board so
you can see it more clearly.

352
00:24:32,790 --> 00:24:36,590
You may think that your dummy variable

353
00:24:36,590 --> 00:24:40,673
has an effect on all of your regressors.

354
00:24:42,330 --> 00:24:46,600
So in this example, the dummy of female

355
00:24:47,650 --> 00:24:52,560
might not only have an effect
on returns to education

356
00:24:52,560 --> 00:24:57,560
but returns to experience
and age and other things.

357
00:24:59,390 --> 00:25:04,213
So in theory, you could have
an interaction for each one

358
00:25:05,100 --> 00:25:09,190
but this is time consuming
and it also really,

359
00:25:09,190 --> 00:25:12,023
starts to eat into your
degrees of freedom.

360
00:25:13,180 --> 00:25:16,630
So what you can do is split the sample

361
00:25:17,590 --> 00:25:22,460
and use an F-test, a
special kind of F-test

362
00:25:22,460 --> 00:25:24,340
called a Chow test,

363
00:25:24,340 --> 00:25:29,340
to see are the two groups really
different from each other?

364
00:25:32,060 --> 00:25:35,793
Do sort of females and non-females,

365
00:25:36,720 --> 00:25:40,670
are their betas the same or not?

366
00:25:40,670 --> 00:25:45,670
So in this case, you would
run three regressions.

367
00:25:47,110 --> 00:25:50,600
First with all of the observations,

368
00:25:50,600 --> 00:25:53,030
next, only the females,

369
00:25:53,030 --> 00:25:57,730
and then in the third
case, only the non-females.

370
00:25:57,730 --> 00:26:00,510
And you wanna look at the SSR

371
00:26:00,510 --> 00:26:05,463
at the sum of squared residuals.

372
00:26:11,864 --> 00:26:14,260
So you'll run these three regressions

373
00:26:14,260 --> 00:26:17,290
and you can split the sample.

374
00:26:17,290 --> 00:26:19,490
That's how you can do it in SPSS.

375
00:26:19,490 --> 00:26:21,290
I'll show you how to do that.

376
00:26:21,290 --> 00:26:24,120
And you get three of them.

377
00:26:24,120 --> 00:26:29,120
And you look at the SSR and
you plug it into this formula.

378
00:26:29,750 --> 00:26:31,680
So you can see

379
00:26:31,680 --> 00:26:36,680
it is the relative change
in explanatory power

380
00:26:39,170 --> 00:26:43,390
by running it as a group

381
00:26:43,390 --> 00:26:48,390
versus running each group individually.

382
00:26:53,060 --> 00:26:57,250
Note that the pooled SSR
will always be greater

383
00:26:57,250 --> 00:26:59,460
because you're gonna get a better fit

384
00:26:59,460 --> 00:27:02,840
if you give each group their own betas.

385
00:27:02,840 --> 00:27:05,440
But when you sort of adjust

386
00:27:05,440 --> 00:27:08,083
for the change in degrees of freedom,

387
00:27:10,140 --> 00:27:11,950
what happens here?

388
00:27:11,950 --> 00:27:16,080
So you do that and you
get this test stat,

389
00:27:16,080 --> 00:27:20,783
and you compare it with an F distribution

390
00:27:22,650 --> 00:27:25,530
with K plus one and N minus two K

391
00:27:25,530 --> 00:27:27,593
minus two degrees of freedom.

392
00:27:28,930 --> 00:27:33,930
And the null is that the betas
across the groups are equal.

393
00:27:37,480 --> 00:27:42,480
So you can see that if the
sum of squared residuals

394
00:27:45,460 --> 00:27:47,520
are about the same,

395
00:27:47,520 --> 00:27:52,100
that the pooled minus
the sum of the other two

396
00:27:52,100 --> 00:27:54,630
if they're really close,

397
00:27:54,630 --> 00:27:59,500
then their betas are very much alike,

398
00:27:59,500 --> 00:28:04,453
the numerator of your
F stat is very small.

399
00:28:05,300 --> 00:28:10,300
We will have, and we would
fail to reject that null,

400
00:28:11,220 --> 00:28:15,220
and that we would think
that the betas are more

401
00:28:15,220 --> 00:28:16,270
or less the same.

402
00:28:16,270 --> 00:28:17,890
But if we see a big change,

403
00:28:17,890 --> 00:28:22,890
if the betas in the two
groups are a much better fit,

404
00:28:24,680 --> 00:28:26,680
again, we're gonna see a big F stat

405
00:28:26,680 --> 00:28:30,010
and a small P, and we will reject the null

406
00:28:30,010 --> 00:28:32,400
and say that these two groups

407
00:28:32,400 --> 00:28:35,893
do not experience the
world in the same way.
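
The three regressions and the resulting F statistic can be sketched in Python with NumPy. Everything about the data is simulated and assumed here: the two groups are deliberately given different intercepts and slopes, so the statistic should come out large. The helper ssr is a name invented for this example.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

rng = np.random.default_rng(1)
n, k = 400, 1                                  # k = regressors besides the intercept
female = np.arange(n) % 2                      # alternate 0/1 so the groups are equal in size
educ = rng.integers(8, 21, size=n).astype(float)
# Simulated wages where the two groups have different intercepts and slopes.
wage = np.where(female == 1, 8.0 + 1.5 * educ, 5.0 + 1.0 * educ)
wage = wage + rng.normal(0.0, 1.0, size=n)

X = np.column_stack([np.ones(n), educ])
ssr_pooled = ssr(X, wage)                                  # everyone together
ssr_f = ssr(X[female == 1], wage[female == 1])             # females only
ssr_nf = ssr(X[female == 0], wage[female == 0])            # non-females only

ssr_split = ssr_f + ssr_nf
chow_f = ((ssr_pooled - ssr_split) / (k + 1)) / (ssr_split / (n - 2 * (k + 1)))
print(chow_f)
```

The value of chow_f would then be compared with an F distribution with k plus one and n minus two k minus two degrees of freedom.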

408
00:28:37,420 --> 00:28:39,513
And here is just a better,

409
00:28:41,000 --> 00:28:42,130
(Instructor inhales)

410
00:28:42,130 --> 00:28:45,910
(Instructor exhales) a
more neatly laid out way

411
00:28:45,910 --> 00:28:50,653
to express this mathematically.

412
00:28:53,250 --> 00:28:55,030
A few other asides about a Chow test.

413
00:28:55,030 --> 00:28:58,830
That you can't use the R
squared form of the F-test

414
00:28:58,830 --> 00:29:01,233
that we learned about.

415
00:29:02,490 --> 00:29:06,360
And it also assumes homoscedasticity.

416
00:29:06,360 --> 00:29:11,360
So you would need to
test for homoscedasticity

417
00:29:12,680 --> 00:29:16,460
and make sure that is the case.

418
00:29:16,460 --> 00:29:18,763
And we'll learn how to do that soon.

419
00:29:20,670 --> 00:29:23,290
This sort of Chow test

420
00:29:23,290 --> 00:29:28,290
is also used in sort of regime switching.

421
00:29:29,000 --> 00:29:31,190
So if there's some event

422
00:29:31,190 --> 00:29:36,060
that you think really
sort of changes the world.

423
00:29:36,060 --> 00:29:37,963
So you might wanna look at

424
00:29:38,810 --> 00:29:43,610
the years of the Trump
administration on immigration

425
00:29:43,610 --> 00:29:46,150
or on trade.

426
00:29:46,150 --> 00:29:51,140
Because President Trump

427
00:29:51,140 --> 00:29:54,890
greatly changed how we
go about our business

428
00:29:54,890 --> 00:29:56,410
in these things.

429
00:29:56,410 --> 00:29:57,990
We could see like,

430
00:29:57,990 --> 00:30:02,990
are the betas in sort of
before and after Trump,

431
00:30:03,240 --> 00:30:07,763
or before and during Trump the same?

432
00:30:08,640 --> 00:30:13,633
Or so was this like
really a regime change?

433
00:30:15,600 --> 00:30:18,050
Did it like fundamentally change

434
00:30:18,050 --> 00:30:19,930
the nature of the model,

435
00:30:19,930 --> 00:30:22,963
the nature of the betas or not?
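
[Editor's sketch] A minimal illustration of a Chow test across two regimes, assuming one regressor and simulated (made-up) data in which the slope shifts from 2 to 5. The helper names `ols_ssr` and `chow_f` are hypothetical, not from the lecture, and the code uses only the Python standard library.

```python
import random

def ols_ssr(x, y):
    """Fit y = b0 + b1*x by least squares; return the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def chow_f(x1, y1, x2, y2, n_params=2):
    """Chow F statistic; n_params counts the intercept plus the slopes."""
    ssr_pooled = ols_ssr(x1 + x2, y1 + y2)          # one set of betas for everyone
    ssr_split = ols_ssr(x1, y1) + ols_ssr(x2, y2)   # separate betas per regime
    n = len(x1) + len(x2)
    numerator = (ssr_pooled - ssr_split) / n_params
    denominator = ssr_split / (n - 2 * n_params)
    return numerator / denominator

# Simulated data (an assumption for illustration only).
rng = random.Random(0)
x_before = [rng.uniform(0, 10) for _ in range(100)]
x_after = [rng.uniform(0, 10) for _ in range(100)]
y_before = [1 + 2 * x + rng.gauss(0, 1) for x in x_before]
y_after = [1 + 5 * x + rng.gauss(0, 1) for x in x_after]   # slope changes
y_same = [1 + 2 * x + rng.gauss(0, 1) for x in x_after]    # slope unchanged

f_break = chow_f(x_before, y_before, x_after, y_after)
f_no_break = chow_f(x_before, y_before, x_after, y_same)
print(f"F with a regime change: {f_break:.1f}")
print(f"F without a regime change: {f_no_break:.2f}")
```

A large F (and so a small p-value) points to different betas before and after the event. The p-value itself would come from an F distribution with `n_params` and `n - 2 * n_params` degrees of freedom, which this sketch does not compute.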

436
00:30:27,400 --> 00:30:30,693
The last thing that I
want to cover here is,

437
00:30:32,200 --> 00:30:37,200
how do we deal with it when our
dependent variable is a dummy?

438
00:30:37,220 --> 00:30:39,600
So that would be one.

439
00:30:39,600 --> 00:30:44,010
So our dependent, our Y, is a one or a zero.

440
00:30:44,010 --> 00:30:48,140
Are you in a group or not?

441
00:30:48,140 --> 00:30:50,770
Did you do some action or not?

442
00:30:50,770 --> 00:30:53,170
Do you own a thing or not?

443
00:30:53,170 --> 00:30:54,420
Do you own a bike?

444
00:30:54,420 --> 00:30:56,060
Do you smoke?

445
00:30:56,060 --> 00:30:57,480
Do you pass this class?

446
00:30:57,480 --> 00:30:59,130
I hope everybody's a one, of course.

447
00:30:59,130 --> 00:31:03,253
But sometimes we may want to model

448
00:31:04,424 --> 00:31:06,623
a binary dependent variable.

449
00:31:07,910 --> 00:31:11,290
And when we have the model
that we've worked with,

450
00:31:11,290 --> 00:31:14,150
this Y equals beta naught
plus beta one X one

451
00:31:14,150 --> 00:31:17,333
plus dot dot dot plus beta K XK,

452
00:31:20,530 --> 00:31:23,500
the beta kind of doesn't really make sense

453
00:31:23,500 --> 00:31:28,500
as how Y changes as X changes,

454
00:31:28,680 --> 00:31:33,680
because Y can only take
the values one or zero.

455
00:31:34,270 --> 00:31:37,780
So it either changes or it doesn't.

456
00:31:37,780 --> 00:31:40,863
So what we do is we
look at the probability.

457
00:31:51,140 --> 00:31:52,690
So here,

458
00:31:52,690 --> 00:31:57,473
we have the probability of Y equals one

459
00:31:59,540 --> 00:32:02,163
given X is the expected value of Y.

460
00:32:05,860 --> 00:32:06,920
Given X.

461
00:32:06,920 --> 00:32:09,470
So you could think about it as,

462
00:32:09,470 --> 00:32:12,053
if you put everybody's Xs in,

463
00:32:13,440 --> 00:32:16,090
and then calculate Y,

464
00:32:16,090 --> 00:32:19,430
that would be the
probability that they are a one.

465
00:32:19,430 --> 00:32:22,330
What we would guess,
what we would forecast,

466
00:32:22,330 --> 00:32:26,440
if someone gave all of
these Xs on a survey,

467
00:32:26,440 --> 00:32:31,440
what is the probability that
they would be a Y equals one?

468
00:32:31,700 --> 00:32:35,183
What's the probability
that they own a bike?

469
00:32:36,150 --> 00:32:39,310
Yeah, and we can think of that beta J

470
00:32:39,310 --> 00:32:44,310
as the change in the probability
that you would be a one

471
00:32:45,960 --> 00:32:47,523
holding all else equal.
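
[Editor's sketch] Written out, the model described here is usually called the linear probability model. The notation below is assumed from the spoken equation (beta naught through beta K, X one through X K) rather than taken from a slide.

```latex
P(Y = 1 \mid X) = E(Y \mid X) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k

\Delta P(Y = 1 \mid X) = \beta_j \, \Delta X_j \quad \text{(all other } X \text{ held fixed)}
```

So a beta J of 0.05 would mean a one-unit increase in X J raises the predicted probability of Y equals one by five percentage points.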

472
00:32:55,500 --> 00:32:59,100
This model is good
because it's very simple

473
00:32:59,100 --> 00:33:00,810
and straight forward,

474
00:33:00,810 --> 00:33:05,810
but you could get Y hats
that would be less than zero

475
00:33:06,750 --> 00:33:08,400
or greater than one,

476
00:33:08,400 --> 00:33:10,640
which of course makes no sense,

477
00:33:10,640 --> 00:33:14,170
because you can't have
negative probability

478
00:33:14,170 --> 00:33:17,320
nor a probability greater than one.
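
[Editor's sketch] A small illustration of the out-of-range problem, assuming one regressor and eight made-up observations. The variable `x` is a hypothetical score and `y` is 1 if the person owns a bike. The helper `fit_line` is not from the lecture; it is plain one-variable least squares.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: y is a 0/1 dummy, so this is a linear probability model.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]   # Y hats, read as probabilities

# Collect the fitted "probabilities" that fall below 0 or above 1.
out_of_range = [(xi, round(p, 3)) for xi, p in zip(x, fitted) if p < 0 or p > 1]

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print("fitted values outside [0, 1]:", out_of_range)
```

With these numbers the slope is about 0.167, so each extra unit of `x` adds roughly 17 percentage points to the predicted probability, and the fitted values at `x = 1` and `x = 8` land below zero and above one.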

479
00:33:17,320 --> 00:33:20,004
These are very often heteroscedastic.

480
00:33:20,004 --> 00:33:21,280
So you would need to deal with that,

481
00:33:21,280 --> 00:33:22,890
which we will learn how to do.
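
[Editor's sketch] One way to see why the variance is not constant: a zero-one variable has variance p times one minus p, and here p depends on X. The notation is assumed, following the probability statement earlier in the lecture.

```latex
\operatorname{Var}(Y \mid X) = P(Y = 1 \mid X)\,\bigl[ 1 - P(Y = 1 \mid X) \bigr]
```

Because the right-hand side moves with X, the error variance differs across observations unless every slope is zero.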

482
00:33:22,890 --> 00:33:26,470
And in truth, there are
better ways to deal with this.

483
00:33:26,470 --> 00:33:28,900
But I just wanted to
sort of show you this.

484
00:33:28,900 --> 00:33:30,890
This is one way.

485
00:33:30,890 --> 00:33:35,130
It can work well in certain cases,

486
00:33:35,130 --> 00:33:39,280
but we're gonna learn a
better way to deal with it

487
00:33:42,320 --> 00:33:43,903
in a few weeks.

488
00:33:46,420 --> 00:33:48,343
That's the end, and thank you.