1 00:00:02,760 --> 00:00:04,990 - [Instructor] Hello, and welcome to the video lecture 2 00:00:04,990 --> 00:00:06,863 on dummy variables. 3 00:00:10,710 --> 00:00:13,920 So far, all of our variables, 4 00:00:13,920 --> 00:00:16,340 all the ones that we've worked with, 5 00:00:16,340 --> 00:00:20,230 have been numerical, for the most part ratio, 6 00:00:20,230 --> 00:00:23,543 but all can be expressed as a true number. 7 00:00:25,200 --> 00:00:27,810 But many important attributes 8 00:00:27,810 --> 00:00:31,500 that we use in social science 9 00:00:31,500 --> 00:00:35,700 and other kinds of investigations are nominal or ordinal. 10 00:00:35,700 --> 00:00:40,320 Thinking about marketing research, 11 00:00:40,320 --> 00:00:43,430 you might wanna think about firmographic, demographic, 12 00:00:43,430 --> 00:00:45,510 geographic, psychographic, 13 00:00:45,510 --> 00:00:50,510 all of these may need a nominal or ordinal variable 14 00:00:53,410 --> 00:00:54,630 to express them. 15 00:00:54,630 --> 00:00:57,763 Such as, do you live in Vermont? 16 00:00:59,410 --> 00:01:02,524 Or your gender 17 00:01:02,524 --> 00:01:04,320 or your race 18 00:01:04,320 --> 00:01:08,330 or your attitude about certain things. 19 00:01:08,330 --> 00:01:13,330 So it's important to have the ability to express these 20 00:01:16,000 --> 00:01:20,083 and to be able to work with these in regression analysis. 21 00:01:27,150 --> 00:01:32,150 The way that we address this is to create one or more binary 22 00:01:33,810 --> 00:01:36,453 or so-called dummy variables. 23 00:01:37,570 --> 00:01:42,570 So these can be expressed as zero, one. 24 00:01:44,360 --> 00:01:47,470 One if you are that thing, zero if you're not. 25 00:01:47,470 --> 00:01:49,733 So Vermont resident. 26 00:01:56,430 --> 00:01:58,570 And for me, it makes more sense 27 00:01:58,570 --> 00:02:03,570 to do it like this than to call it residence or state. 
28 00:02:04,160 --> 00:02:07,890 Because if you call it Vermont resident 29 00:02:07,890 --> 00:02:10,960 then you're sure what the one stands for, 30 00:02:10,960 --> 00:02:12,260 where if it's state 31 00:02:12,260 --> 00:02:14,723 it's like I don't know what state is a one. 32 00:02:16,600 --> 00:02:19,130 Certified organic. 33 00:02:19,130 --> 00:02:23,038 One of my favorites, of course, is full professor. 34 00:02:23,038 --> 00:02:27,380 And, 35 00:02:27,380 --> 00:02:29,360 gender may be expressed 36 00:02:29,360 --> 00:02:33,320 by coding somebody as one if they're a female 37 00:02:33,320 --> 00:02:36,920 and zero if they're not. 38 00:02:36,920 --> 00:02:38,460 In some cases, 39 00:02:38,460 --> 00:02:40,960 a series of them may be needed. 40 00:02:40,960 --> 00:02:43,400 If there's more than one state, 41 00:02:43,400 --> 00:02:45,390 more than, Vermont or not, 42 00:02:45,390 --> 00:02:47,660 but you might wanna also say, 43 00:02:47,660 --> 00:02:52,660 do you live in New Hampshire, Maine, et cetera? 44 00:02:52,810 --> 00:02:57,810 You may need to express gender as more than two. 45 00:02:58,040 --> 00:02:59,870 So you could have male, female 46 00:02:59,870 --> 00:03:03,350 and a third such as non-binary 47 00:03:03,350 --> 00:03:06,483 or none of these. 48 00:03:12,920 --> 00:03:16,480 So focusing a bit on gender. 49 00:03:16,480 --> 00:03:19,023 This can be the single, 50 00:03:20,750 --> 00:03:24,130 using a single dummy in a regression 51 00:03:24,130 --> 00:03:28,633 such as how does being female impact wage? 52 00:03:32,828 --> 00:03:35,773 Do females own or earn more or less 53 00:03:36,790 --> 00:03:41,790 or the same as other genders? 54 00:03:42,010 --> 00:03:44,090 So, and you might wanna think about, 55 00:03:44,090 --> 00:03:49,090 do Vermont resident students have higher or lower 56 00:03:50,940 --> 00:03:54,923 or the same GPA at UVM 57 00:03:55,770 --> 00:04:00,063 as students that are not from Vermont. 
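The coding described so far can be sketched in a few lines of Python. This is only an illustration: the list of states and the helper name `dummy` are made up for the example, and every state not given its own dummy falls into the omitted comparison group.

```python
# Hypothetical responses to "which state do you live in?"
states = ["Vermont", "New Hampshire", "Maine", "New York", "Vermont"]

def dummy(values, category):
    """Return 1 where the value equals the category, 0 otherwise."""
    return [1 if v == category else 0 for v in values]

# Name the variable after the category coded as 1,
# so it is always clear what a 1 stands for.
vermont_resident = dummy(states, "Vermont")

# A series of dummies when more than one state matters.
# Anyone from a state not listed here is a zero on all of them.
state_dummies = {
    name.lower().replace(" ", "_") + "_resident": dummy(states, name)
    for name in ["Vermont", "New Hampshire", "Maine"]
}

print(vermont_resident)
print(state_dummies)
```

Naming each column after its "one" category is a design choice that follows the point made above about Vermont resident versus state.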
58 00:04:01,010 --> 00:04:03,770 So to ask the first question, 59 00:04:03,770 --> 00:04:08,653 we have a very simple model here with two regressors, 60 00:04:09,600 --> 00:04:11,930 female and education. 61 00:04:11,930 --> 00:04:14,740 So how many years of education 62 00:04:16,890 --> 00:04:21,330 and the effect of being female on wage. 63 00:04:21,330 --> 00:04:25,640 You can think about controlling for gender. 64 00:04:25,640 --> 00:04:27,130 What's the effect of education? 65 00:04:27,130 --> 00:04:29,740 And controlling for education, 66 00:04:29,740 --> 00:04:31,723 what is the effect of gender? 67 00:04:32,840 --> 00:04:36,550 And note that the Wooldridge books use this funky D, 68 00:04:36,550 --> 00:04:40,500 this delta D for dummy variables. 69 00:04:40,500 --> 00:04:45,133 And I'm going to continue with this convention. 70 00:04:48,380 --> 00:04:49,880 Here's our model wage. 71 00:04:49,880 --> 00:04:52,780 We have female and we have education. 72 00:04:52,780 --> 00:04:57,780 And the best way to think about the value of this delta 73 00:04:59,630 --> 00:05:03,343 is to think of it as an intercept shift. 74 00:05:04,280 --> 00:05:08,130 So depending on its magnitude, 75 00:05:08,130 --> 00:05:13,130 if it's greater than zero or less than zero, 76 00:05:13,220 --> 00:05:16,430 it's going to shift the intercept up and down. 77 00:05:16,430 --> 00:05:18,483 So you can think about it as, 78 00:05:20,010 --> 00:05:23,940 let's assume, or let's hypothesize 79 00:05:23,940 --> 00:05:28,570 that delta naught is less than zero. 80 00:05:28,570 --> 00:05:30,010 That all else equal, 81 00:05:30,010 --> 00:05:34,723 being female has a negative effect on wages. 82 00:05:36,400 --> 00:05:40,420 Which I think, historically, has largely been true. 
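Written out, the two-regressor model described here looks roughly as follows. The symbols follow the lecture (delta naught on the dummy, beta naught for the intercept); the error term u and the assumption that it has zero mean given female and education are the usual textbook setup and are an addition here, not something stated in the recording.

```latex
\[
\begin{aligned}
\text{wage} &= \beta_0 + \delta_0\,\text{female} + \beta_1\,\text{educ} + u \\
\delta_0 &= E(\text{wage}\mid \text{female}=1,\ \text{educ})
          - E(\text{wage}\mid \text{female}=0,\ \text{educ}) \\
\text{intercept for non-females} &= \beta_0,
\qquad
\text{intercept for females} = \beta_0 + \delta_0
\end{aligned}
\]
```

The hypothesis mentioned above, that being female has a negative effect all else equal, corresponds to delta naught being less than zero.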
83 00:05:40,420 --> 00:05:44,063 So you can see it as, 84 00:05:45,740 --> 00:05:50,683 that delta naught is 85 00:05:53,440 --> 00:05:56,650 the difference in the expected value of wage 86 00:05:56,650 --> 00:06:00,740 given that someone is female 87 00:06:00,740 --> 00:06:03,860 and given some amount of education, 88 00:06:03,860 --> 00:06:08,490 that is, it's the difference, 89 00:06:08,490 --> 00:06:12,120 this last point, in the expected wage 90 00:06:12,120 --> 00:06:16,490 for a given amount of education 91 00:06:16,490 --> 00:06:19,090 between females and non-females. 92 00:06:19,090 --> 00:06:22,740 And when we talk about this in class 93 00:06:22,740 --> 00:06:25,420 I will draw a simple graph. 94 00:06:25,420 --> 00:06:26,740 But you can think of it 95 00:06:26,740 --> 00:06:31,200 as it actually is, an intercept shift. 96 00:06:31,200 --> 00:06:34,700 So in this example, 97 00:06:37,130 --> 00:06:42,130 the Y intercept for non-females is beta naught 98 00:06:43,070 --> 00:06:46,580 and the intercept for females 99 00:06:46,580 --> 00:06:48,820 is beta naught plus delta naught. 100 00:06:48,820 --> 00:06:52,553 So it really does just shift the intercept. 101 00:06:55,780 --> 00:06:58,990 One thing that you have to be very careful of, 102 00:06:58,990 --> 00:07:03,563 is to not fall into the dummy variable trap. 103 00:07:04,690 --> 00:07:08,180 Now, of course, this is a well-known phenomenon. 104 00:07:08,180 --> 00:07:13,067 First discovered a long time ago in a galaxy far, far away 105 00:07:15,270 --> 00:07:17,090 by Admiral Ackbar. 106 00:07:17,090 --> 00:07:20,140 But the bottom line is, 107 00:07:20,140 --> 00:07:25,020 you always have to have an omitted category. 108 00:07:25,020 --> 00:07:29,410 So let's say that you're interested again in two groups, 109 00:07:29,410 --> 00:07:32,207 Vermont residents, one, 110 00:07:32,207 --> 00:07:36,290 and residents of other states. 
111 00:07:36,290 --> 00:07:38,050 And you've named them 112 00:07:38,050 --> 00:07:38,923 and, 113 00:07:40,570 --> 00:07:45,570 everyone is either a one or zero. 114 00:07:45,720 --> 00:07:47,520 So if you live in Vermont, 115 00:07:47,520 --> 00:07:51,120 you're a one for Vermont res 116 00:07:51,120 --> 00:07:53,740 and a zero for other res. 117 00:07:53,740 --> 00:07:55,960 And if you live in New York, 118 00:07:55,960 --> 00:08:00,170 say, you're a zero for Vermont res 119 00:08:00,170 --> 00:08:03,670 and a one for other res. 120 00:08:03,670 --> 00:08:05,320 So why not use both? 121 00:08:05,320 --> 00:08:08,600 Why not put them both into the regression? 122 00:08:08,600 --> 00:08:10,763 What's the problem here? 123 00:08:11,810 --> 00:08:13,750 I'll let you think about that for a moment. 124 00:08:13,750 --> 00:08:16,123 Feel free to pause and think about it. 125 00:08:17,150 --> 00:08:21,403 And then I will tell you. 126 00:08:22,730 --> 00:08:27,020 So the problem is they always add to one. 127 00:08:27,020 --> 00:08:29,480 They are perfectly collinear. 128 00:08:29,480 --> 00:08:33,690 So adding them up, 129 00:08:33,690 --> 00:08:38,690 Vermont res plus other res equals one for any individual. 130 00:08:39,860 --> 00:08:42,530 So you can only include one. 131 00:08:42,530 --> 00:08:45,900 And the other way of thinking about it is, 132 00:08:45,900 --> 00:08:49,060 if you include Vermont res, 133 00:08:49,060 --> 00:08:53,830 the variable of other res adds no new information. 134 00:08:53,830 --> 00:08:57,610 And actually, the regression will fall apart 135 00:08:57,610 --> 00:09:00,720 because we have perfect collinearity. 136 00:09:00,720 --> 00:09:04,610 So you always have to have at least one omitted group. 137 00:09:04,610 --> 00:09:06,540 And the way that you code that 138 00:09:08,000 --> 00:09:11,840 will determine what is your omitted group. 
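Here is a small sketch of why the regression falls apart. The five individuals are invented for illustration, and the column names `vermont_res` and `other_res` just mirror the ones used above. With an intercept in the model, the two dummies add up to the intercept column, so the matrix X'X has a determinant of zero and cannot be inverted.

```python
# Hypothetical sample of five people: 1 = lives in Vermont, 0 = does not.
vermont_res = [1, 0, 1, 0, 0]
other_res = [1 - v for v in vermont_res]      # the complementary dummy
intercept = [1] * len(vermont_res)            # the constant term

# For every individual the two dummies sum to the intercept column.
sums_to_one = all(v + o == c for v, o, c in zip(vermont_res, other_res, intercept))

def cross_products(columns):
    """Build X'X from a list of data columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Both dummies plus the intercept: X'X is singular (the dummy variable trap).
det_with_both = det3(cross_products([intercept, vermont_res, other_res]))

# Omit one group: X'X is invertible again.
det_with_one = det2(cross_products([intercept, vermont_res]))

print(sums_to_one, det_with_both, det_with_one)
```

Dropping `other_res` makes residents of other states the omitted group; dropping `vermont_res` instead would make Vermont residents the comparison.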
139 00:09:11,840 --> 00:09:13,810 So in this case, 140 00:09:13,810 --> 00:09:18,110 the coefficient on Vermont res would be the intercept shift, 141 00:09:18,110 --> 00:09:21,160 or the effect of being a Vermont resident 142 00:09:21,160 --> 00:09:24,840 on our Y holding all else equal. 143 00:09:24,840 --> 00:09:29,840 And the comparison is those who live in another state. 144 00:09:31,180 --> 00:09:33,293 And hopefully this all makes sense. 145 00:09:37,880 --> 00:09:40,530 Again, we can see this graphically 146 00:09:40,530 --> 00:09:44,790 as being literally an intercept shift. 147 00:09:44,790 --> 00:09:46,370 Note that this graph, 148 00:09:46,370 --> 00:09:49,120 I just got it from the Web. 149 00:09:49,120 --> 00:09:54,120 And it assumes that everyone is either female or male. 150 00:09:54,590 --> 00:09:58,180 And of course that may or may not be true. 151 00:09:58,180 --> 00:10:03,180 But you can think of, on the graph here, 152 00:10:04,570 --> 00:10:08,530 where it says male, that would just be non-females. 153 00:10:08,530 --> 00:10:11,810 So here, non-females are the base group, 154 00:10:11,810 --> 00:10:14,260 the benchmark or the omitted group. 155 00:10:14,260 --> 00:10:19,260 So their y-intercept is just whatever, 156 00:10:20,490 --> 00:10:22,460 alpha naught here or beta naught, 157 00:10:22,460 --> 00:10:25,810 as we have been denoting it. 158 00:10:25,810 --> 00:10:29,270 And the intercept for females is beta naught 159 00:10:29,270 --> 00:10:31,590 plus delta naught, 160 00:10:31,590 --> 00:10:36,060 or here in the graph, alpha naught plus delta naught. 161 00:10:36,060 --> 00:10:39,410 And you can see in this case, 162 00:10:39,410 --> 00:10:44,380 the coefficient delta naught is less than zero. 163 00:10:44,380 --> 00:10:49,380 That the effect of female on wages is negative. 
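To see the intercept shift come out of an actual fit, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the eight data points are invented and noise-free (wage = 5 + 2 times education, minus 3 for females), and the helper `ols` is a small hand-rolled least-squares solver, not the SPSS procedure used in the course.

```python
def ols(X, y):
    """Least squares via the normal equations (X'X) b = X'y, Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Invented, noise-free data: wage = 5 + 2*educ - 3*female
educ = [10, 12, 14, 16, 10, 12, 14, 16]
female = [0, 0, 0, 0, 1, 1, 1, 1]
wage = [5 + 2 * e - 3 * f for e, f in zip(educ, female)]

# Columns: intercept, female dummy, education
X = [[1, f, e] for f, e in zip(female, educ)]
beta0, delta0, beta1 = ols(X, wage)

intercept_non_female = beta0            # base (omitted) group
intercept_female = beta0 + delta0       # shifted by delta naught

print(round(beta0, 6), round(delta0, 6), round(beta1, 6))
print(round(intercept_non_female, 6), round(intercept_female, 6))
```

Both groups share the same slope on education here; only the intercept differs, which is exactly the picture of two parallel lines.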
164 00:10:50,520 --> 00:10:54,373 So holding education constant, 165 00:10:57,270 --> 00:10:58,863 in this case, 166 00:10:59,920 --> 00:11:03,560 the effect of female is negative. 167 00:11:03,560 --> 00:11:07,590 So we would predict on average, 168 00:11:07,590 --> 00:11:10,920 given an amount of education, 169 00:11:10,920 --> 00:11:12,410 that a female's wage 170 00:11:12,410 --> 00:11:16,053 would be delta naught less than a non-female's wage. 171 00:11:20,270 --> 00:11:23,660 You can of course add many other regressors. 172 00:11:23,660 --> 00:11:27,790 Not only education, but you might have experience, 173 00:11:27,790 --> 00:11:30,703 how long have you been in the workforce in general, 174 00:11:32,626 --> 00:11:33,459 and you'll have 175 00:11:33,459 --> 00:11:37,270 how long they have been with this particular company. 176 00:11:42,063 --> 00:11:43,680 You could add a bunch of others. 177 00:11:43,680 --> 00:11:47,143 But how would you test for gender discrimination? 178 00:11:49,310 --> 00:11:52,060 Basically, as it says up here, 179 00:11:52,060 --> 00:11:56,970 it would be a simple T-test on our delta naught. 180 00:11:56,970 --> 00:12:01,150 And I will let you ponder what would be the advantages 181 00:12:01,150 --> 00:12:05,210 or disadvantages of a one or a two-tailed test. 182 00:12:05,210 --> 00:12:07,060 So think back to last week, 183 00:12:07,060 --> 00:12:12,060 what would be the advantages to a one or two-tailed test, 184 00:12:12,640 --> 00:12:15,263 and we will discuss it in class. 185 00:12:21,860 --> 00:12:24,370 But basically, what this analysis does is, 186 00:12:24,370 --> 00:12:26,740 holding all else equal, 187 00:12:26,740 --> 00:12:30,623 education, experience, et cetera, 188 00:12:31,660 --> 00:12:34,560 do females earn lower wages? 189 00:12:34,560 --> 00:12:39,560 You could think about other examples that you might use. 190 00:12:39,780 --> 00:12:42,540 Do Vermont students do better in class? 
191 00:12:42,540 --> 00:12:44,330 That's one we thought of. 192 00:12:44,330 --> 00:12:49,330 Does owning a computer improve your GPA? 193 00:12:51,210 --> 00:12:54,420 And I think you could think of a lot of other examples 194 00:12:54,420 --> 00:12:58,190 where you would have a dummy variable 195 00:12:58,190 --> 00:13:01,883 just to test some kind of a policy. 196 00:13:03,270 --> 00:13:06,910 And in some cases it might make sense, 197 00:13:06,910 --> 00:13:11,210 such as give half the students a computer 198 00:13:11,210 --> 00:13:13,930 and not give one to the other half. 199 00:13:13,930 --> 00:13:16,490 Then you could see the omitted group, 200 00:13:16,490 --> 00:13:18,570 those that are a zero, 201 00:13:18,570 --> 00:13:23,480 who didn't get a computer, as a control group. 202 00:13:23,480 --> 00:13:28,480 And the ones that got a computer would be the experimental 203 00:13:29,880 --> 00:13:31,423 or the treatment group. 204 00:13:36,900 --> 00:13:38,160 There may be times 205 00:13:38,160 --> 00:13:42,763 when you want to include more than one dummy variable. 206 00:13:46,220 --> 00:13:47,550 For example, 207 00:13:47,550 --> 00:13:50,550 does marital status have an effect? 208 00:13:50,550 --> 00:13:55,550 So that would be holding your gender constant 209 00:13:55,930 --> 00:13:58,953 and all the other factors, education, 210 00:14:00,850 --> 00:14:04,893 does your marital status impact your wage? 211 00:14:06,780 --> 00:14:08,950 One of the things that you could do, 212 00:14:08,950 --> 00:14:11,510 and the sort of simplest thing, 213 00:14:11,510 --> 00:14:14,930 would just be to include one that says, 214 00:14:14,930 --> 00:14:19,310 say, married, a dummy variable married. 215 00:14:19,310 --> 00:14:23,113 One if you're married, zero if you're not married. 216 00:14:24,680 --> 00:14:25,650 The thing is, 217 00:14:25,650 --> 00:14:30,297 this assumes that the effect of marriage on females 218 00:14:33,020 --> 00:14:35,763 and non-females is the same. 
219 00:14:37,100 --> 00:14:42,100 And it's been my experience, just say maybe anecdotally, 220 00:14:42,630 --> 00:14:44,620 that, that is not true. 221 00:14:44,620 --> 00:14:49,620 That marital status is not the same for men and women 222 00:15:00,270 --> 00:15:01,750 or other genders. 223 00:15:01,750 --> 00:15:03,380 That in many cases, 224 00:15:03,380 --> 00:15:08,380 since it's still sort of a traditional structure 225 00:15:08,430 --> 00:15:09,263 in many ways, 226 00:15:09,263 --> 00:15:14,150 that women play a larger role in care and so forth, 227 00:15:16,030 --> 00:15:21,030 that the effect of being female and married 228 00:15:23,950 --> 00:15:28,430 is different than the effect of being male or non-female 229 00:15:28,430 --> 00:15:29,263 and married. 230 00:15:30,700 --> 00:15:33,450 So what we could do here 231 00:15:33,450 --> 00:15:38,450 is create a dummy with each of, 232 00:15:42,530 --> 00:15:44,000 so there's four of them 233 00:15:44,000 --> 00:15:49,000 and create a dummy variable for three out of the four. 234 00:15:50,330 --> 00:15:53,820 So we would omit single non-female. 235 00:15:53,820 --> 00:15:56,520 So we would have married non-female, 236 00:15:56,520 --> 00:15:59,210 single female, married female. 237 00:15:59,210 --> 00:16:03,320 And then there would be three groups that we would look at, 238 00:16:05,350 --> 00:16:09,750 and this would create a different intercept 239 00:16:09,750 --> 00:16:12,350 for each of the three groups 240 00:16:12,350 --> 00:16:13,480 with beta naught 241 00:16:13,480 --> 00:16:17,053 being the intercept for the single non-female. 242 00:16:23,360 --> 00:16:28,360 That tells you something about how to deal with nominal, 243 00:16:28,700 --> 00:16:33,230 where you could say, are you this thing or are you not? 244 00:16:33,230 --> 00:16:34,710 But what about ordinal? 245 00:16:34,710 --> 00:16:38,000 So a five point scale, how often do you? 
246 00:16:38,000 --> 00:16:40,890 Or a five point scale of how likely are you? 247 00:16:40,890 --> 00:16:43,560 Or how satisfied are you? 248 00:16:43,560 --> 00:16:45,973 Or how much do you agree with this? 249 00:16:47,950 --> 00:16:52,710 Thinking back to the market research, 250 00:16:52,710 --> 00:16:57,710 many psychographics are probably best captured by a scale. 251 00:16:59,690 --> 00:17:04,690 So say that you ask a five point scale on your survey, 252 00:17:06,070 --> 00:17:07,180 you have data 253 00:17:07,180 --> 00:17:12,180 where folks write down some number one through five, 254 00:17:13,430 --> 00:17:16,900 such as here, one equals never, five equals always, 255 00:17:16,900 --> 00:17:18,963 how do we deal with these data? 256 00:17:24,690 --> 00:17:26,100 The most obvious answer 257 00:17:26,100 --> 00:17:30,510 would be to create up to four dummies. 258 00:17:30,510 --> 00:17:35,510 And in this case, you would create four of them. 259 00:17:36,670 --> 00:17:40,730 Those who say occasionally, sometimes, often, or always. 260 00:17:40,730 --> 00:17:45,730 So they would get a one if they answered that, 261 00:17:46,140 --> 00:17:48,790 and zero for everything else. 262 00:17:48,790 --> 00:17:52,890 And then never is the omitted group. 263 00:17:52,890 --> 00:17:56,280 So the folks who said never, 264 00:17:56,280 --> 00:18:00,170 their y-intercept is just beta naught, 265 00:18:00,170 --> 00:18:05,170 and those who say always, say, 266 00:18:05,610 --> 00:18:07,680 their y-intercept is the beta naught 267 00:18:07,680 --> 00:18:12,680 plus the delta for the always do variable. 268 00:18:20,650 --> 00:18:25,650 What I most often do is to find a cutoff that makes sense, 269 00:18:27,100 --> 00:18:32,100 that is logical and creates groups of roughly equal size. 
270 00:18:35,460 --> 00:18:40,460 So find a place where about half are above this 271 00:18:40,730 --> 00:18:45,730 and half are below this and call it high or low, 272 00:18:45,920 --> 00:18:49,660 and just create one dummy. 273 00:18:49,660 --> 00:18:54,660 And the advantage of this is it saves degrees of freedom. 274 00:18:55,630 --> 00:18:57,843 So, you know, 275 00:19:01,090 --> 00:19:05,300 if about half say three or less, 276 00:19:05,300 --> 00:19:09,170 and then another half say four or five, 277 00:19:09,170 --> 00:19:10,870 then you could just do it like that 278 00:19:10,870 --> 00:19:14,600 and call it high frequency 279 00:19:14,600 --> 00:19:19,600 and then one would be those that said four or five 280 00:19:19,640 --> 00:19:20,740 on the scale. 281 00:19:20,740 --> 00:19:24,700 Zero would be those who answered one through three. 282 00:19:24,700 --> 00:19:29,700 But the key thing is to do it so that it's logical, 283 00:19:31,030 --> 00:19:32,400 so that you can kinda say, 284 00:19:32,400 --> 00:19:35,550 oh, well, these folks are high and these folks are low, 285 00:19:35,550 --> 00:19:40,550 as well as having about equal numbers in each group. 286 00:19:41,960 --> 00:19:46,960 The problem of having them where say 95% are ones 287 00:19:48,610 --> 00:19:52,930 and only 5% of the sample are zeros, 288 00:19:52,930 --> 00:19:56,210 is you basically have a long column of ones 289 00:19:56,210 --> 00:20:01,210 with hardly any zeros in that. 290 00:20:01,760 --> 00:20:04,020 There's not a lot of information there. 291 00:20:04,020 --> 00:20:05,400 Almost everyone's a one. 292 00:20:05,400 --> 00:20:10,400 It really doesn't break out the groups in any way, 293 00:20:10,810 --> 00:20:13,310 plus a long string of ones 294 00:20:13,310 --> 00:20:17,390 or a long string of zeros in the data column 295 00:20:17,390 --> 00:20:22,390 is going to be highly collinear with your intercept. 
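Both ways of handling a five point scale can be sketched like this. The ten responses are invented, and the ordering of the middle labels (occasionally, sometimes, often) is an assumption; only one equals never and five equals always come from the scale described above.

```python
# Hypothetical answers on a 1-5 frequency scale
responses = [1, 2, 3, 4, 5, 3, 4, 5, 2, 4]

# Assumed labels for the scale points (only 1 and 5 are given in the lecture)
labels = {1: "never", 2: "occasionally", 3: "sometimes", 4: "often", 5: "always"}

# Option 1: four dummies, with "never" (scale point 1) as the omitted group.
dummies = {
    labels[level]: [1 if r == level else 0 for r in responses]
    for level in (2, 3, 4, 5)
}

# Option 2: one dummy with a logical cutoff, 1 = answered four or five.
high_frequency = [1 if r >= 4 else 0 for r in responses]

# Check the split is roughly balanced before relying on it.
share_high = sum(high_frequency) / len(high_frequency)

print(dummies)
print(high_frequency, share_high)
```

Option 2 uses one regressor instead of four, which is the degrees-of-freedom saving mentioned above; the balance check guards against a column that is nearly all ones or all zeros.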
296 00:20:23,260 --> 00:20:24,093 So that's why 297 00:20:24,093 --> 00:20:28,230 you'll always want to create groups of about equal numbers. 298 00:20:29,410 --> 00:20:34,410 And it may also make sense to create more than one dummy 299 00:20:34,630 --> 00:20:37,470 to break it up into high, medium, low, 300 00:20:37,470 --> 00:20:39,593 or something like that. 301 00:20:40,460 --> 00:20:43,230 But again, you want it to be logical 302 00:20:43,230 --> 00:20:47,093 and have about equal numbers in each group. 303 00:20:50,210 --> 00:20:51,043 So far, 304 00:20:51,043 --> 00:20:55,980 we've seen how a dummy variable can change the intercept 305 00:20:57,940 --> 00:21:01,550 and we can have a number of dummies 306 00:21:01,550 --> 00:21:03,870 and even interact them. 307 00:21:03,870 --> 00:21:08,870 And that will change the intercept of the various groups 308 00:21:09,050 --> 00:21:12,630 depending on if you're a one or a zero 309 00:21:12,630 --> 00:21:17,630 for the various regressors that we have. 310 00:21:18,740 --> 00:21:20,560 But it's also possible 311 00:21:20,560 --> 00:21:25,560 to interact a dummy with a continuous variable, 312 00:21:26,120 --> 00:21:30,980 in which case it will change the slope of a line. 313 00:21:30,980 --> 00:21:33,430 So let's say, continuing on, 314 00:21:33,430 --> 00:21:38,430 we look at the effect of education and being female on wage, 315 00:21:39,460 --> 00:21:41,460 here you see 316 00:21:41,460 --> 00:21:46,170 that we can interact them. 317 00:21:46,170 --> 00:21:47,890 We would, in SPSS, 318 00:21:47,890 --> 00:21:52,890 compute a variable female times education 319 00:21:56,280 --> 00:21:58,930 and then run this. 320 00:21:58,930 --> 00:22:00,700 And in this case, 321 00:22:00,700 --> 00:22:05,700 you can see that not only are there different intercepts 322 00:22:09,640 --> 00:22:11,470 for each group, 323 00:22:11,470 --> 00:22:14,710 but the slope of the line differs too. 
324 00:22:14,710 --> 00:22:15,543 So you can see 325 00:22:15,543 --> 00:22:17,523 the intercept here 326 00:22:26,010 --> 00:22:27,060 in the graph, 327 00:22:27,060 --> 00:22:32,060 it's clear that when this D equals one, 328 00:22:32,760 --> 00:22:33,990 it has an intercept shift. 329 00:22:33,990 --> 00:22:37,653 And that is what we have seen so far. 330 00:22:41,480 --> 00:22:46,480 A closer look at this graph tells you what is going on here. 331 00:22:48,370 --> 00:22:52,483 So let's say that, 332 00:22:55,300 --> 00:22:58,160 again, we're looking at the effect on wages, 333 00:22:58,160 --> 00:23:01,080 but let's say that in this world 334 00:23:01,080 --> 00:23:04,160 being female is an advantage. 335 00:23:04,160 --> 00:23:09,160 So this beta two here 336 00:23:10,750 --> 00:23:13,900 is the coefficient on the dummy variable for female. 337 00:23:13,900 --> 00:23:18,900 So you can see females have an advantage on the y-intercept. 338 00:23:19,030 --> 00:23:23,983 And not only that, but when you interact it with the X 339 00:23:28,820 --> 00:23:31,900 such as education or experience, 340 00:23:31,900 --> 00:23:35,600 these lines diverge, so that the slope, 341 00:23:35,600 --> 00:23:40,600 the effect of X on Y 342 00:23:42,280 --> 00:23:47,280 if you're in the D equals zero group is just beta one. 343 00:23:49,290 --> 00:23:51,033 And if it is, 344 00:23:54,720 --> 00:23:58,570 if you're in the D equals one group 345 00:23:58,570 --> 00:24:03,240 then the slope is beta one plus beta three. 346 00:24:03,240 --> 00:24:08,240 So in this case, this group not only starts off better, 347 00:24:10,040 --> 00:24:15,040 but the returns to education are even greater. 348 00:24:16,050 --> 00:24:19,340 So you can see it's both a change in the intercept 349 00:24:19,340 --> 00:24:21,920 and a change in the slope. 
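In symbols, the interaction model on the graph can be written roughly as below. The beta numbering follows what is said above (beta one on X, beta two on the dummy, beta three on the interaction); the error term u is the usual addition and is assumed here.

```latex
\[
\begin{aligned}
y &= \beta_0 + \beta_1 x + \beta_2 D + \beta_3\,(D \cdot x) + u \\[4pt]
D = 0:&\quad E(y \mid x) = \beta_0 + \beta_1 x \\
D = 1:&\quad E(y \mid x) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,x
\end{aligned}
\]
```

So beta two moves the intercept and beta three moves the slope; if beta three is zero the two lines are parallel and we are back to a pure intercept shift.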
350 00:24:21,920 --> 00:24:24,480 And I will run an example in class 351 00:24:24,480 --> 00:24:28,333 and put this on the board so you can see it more clearly. 352 00:24:32,790 --> 00:24:36,590 You may think that your dummy variable 353 00:24:36,590 --> 00:24:40,673 has an effect on all of your regressors. 354 00:24:42,330 --> 00:24:46,600 So in this example, the dummy of female 355 00:24:47,650 --> 00:24:52,560 might not only have an effect on returns to education 356 00:24:52,560 --> 00:24:57,560 but returns to experience and age and other things. 357 00:24:59,390 --> 00:25:04,213 So in theory, you could have an interaction for each one 358 00:25:05,100 --> 00:25:09,190 but this is time consuming and it also really 359 00:25:09,190 --> 00:25:12,023 starts to eat into your degrees of freedom. 360 00:25:13,180 --> 00:25:16,630 So what you can do is split the sample 361 00:25:17,590 --> 00:25:22,460 and use an F-test, a special kind of F-test 362 00:25:22,460 --> 00:25:24,340 called a Chow test, 363 00:25:24,340 --> 00:25:29,340 to see are the two groups really different from each other? 364 00:25:32,060 --> 00:25:35,793 Females and non-females, 365 00:25:36,720 --> 00:25:40,670 are their betas the same or not? 366 00:25:40,670 --> 00:25:45,670 So in this case, you would run three regressions. 367 00:25:47,110 --> 00:25:50,600 First with all of the observations pooled, 368 00:25:50,600 --> 00:25:53,030 next, only the females, 369 00:25:53,030 --> 00:25:57,730 and then in the third case, only the non-females. 370 00:25:57,730 --> 00:26:00,510 And you wanna look at the SSR, 371 00:26:00,510 --> 00:26:05,463 the sum of squared residuals. 372 00:26:11,864 --> 00:26:14,260 So you'll run these three regressions 373 00:26:14,260 --> 00:26:17,290 and you can split the sample. 374 00:26:17,290 --> 00:26:19,490 That's how you can do it in SPSS. 375 00:26:19,490 --> 00:26:21,290 I'll show you how to do that. 376 00:26:21,290 --> 00:26:24,120 And you get three of them. 
377 00:26:24,120 --> 00:26:29,120 And you look at the SSR and you plug it into this formula. 378 00:26:29,750 --> 00:26:31,680 So you can see 379 00:26:31,680 --> 00:26:36,680 it is the relative change in explanatory power 380 00:26:39,170 --> 00:26:43,390 by running it as a group 381 00:26:43,390 --> 00:26:48,390 versus running each group individually. 382 00:26:53,060 --> 00:26:57,250 Note that the pooled SSR will always be greater 383 00:26:57,250 --> 00:26:59,460 because you're gonna get a better fit 384 00:26:59,460 --> 00:27:02,840 if you give each group their own betas. 385 00:27:02,840 --> 00:27:05,440 But when you sort of adjust 386 00:27:05,440 --> 00:27:08,083 for the change in degrees of freedom, 387 00:27:10,140 --> 00:27:11,950 what happens here? 388 00:27:11,950 --> 00:27:16,080 So you do that and you get this test stat, 389 00:27:16,080 --> 00:27:20,783 and you compare it with an F distribution 390 00:27:22,650 --> 00:27:25,530 with K plus one and N minus two K 391 00:27:25,530 --> 00:27:27,593 minus two degrees of freedom. 392 00:27:28,930 --> 00:27:33,930 And the null is that the betas across the groups are equal. 393 00:27:37,480 --> 00:27:42,480 So you can see that if the sums of squared residuals 394 00:27:45,460 --> 00:27:47,520 are about the same, 395 00:27:47,520 --> 00:27:52,100 that is, the pooled minus the sum of the other two, 396 00:27:52,100 --> 00:27:54,630 if they're really close, 397 00:27:54,630 --> 00:27:59,500 then their betas are very much alike, 398 00:27:59,500 --> 00:28:04,453 the numerator of your F stat is very small. 399 00:28:05,300 --> 00:28:10,300 And we would fail to reject that null, 400 00:28:11,220 --> 00:28:15,220 and we would think that the betas are more 401 00:28:15,220 --> 00:28:16,270 or less the same. 
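As a sketch of the arithmetic, the Chow statistic can be computed from the three SSRs like this. The function name `chow_f` and the example numbers are made up for illustration; the formula is the standard SSR form, with K plus one numerator and N minus two K minus two denominator degrees of freedom as stated above.

```python
def chow_f(ssr_pooled, ssr_group1, ssr_group2, n, k):
    """Chow F statistic from three regressions.

    ssr_pooled: SSR from the regression on the full sample
    ssr_group1, ssr_group2: SSRs from the two split-sample regressions
    n: total number of observations
    k: number of regressors (not counting the intercept)
    """
    ssr_split = ssr_group1 + ssr_group2
    df_num = k + 1                 # restrictions: intercept plus k slopes
    df_den = n - 2 * (k + 1)       # same as n - 2k - 2
    f_stat = ((ssr_pooled - ssr_split) / df_num) / (ssr_split / df_den)
    return f_stat, df_num, df_den

# Invented numbers purely to show the calculation
f_stat, df_num, df_den = chow_f(
    ssr_pooled=120.0, ssr_group1=50.0, ssr_group2=50.0, n=106, k=2
)
print(f_stat, df_num, df_den)
```

The resulting value is then compared with a critical value from the F distribution with those two degrees of freedom; a pooled SSR close to the sum of the split SSRs gives a numerator near zero and a small F.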
402 00:28:16,270 --> 00:28:17,890 But if we see a big change, 403 00:28:17,890 --> 00:28:22,890 if separate betas for the two groups are a much better fit, 404 00:28:24,680 --> 00:28:26,680 then we're gonna see a big F stat 405 00:28:26,680 --> 00:28:30,010 and a small P, and we will reject the null 406 00:28:30,010 --> 00:28:32,400 and say that these two groups 407 00:28:32,400 --> 00:28:35,893 do not experience the world in the same way. 408 00:28:37,420 --> 00:28:39,513 And here is just a better, 409 00:28:41,000 --> 00:28:42,130 (Instructor inhales) 410 00:28:42,130 --> 00:28:45,910 (Instructor exhales) a more neatly laid out way 411 00:28:45,910 --> 00:28:50,653 to express this mathematically. 412 00:28:53,250 --> 00:28:55,030 A few other asides about a Chow test. 413 00:28:55,030 --> 00:28:58,830 You can't use the R squared form of the F-test 414 00:28:58,830 --> 00:29:01,233 that we learned about. 415 00:29:02,490 --> 00:29:06,360 And it also assumes homoscedasticity. 416 00:29:06,360 --> 00:29:11,360 So you would need to test for homoscedasticity 417 00:29:12,680 --> 00:29:16,460 and make sure that is the case. 418 00:29:16,460 --> 00:29:18,763 And we'll learn how to do that soon. 419 00:29:20,670 --> 00:29:23,290 This sort of Chow test 420 00:29:23,290 --> 00:29:28,290 is also used in sort of regime switching. 421 00:29:29,000 --> 00:29:31,190 So if there's some event 422 00:29:31,190 --> 00:29:36,060 that you think really sort of changes the world. 423 00:29:36,060 --> 00:29:37,963 So you might wanna look at, 424 00:29:38,810 --> 00:29:43,610 the years of the Trump administration on immigration 425 00:29:43,610 --> 00:29:46,150 or on trade. 426 00:29:46,150 --> 00:29:51,140 Because President Trump, 427 00:29:51,140 --> 00:29:54,890 greatly changed how we go about our business 428 00:29:54,890 --> 00:29:56,410 in these things. 
429 00:29:56,410 --> 00:29:57,990 We could see like, 430 00:29:57,990 --> 00:30:02,990 are the betas in sort of before and after Trump, 431 00:30:03,240 --> 00:30:07,763 or before and during Trump the same? 432 00:30:08,640 --> 00:30:13,633 Or so was this like really a regime change? 433 00:30:15,600 --> 00:30:18,050 Did it like fundamentally change 434 00:30:18,050 --> 00:30:19,930 the nature of the model, 435 00:30:19,930 --> 00:30:22,963 the nature of the betas or not? 436 00:30:27,400 --> 00:30:30,693 The last thing that I want to cover here is, 437 00:30:32,200 --> 00:30:37,200 how do we deal with it when our dependent variable is a dummy? 438 00:30:37,220 --> 00:30:39,600 So that would be one. 439 00:30:39,600 --> 00:30:44,010 So our dependent, our Y, is a one or a zero. 440 00:30:44,010 --> 00:30:48,140 Are you in a group or not? 441 00:30:48,140 --> 00:30:50,770 Did you do some action or not? 442 00:30:50,770 --> 00:30:53,170 Do you own a thing or not? 443 00:30:53,170 --> 00:30:54,420 Do you own a bike? 444 00:30:54,420 --> 00:30:56,060 Do you smoke? 445 00:30:56,060 --> 00:30:57,480 Do you pass this class? 446 00:30:57,480 --> 00:30:59,130 I hope everybody's a one, of course. 447 00:30:59,130 --> 00:31:03,253 But sometimes we may want to model 448 00:31:04,424 --> 00:31:06,623 a binary dependent. 449 00:31:07,910 --> 00:31:11,290 And when we have the model that we've worked with, 450 00:31:11,290 --> 00:31:14,150 this Y equals beta naught plus beta one X one 451 00:31:14,150 --> 00:31:17,333 plus dot dot dot plus beta K XK, 452 00:31:20,530 --> 00:31:23,500 the beta kind of doesn't really make sense 453 00:31:23,500 --> 00:31:28,500 as how Y changes as X changes, 454 00:31:28,680 --> 00:31:33,680 because Y can only have the values one or zero. 455 00:31:34,270 --> 00:31:37,780 So it either changes or it doesn't. 456 00:31:37,780 --> 00:31:40,863 So what we do is we look at the probability. 
457 00:31:51,140 --> 00:31:52,690 So here, 458 00:31:52,690 --> 00:31:57,473 we have the probability of Y equals one 459 00:31:59,540 --> 00:32:02,163 given X is the expected value of Y. 460 00:32:05,860 --> 00:32:06,920 Given X. 461 00:32:06,920 --> 00:32:09,470 So you could think about it as, 462 00:32:09,470 --> 00:32:12,053 if you put everybody's Xs in, 463 00:32:13,440 --> 00:32:16,090 and then calculate Y, 464 00:32:16,090 --> 00:32:19,430 that would be the probability that they are a one. 465 00:32:19,430 --> 00:32:22,330 What we would guess, what we would forecast, 466 00:32:22,330 --> 00:32:26,440 if someone gave all of these Xs on a survey, 467 00:32:26,440 --> 00:32:31,440 what is the probability that they would be a Y equals one? 468 00:32:31,700 --> 00:32:35,183 What's the probability that they own a bike? 469 00:32:36,150 --> 00:32:39,310 And we can think of beta J 470 00:32:39,310 --> 00:32:44,310 as the change in the probability that you would be a Y equals one 471 00:32:45,960 --> 00:32:47,523 holding all else equal. 472 00:32:55,500 --> 00:32:59,100 This model is good because it's very simple 473 00:32:59,100 --> 00:33:00,810 and straightforward, 474 00:33:00,810 --> 00:33:05,810 but you could get Y hats that would be less than zero 475 00:33:06,750 --> 00:33:08,400 or greater than one, 476 00:33:08,400 --> 00:33:10,640 which of course makes no sense, 477 00:33:10,640 --> 00:33:14,170 because you can't have negative probability 478 00:33:14,170 --> 00:33:17,320 nor probability greater than one. 479 00:33:17,320 --> 00:33:20,004 These are very often heteroscedastic. 480 00:33:20,004 --> 00:33:21,280 So you would need to deal with that, 481 00:33:21,280 --> 00:33:22,890 which we will learn how to do. 482 00:33:22,890 --> 00:33:26,470 And in truth, there are better ways to deal with this. 483 00:33:26,470 --> 00:33:28,900 But I just wanted to sort of show you this. 484 00:33:28,900 --> 00:33:30,890 This is the way. 
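A tiny sketch shows how this kind of linear model can forecast impossible probabilities. The intercept, the slope, and the single regressor x are all hypothetical numbers chosen for illustration, not estimates from any data.

```python
# Hypothetical fitted linear probability model:
#   P(y = 1 | x) = b0 + b1 * x
b0 = -0.3    # assumed intercept
b1 = 0.15    # assumed change in probability per one-unit change in x

def predicted_probability(x):
    """Fitted value of the linear probability model at x."""
    return b0 + b1 * x

# Three hypothetical respondents with low, middle, and high x
xs = [1, 5, 10]
y_hats = [predicted_probability(x) for x in xs]

# Flag forecasts that fall outside the valid range for a probability
out_of_range = [not (0.0 <= p <= 1.0) for p in y_hats]

for x, p, bad in zip(xs, y_hats, out_of_range):
    print(x, round(p, 2), "outside [0, 1]" if bad else "ok")
```

The middle respondent gets a sensible forecast, while the low and high ones land below zero and above one, which is the weakness described above.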
485 00:33:30,890 --> 00:33:35,130 It can work well in certain cases, 486 00:33:35,130 --> 00:33:39,280 but we're gonna learn a better way to deal with it 487 00:33:42,320 --> 00:33:43,903 in a few weeks. 488 00:33:46,420 --> 00:33:48,343 That's the end, and thank you.