- [Instructor] Hi everyone. This week we're going to be learning about instrumental variables and two-stage least squares. This is a way that we can deal with endogenous regressors. Last time, we had an endogeneity problem that we encountered with panel data, and we learned about two data transformations that you can use to account for it. Now we're going to learn a new method for dealing with endogeneity.

So far, we've either assumed, or transformed the data such that, the error term and each one of our K regressors are not related: no change in the regressors results in any change in the error term. So far, du/dx_k equals zero, and we assume that the covariance of u and each regressor equals zero. Now, what do we do when they are not equal to zero? When a change in the regressor does result in a change in the error term, that is, their covariance is not equal to zero? Basically, what we need is an instrument: another variable, one or more, which is itself exogenous, and which we put in place as an instrument for the endogenous one.

By way of introduction, again, this week we're going to learn another way to deal with endogeneity. And just as an aside, a lot of what we're doing in the work after the midterm exam is looking at ways to eliminate endogeneity and get an unbiased estimator, but at a cost in efficiency. So keep that in mind as an overarching theme as we go.

The three main causes of endogeneity are omitted variables, where you forgot to ask a certain question, or you were not able to ask a certain question on a survey; measurement error, where what you recorded is a flawed measure of the value of a regressor, and I'm going to talk about an example from my own work; and simultaneity, which comes up with simultaneous equations and is going to be the theme of next week.
Last time, we learned how to difference away, or time-demean away, the endogeneity. One of the shortcomings of this is that any variable that doesn't change over time, especially dummy variables for gender or race or things like that, drops out, and its effect cannot be estimated. So if you think that race is an important factor, it will be lost, because presumably race, or for most people gender, doesn't change over time.

So this week we're looking at the method of instrumental variables as a way to deal with omitted variables that are correlated with the regressors and with the error term. Again, what we're learning this week is what to do in the presence of omitted variables and what to do if there was measurement error in your variables. And it also sets the stage for simultaneous equations, which we will learn next week.

So far, we have learned three ways to deal with omitted variables. First is just to ignore it, which we know will result in bias. Next is to find and use a proxy, that is, find another variable that we can put in its place, which may not always be possible, because if you forgot to ask some question on a survey, it's likely that you also forgot to ask the proxy question, or there simply may not be one. And third, as we learned last time, you can first-difference or time-demean (fixed effects) it away, so that it subtracts out of the equation.

So let's look at an example of an omitted variable. Let's say that we are modeling, one of my favorite things, local food expenditure: how much do people spend on local food, say, per month? So there's this model here where local food expenditure is a function of income, since folks who have higher income would probably be more able to afford food with the attributes that they want, as well as psychographics.
So their sort of attitudes and beliefs, this attribute of being a foodie, being a local foodie, thinking that's important. But there is no simple, straightforward variable which measures psychographics. There's no psychographic-o-meter we can use to measure where someone lies on that. And so if we don't account for it, if we don't think about it at all, it will go into the error term. So psychographics is then part of the error term. And I think there's good reason to think that it would be correlated with income: higher-income folks have the means to care about, think about, and act upon expressing their values in the marketplace, voting with their dollars and all of that, whereas for lower-income folks, I would think, that simply would not be an important thing, and they would have another set of psychographics, of attitudes and beliefs, about what kind of food is important. So by omitting it, it would be in the error term, and for the reasons I just said, it's correlated with income. And as we learned long ago, omitted variables create bias.

One option is we could use a proxy. So if we knew whether they were a member of a co-op, we could just put that in place of psychographics, thinking there's probably some overlap there. Or we can find an instrument for income which is likely not correlated with psychographics. So that's the key here: we have to find a new variable for income that is not correlated with psychographics, and that might be something like hours worked per week. While that does have some overlap with income, it is less likely to have a lot of overlap with psychographics. So that might be one way to address it, and we're going to look at some more specific cases here.

So let's say that we are dealing with a model like this. Back to our local food expenditure: our X1 is income, and we know that psychographics are important.
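As a sketch of the model being described, using symbols of my own rather than anything from the slides:

\[
y = \beta_0 + \beta_1 x_1 + u,
\]

where y is local food expenditure, x1 is income, and the unmeasured psychographics end up inside the error term u.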
So we can assume that the covariance of income, which is X1, and our error term does not equal zero. X1 is endogenous. If it were exogenous, we could use OLS and get an unbiased estimate. So what we need is an instrument, such as in the last slide where we talked about hours per week, which is related to X but has no relationship to U.

More precisely, an instrument must have these two attributes; there are two assumptions here. If Z is to be an instrumental variable, first, its covariance with U must equal zero: Z must be exogenous, it must have instrument exogeneity. And next, Z and X, the regressor, have to have some relationship: their covariance must not equal zero. Imagining a Venn diagram, Z and U don't touch, but Z and X do touch. In this way we're saying that it has instrument exogeneity and instrument relevance.

So laying it out here: the first attribute, or the first assumption, is that the covariance of Z and U equals zero. Z has this instrument exogeneity; it is exogenous, and so it will work. And next, Z must have instrument relevance: it has to have some sort of overlap with X on our Venn diagram. If these two are completely unrelated, it won't be an effective instrument.

Note that we cannot test whether Z and U have zero covariance, since U is not observed. So here's another case where we have to appeal to economic theory, introspection, common sense, and prior studies: we use our brains, look at past research and theory, and come up with an instrument that we can make a good case has this instrument exogeneity. We can, however, test for instrument relevance, that is, whether the covariance of Z and X is equal to zero or not.
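Restating the two requirements compactly, in the notation reconstructed from the description above:

\[
\mathrm{Cov}(z, u) = 0 \quad \text{(instrument exogeneity, not testable since } u \text{ is unobserved)}
\]
\[
\mathrm{Cov}(z, x) \neq 0 \quad \text{(instrument relevance, testable)}
\]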
Here, you would simply regress one on the other: take X, which is our endogenous regressor, make it the dependent variable, and regress it on Z, the instrument that we want to use. We put forth the null hypothesis that this Pi1 equals zero. We use a t-test, and we hope to reject the null. We hope that Z does have a significant effect on X, and therefore, if we have a big t-statistic and can reject our null, then we can use Z as an instrument.

Note that, as always, this sort of eliminating of bias comes at a cost in efficiency. So here is the theoretical formulation of the variance of the beta IV, the instrumental variables beta. Note that, once again, the more information there is in X, and that's this sigma squared X, how much information, what's the variance of X, since that's in the denominator, it makes the variance smaller; and as N increases, that also makes the variance smaller. And finally, there's this rho squared, which is how correlated X and Z are. So, not surprisingly, as this rho increases, meaning X and Z have a lot in common, the more they have in common, the bigger this rho squared, and since it is in the denominator, that makes the overall variance smaller. So here are three places where more information implies lower variance.

The way that we would actually estimate this, note we're doing the variance hat, how do we estimate the variance of our beta IV? In the numerator is sigma squared hat, which we learned way back is a function of the error terms. In the denominator are SSTx, which is familiar from before, that is, how spread out our X is, the total sum of squares of X, and finally, the R squared of X and Z.
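Reconstructing those expressions from the description above, in Wooldridge's notation (the exact symbols are assumed, not copied from the slides):

\[
x = \pi_0 + \pi_1 z + v, \qquad H_0\colon \pi_1 = 0 \quad \text{(relevance test, by t-test)}
\]
\[
\mathrm{Var}(\hat\beta_{1,\mathrm{IV}}) \approx \frac{\sigma^2}{n\,\sigma_x^2\,\rho_{x,z}^2},
\qquad
\widehat{\mathrm{Var}}(\hat\beta_{1,\mathrm{IV}}) = \frac{\hat\sigma^2}{SST_x \cdot R^2_{x,z}},
\]

where rho_{x,z} is the correlation between x and z, and R^2_{x,z} is the R-squared from regressing x on z.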
So here we want the R squared of X and Z to be big, meaning these really have a lot of overlap and Z does a good job of explaining X, because then we have a big R squared in the denominator, which makes the variance smaller.

Think about which one has a larger variance, OLS or IV, and why. Note that since the R squared is less than one, we're going to be dividing by a number less than one, which means multiplying by a number greater than one. But note that the higher the R squared, the lower the variance. The better the instrument, the more that Z is correlated with X, the lower the variance that we have. So again, if our R squared is small, if Z is a poor instrument for X, we're going to be dividing by a very small number less than one, which means that we're multiplying by a larger number greater than one, which means that it will inflate the variance. So again, less information, more variance: the less new information that Z provides about X, the higher the estimated variance of our beta IV.

Two basic definitions for this week's work are instrumental variables and two-stage least squares. We've already talked about what an instrumental variable is: that's one of these Z variables which can, in a sense, stand in for the endogenous regressor, and in this way provide for an unbiased estimate of our betas. Two-stage least squares is the technique by which we let this instrumental variable stand in for the endogenous one, and it allows for unbiased estimates of the betas.

So let's say that we have a model here where we have two regressors, Y2 and Z1. I am using the Wooldridge notation, where a Y that appears as a regressor is assumed to be endogenous, and we'll see in simultaneous equations why, while a Z is exogenous. So this is a structural equation.
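In that notation, the structural equation being described has the form (a reconstruction; the subscripts are assumed):

\[
y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1,
\]

with y2 endogenous and z1 exogenous.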
This is the model that we are interested in. Basically, what we want to know is beta one: what is the change in Y1 when we change Y2, much as before? But now we are concerned, or we've done a test and found, that Y2 is endogenous. We really want to know beta one, but we know that if we estimate it just through regular OLS, we will have a biased estimate of beta one. So we're going to learn how to purge the endogeneity and come up with an unbiased estimate for beta one.

We're assuming that Z1 is exogenous. And since we need to use an instrumental variable, we need another one. So we needed to have asked another question on our survey, Z2, that does not appear in the structural equation. We need an instrument that's not in this original model. So Z2 is our instrumental variable, and we will use it in two-stage least squares.

We do this by creating what's called a reduced form equation. A reduced form equation is where you have the endogenous variable on the left side and the exogenous variables on the right side. In this case, there are only two: Z1 from the original structural model, and Z2, the instrument that we were saving in our back pockets to use here in the reduced form. You can also have a Z3, Z4, as many as you wish, as long as they meet the two requirements: they are themselves exogenous, and at least one of them has to have some overlap with Y2 for this to work at all.
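A sketch of that reduced form in the same notation (again reconstructed, not copied from the slides):

\[
y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + v_2,
\]

and any additional instruments z3, z4, and so on simply add more pi terms on the right side.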
In the method of two-stage least squares, we literally run two regressions. Conceptually, we're starting off with the same structural model: we have Y1 on the left side and we have two regressors, Y2 and Z1. Note again that Y2 is an endogenous regressor, and what we really want to know, for our forecasting, for our model, for whatever we're doing here, is beta one. We want a good estimate of beta one: how does Y1 change as Y2 changes? But we know if we just do it through straight OLS, we will get a biased estimate.

Now, we have two exogenous variables, Z2 and Z3. Again, we asked them on the survey and we've been saving them; we hid them in our back pockets because we thought that we might need them later. And the best instrumental variable is the linear combination of Z1, Z2 and Z3 that does the best job of explaining Y2.

So the first thing that we actually do is run the reduced form equation, which I'll show you on the next slide. Here is our reduced form equation, where Y2 is now on the left side, and we run OLS on it as a function of Z1, Z2 and Z3. It must be that either Pi2 or Pi3 is not zero; they cannot jointly equal zero. So we run this regression, and then we do an F-test, where our null is that Pi2 equals Pi3 equals zero. And hopefully, when we run this F-test, we get a nice big F-stat, which allows us to reject the null that they are jointly equal to zero, and therefore they have instrument relevance and this will work.

So what you do is you run this reduced form equation and you save the predicted values, the Y2 hats. You save everybody's Y2 hat from this reduced form equation. When we do this in SPSS, I'll show you how, but you already know how to save Y hats. So you save these Y2 hats, and then we go back to the original structural equation and we put Y2 hat in place of Y2. In this way, we have just purged Y2 of the endogeneity, because a linear combination of variables that are all exogenous is itself exogenous.
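Here is a minimal sketch of those two stages done by hand in Python on simulated data. The variable names follow the slide notation and are hypothetical, statsmodels is assumed, and in practice you would let a dedicated two-stage least squares routine (for example, the one in SPSS) do this so the standard errors come out right.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Simulated data: c is an unobserved factor (think psychographics) that sits in
# both equations, which is what makes y2 endogenous in the structural equation.
c = rng.normal(size=n)
z1 = rng.normal(size=n)            # exogenous regressor in the structural equation
z2 = rng.normal(size=n)            # excluded instruments, "saved in our back pocket"
z3 = rng.normal(size=n)
y2 = 0.5 * z1 + z2 + z3 + c + rng.normal(size=n)
y1 = 1.0 + 2.0 * y2 + 0.5 * z1 + c + rng.normal(size=n)   # true beta1 = 2

# Stage 1 (reduced form): regress the endogenous regressor on ALL exogenous
# variables, z1 from the structural equation plus the instruments z2 and z3.
X1 = sm.add_constant(np.column_stack([z1, z2, z3]))
stage1 = sm.OLS(y2, X1).fit()

# Instrument relevance: F-test of H0: pi2 = pi3 = 0 (the z2 and z3 coefficients).
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
print(stage1.f_test(R))

# Save the fitted values y2_hat from the reduced form.
y2_hat = stage1.fittedvalues

# Stage 2: put y2_hat in place of y2 in the structural equation.
X2 = sm.add_constant(np.column_stack([y2_hat, z1]))
stage2 = sm.OLS(y1, X2).fit()
print(stage2.params)   # the coefficient on y2_hat should land near 2

# Caveat: standard errors from this "by hand" second stage are not the correct
# 2SLS standard errors; dedicated IV routines adjust for the generated regressor.
```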
So as long as Z1, Z2 and Z3 are themselves exogenous, that is, their covariance with the error term in the original structural equation equals zero, then a linear combination of them is also exogenous; since there's no error term in them, a linear combination of them will still be exogenous.

Note that the variance of our two-stage least squares beta will be greater than the variance of the OLS beta. Take a moment and think about why this might be so, and think about our old friend: more information means less variance. Why is there more information in beta OLS? Well, here's why, two very closely related facts. One is that Y2 hat has less variability than Y2: we're only keeping the part of Y2 which has overlap with our Zs. So we're losing a lot of information there; all the information in Y2 that, imagining our Venn diagram, is outside of or doesn't touch our Zs is lost. Plus, Y2 hat is going to be pretty highly collinear with our existing Zs. In both cases, less information leads to higher variance. So this is a classic case where we are sacrificing efficiency to get rid of bias. And this is, again, a trade-off we're going to be revisiting for most of the rest of the semester: when we purge or get rid of bias, it comes at the cost of a less efficient estimator.

There is an order condition for this whole thing to work, to be able to identify this Y2 hat: we need at least as many excluded exogenous variables, the instruments, as we have endogenous regressors. Remember, the instrumental variables do not appear in the structural equation; we saved them, we stuck them in our back pockets for when we would need them. And we need at least as many of those Zs that are not in the structural equation as we have Ys which are regressors.
So think about the number of Zs that we saved in our back pockets and the number of Ys that appear as regressors in our structural equation. If the number of Ys is greater than the number of Zs, the model is under-identified, and we cannot run it. So if we have two Ys as regressors, say Y2 and Y3, and we only saved one Z that's not in the structural equation, that's not enough; the model is not identified and it won't work. If the number of Zs we saved exactly equals the number of Ys which are regressors, it's just identified, and that's fine. And if there are more Zs than Ys, that's fine and actually good, because you're going to have more information going into these Y2 hats when we do the second stage of two-stage least squares. But you'll have to have at least the same number, if not more.

This method can also help us address measurement error. So let's say that you ran a regression: you asked three questions, Y, X1 and X2 as before, and the respondents were able to accurately answer Y and X2, but for some reason something went wrong with X1. The true value, X1 star, is not observed: either we didn't ask it at all, or when we asked it, we asked it in a way that was confusing or biased or led to strategic responses, but in any case, they did not answer truthfully. What we actually observe is X1, which we know is measured with error. So if we do a bunch of substitution, putting the observed X1 into the model in place of X1 star, you can see that the measurement error will appear in the error term, and that X1 is now endogenous. If we use straight OLS, we will come up with a biased estimate of beta one.

So I want to give you a concrete example of when this could have been used in my own work. Back when I worked in Michigan, we did the equivalent of the Vermonter Poll.
I forget exactly what it was called, but the university drew a random sample and asked a bunch of questions, researchers like me could buy questions on it, and we got the data. The subject of this project was farmers' markets. And we asked folks, going back to August of the previous year: how much did you spend at farmers' markets in August of whatever the year was, I think it was 2007, say? And when you added these all up and extrapolated them out, the respondents grossly overestimated: if that were actually how much households spend, times the number of households, it was a ridiculously large number. So due to something like social desirability bias or other factors, they really overestimated how much they spent.

Luckily, we also asked: how many times did you go to the farmers' market? So we were able to use this number of farmers' market trips as an instrument for how much they actually spent. I think most folks could pretty accurately recall, let's see, in August I went once, or twice, or three or four times, or whatever, rather than how much they spent, which we know was a gross overestimate.

Here, we have another variable related to the true X1 star; in this example, the number of trips. So as long as this Z1 is not correlated with the error term in the structural equation, we can run a reduced form regression where our observed X1 is on the left side and our Z1 is on the right side, save these X1 hats, and through two-stage least squares put them back into the structural equation. That way we can learn what beta one is: what is the effect of farmers' market expenditure on our Y?

There is, as you might imagine, a test for endogeneity, and it's a fairly straightforward one, I hope. We start with a structural equation where we have Y1 on the left side, and we have Y2, Z1 and Z2 as our regressors.
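In the same notation as before, that structural equation would be (a reconstruction):

\[
y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \beta_3 z_2 + u_1.
\]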
We suspect that Y2 is endogenous. And again, our reasoning would come from economic theory, introspection, previous studies, common sense; we have good reason to suspect that Y2 is endogenous. And we did save two potential instruments, Z3 and Z4, that are not in the structural equation. We kept them in our back pocket so that we could do a reduced form model for Y2.

And here is our reduced form equation for Y2. Note that it has Pis, which helps you know that it's a reduced form. We want to test whether V2, the error term from this equation, once we've netted out all of the exogenous variation, so once we've already purged Y2, is correlated with U1, the error term from the original structural equation.

Here, we run a regression. We estimate the reduced form, obtain the residuals, the V2 hats, and save them. So we go through and we save our residuals, which, again, you know how to do in SPSS, and I will show you how when we run through an exercise.

So we've saved our V2 hat, and now we add it as a new regressor to our original structural model, and we do a t-test on this Delta one. In many cases, it's best to use a heteroskedasticity-robust t-test, so we would use heteroskedasticity-robust standard errors to get our t-stat. The way that you would interpret this: our null hypothesis here is that Delta one equals zero. A big t-stat means that we can reject the null, and in this case, we hope that we fail to reject the null. If we fail to reject, that means that V2 hat has no additional explanatory power for Y1, which means that Y2 is exogenous, and we can just run OLS as before.
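A minimal sketch of that regression-based test in Python, again on simulated data with hypothetical variable names following the slide notation (statsmodels assumed):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

# Simulated data: c is an unobserved factor that makes y2 endogenous;
# z1, z2 are exogenous regressors, z3, z4 are the saved instruments.
c = rng.normal(size=n)
z1, z2, z3, z4 = (rng.normal(size=n) for _ in range(4))
y2 = 0.5 * z1 + 0.5 * z2 + z3 + z4 + c + rng.normal(size=n)
y1 = 1.0 + 2.0 * y2 + 0.5 * z1 - 0.5 * z2 + c + rng.normal(size=n)

# Step 1: reduced form for the suspect regressor y2 on ALL exogenous variables.
Xr = sm.add_constant(np.column_stack([z1, z2, z3, z4]))
reduced = sm.OLS(y2, Xr).fit()
v2_hat = reduced.resid                 # save the reduced-form residuals

# Step 2: add v2_hat to the structural equation and t-test its coefficient
# (delta1), using heteroskedasticity-robust standard errors.
Xs = sm.add_constant(np.column_stack([y2, z1, z2, v2_hat]))
structural = sm.OLS(y1, Xs).fit(cov_type="HC1")
print("t =", structural.tvalues[-1], "p =", structural.pvalues[-1])

# Reject H0: delta1 = 0  -> conclude y2 is endogenous, go the 2SLS route.
# Fail to reject         -> treat y2 as exogenous and use ordinary OLS.
```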
Whereas if we run the test and get a big t-stat, we reject the null, and then we conclude that Y2 is endogenous and we have to go the two-stage least squares route.

So that is the end of the slides. I hope you have a chance to look over these; please write down things that don't make sense, and I will spin through them, hopefully fairly quickly, on Wednesday, and then we'll do SPSS on Friday, and then we'll be done with IV and two-stage least squares.

I hope you are all well. I do enjoy seeing you on Microsoft Teams, but I really, really still miss seeing all of you in person. Take care of yourselves, take care of each other, check in on each other. Let me know if you need anything. And, you know, again, in my role as your professor, in my role as the coordinator of the MS program, or just in my role as someone who cares about you, let me know if you need anything, and I'll see all of your electrons on Wednesday. Thank you.