WEBVTT 1 00:00:02.790 --> 00:00:07.020 Hello and welcome to the video lecture on statistics, 2 00:00:07.020 --> 00:00:10.650 which is the next phase in how to do 3 00:00:10.650 --> 00:00:14.253 the analysis of quantitative data. 4 00:00:15.390 --> 00:00:19.920 And specifically, we're gonna look at how can you infer 5 00:00:20.850 --> 00:00:24.770 to either larger populations 6 00:00:27.330 --> 00:00:32.330 or how can you test if a relationship that you see 7 00:00:32.910 --> 00:00:37.910 is a real one or a fluke, for lack of a better word. 8 00:00:41.340 --> 00:00:46.340 So we're gonna look again at univariate, bivariate 9 00:00:49.200 --> 00:00:54.200 and regression and how we can apply statistical analysis 10 00:00:55.860 --> 00:00:59.790 and look for what we call statistical significance, 11 00:00:59.790 --> 00:01:03.540 which is a high degree of confidence 12 00:01:03.540 --> 00:01:07.803 that a relationship that we see is real. 13 00:01:11.700 --> 00:01:14.820 So again, as we learned last time, 14 00:01:14.820 --> 00:01:19.743 univariate analysis describes a single variable. 15 00:01:22.650 --> 00:01:24.780 It could be frequencies, 16 00:01:24.780 --> 00:01:29.643 it can be central tendency, like mean, et cetera, 17 00:01:32.340 --> 00:01:34.200 Bivariate and multivariate, 18 00:01:34.200 --> 00:01:38.043 it is often in the form of cross tabulations. 19 00:01:40.470 --> 00:01:43.860 And looking at the relationship of how 20 00:01:43.860 --> 00:01:48.860 the way your respondents answered one question, 21 00:01:51.150 --> 00:01:56.150 if there's like a pattern in how they answered another. 22 00:01:56.879 --> 00:02:01.413 And again, in many cases this is a, 23 00:02:02.370 --> 00:02:06.180 with the independent variable being 24 00:02:06.180 --> 00:02:09.840 a demographic like age or gender or race 25 00:02:09.840 --> 00:02:12.690 or income or something like that, 26 00:02:12.690 --> 00:02:16.323 and the dependent being some sort of attitude or behavior. 
27 00:02:18.390 --> 00:02:22.600 So one of the ways that we test 28 00:02:27.810 --> 00:02:31.170 whether the relationship is a real one, 29 00:02:31.170 --> 00:02:35.280 would it sort of hold up if we did it over and over again 30 00:02:35.280 --> 00:02:37.380 or was it just a fluke, 31 00:02:37.380 --> 00:02:41.193 is the idea of statistical significance. 32 00:02:45.750 --> 00:02:50.750 And this is a way of hypothesis testing 33 00:02:52.950 --> 00:02:57.063 in multivariate and bivariate analysis. 34 00:02:58.230 --> 00:03:02.580 So we always start with a null hypothesis, 35 00:03:02.580 --> 00:03:07.580 that variable A has no effect on variable B, 36 00:03:08.970 --> 00:03:13.970 that, just as we looked at before, age 37 00:03:15.720 --> 00:03:18.420 has no effect on support for gay marriage, 38 00:03:18.420 --> 00:03:21.783 education has no effect on income. 39 00:03:22.950 --> 00:03:27.570 And we express this as a null hypothesis 40 00:03:27.570 --> 00:03:32.570 and then we test whether that null is true or not. 41 00:03:32.850 --> 00:03:36.933 So in the above example, 42 00:03:37.890 --> 00:03:42.890 if the null is true, then education has no effect on income. 43 00:03:43.086 --> 00:03:47.160 Knowing someone's education gives you no information, 44 00:03:47.160 --> 00:03:49.770 doesn't help you sort of forecast 45 00:03:49.770 --> 00:03:52.230 what their income might be. 46 00:03:52.230 --> 00:03:54.900 And then in regression analysis, 47 00:03:54.900 --> 00:03:58.680 it's that one or more of your independents 48 00:04:03.133 --> 00:04:05.820 has no effect on your dependent, 49 00:04:05.820 --> 00:04:08.310 that X has no effect on Y, 50 00:04:08.310 --> 00:04:13.080 that the beta or B for that variable equals zero. 51 00:04:13.080 --> 00:04:17.356 That basically, the regression line is horizontal, 52 00:04:17.356 --> 00:04:22.263 that a change in X has no effect on Y.
53 00:04:27.840 --> 00:04:32.840 The key measure that we use here is the P value, 54 00:04:32.970 --> 00:04:37.533 and P is the probability of a Type I error, 55 00:04:39.330 --> 00:04:44.100 and a Type I error is a false positive test 56 00:04:44.100 --> 00:04:48.210 where you think there's something there but there isn't. 57 00:04:48.210 --> 00:04:52.533 You reject the null when the null is true. 58 00:04:56.070 --> 00:04:59.040 Again, it's the probability of thinking 59 00:04:59.040 --> 00:05:04.040 that there's a true relationship when really there is none. 60 00:05:04.500 --> 00:05:06.660 Usually in social science, 61 00:05:06.660 --> 00:05:11.660 we have three levels of significance: 62 00:05:12.192 --> 00:05:16.890 0.10, 0.05 and 0.01. 63 00:05:16.890 --> 00:05:20.340 And just to hone in, 64 00:05:20.340 --> 00:05:23.850 if the P value is 0.05, 65 00:05:23.850 --> 00:05:28.850 we're 95% sure that the relationship is real 66 00:05:29.640 --> 00:05:32.813 and there's a 5% chance of a Type I error. 67 00:05:36.990 --> 00:05:40.560 And when you run the statistics 68 00:05:40.560 --> 00:05:45.543 in a package like SPSS, it gives you the P value. 69 00:05:47.250 --> 00:05:50.490 So here is an example of what's a Type I error 70 00:05:50.490 --> 00:05:52.530 and a Type II error. 71 00:05:52.530 --> 00:05:55.650 So another way of thinking about it is, 72 00:05:55.650 --> 00:05:58.410 if you take a COVID test, 73 00:05:58.410 --> 00:06:02.160 the null hypothesis is that you don't have it 74 00:06:02.160 --> 00:06:06.900 and you get those two lines and it says you have it, 75 00:06:06.900 --> 00:06:10.020 but you actually don't, that's a Type I error. 76 00:06:10.020 --> 00:06:14.160 That means that it told you that 77 00:06:14.160 --> 00:06:17.410 you do have COVID when you don't, 78 00:06:18.480 --> 00:06:22.770 like the case on the left here 79 00:06:22.770 --> 00:06:27.543 where the doctor is telling this man, "You're pregnant."
80 00:06:31.680 --> 00:06:36.510 So we use these do what's called inferential statistics, 81 00:06:36.510 --> 00:06:41.310 which allows you to sort of infer or predict 82 00:06:41.310 --> 00:06:46.310 or sort of extend your findings to the larger population. 83 00:06:50.250 --> 00:06:52.770 In univariate analysis, 84 00:06:52.770 --> 00:06:55.830 it's in the form of those confidence levels 85 00:06:55.830 --> 00:06:59.340 and intervals that we learned about that, 86 00:06:59.340 --> 00:07:03.360 like the Vermonter Poll, we're 95% sure 87 00:07:03.360 --> 00:07:07.571 that the value we got from our survey is within 88 00:07:07.571 --> 00:07:12.571 plus or minus five percentage points of the real value. 89 00:07:13.770 --> 00:07:17.793 And in bivariate or multivariate analysis, 90 00:07:19.440 --> 00:07:24.440 it tells you how likely is the null hypothesis true? 91 00:07:25.710 --> 00:07:29.490 How likely is that this relation, 92 00:07:29.490 --> 00:07:31.917 that there's no relationship versus 93 00:07:31.917 --> 00:07:36.917 there is a relationship between variables. 94 00:07:41.670 --> 00:07:46.670 Recall that the sampling error is the error of, 95 00:07:46.680 --> 00:07:49.740 that you get by not talking to everyone. 96 00:07:49.740 --> 00:07:54.447 It's the difference between the parameter and the statistic. 97 00:07:58.590 --> 00:08:03.590 And we wanna make a claim that what we learn is, 98 00:08:04.170 --> 00:08:08.220 can be generalized. 99 00:08:08.220 --> 00:08:13.220 So we first need a confidence level, we're 95% sure, 100 00:08:15.780 --> 00:08:19.500 and a confidence interval that they are within 101 00:08:19.500 --> 00:08:23.970 plus or minus five percentage point. 102 00:08:23.970 --> 00:08:27.720 This is a measure of the precision of the data. 103 00:08:27.720 --> 00:08:32.250 So a Vermonter Poll question from 2009 said 104 00:08:32.250 --> 00:08:37.250 66.8% of Vermont households had broadband. 
105 00:08:37.500 --> 00:08:42.270 So 66.8% of the sample. 106 00:08:42.270 --> 00:08:47.270 So think about it: since this was a 95/5 poll, 107 00:08:47.910 --> 00:08:52.910 what can we say about the true value 108 00:08:52.950 --> 00:08:55.230 of the number of Vermont households 109 00:08:55.230 --> 00:08:57.813 that had broadband at that time? 110 00:09:02.970 --> 00:09:05.340 When we look at bivariate analysis, 111 00:09:05.340 --> 00:09:09.485 we think about whether groups vary 112 00:09:09.485 --> 00:09:14.220 in how they respond to certain questions. 113 00:09:14.220 --> 00:09:18.570 And usually again or maybe very often, 114 00:09:18.570 --> 00:09:23.250 our independent variable is some sort of demographic 115 00:09:23.250 --> 00:09:26.160 and the dependent is some sort of 116 00:09:26.160 --> 00:09:29.220 attitude, belief, behavior. 117 00:09:29.220 --> 00:09:31.350 So here are three examples. 118 00:09:31.350 --> 00:09:36.350 So in the first one, the independent is major 119 00:09:38.220 --> 00:09:43.220 and the dependent is being a fan of country music; 120 00:09:46.230 --> 00:09:49.080 in the second one, again, it's where you're born, 121 00:09:49.080 --> 00:09:52.800 and then the third one, your class rank 122 00:09:52.800 --> 00:09:54.624 and various behaviors. 123 00:09:54.624 --> 00:09:59.624 Note that we always test against a null hypothesis. 124 00:09:59.910 --> 00:10:04.223 The null hypothesis is these two groups behave the same. 125 00:10:04.223 --> 00:10:08.600 There's no measurable way that we can say that 126 00:10:08.600 --> 00:10:13.600 in any meaningful way these two groups are different. 127 00:10:14.010 --> 00:10:15.120 That basically, 128 00:10:15.120 --> 00:10:20.120 there's no difference in their awareness, 129 00:10:20.339 --> 00:10:24.660 in their preference, in their behavior, et cetera. 130 00:10:24.660 --> 00:10:29.553 And we always, again, test against this null hypothesis.
131 00:10:35.760 --> 00:10:39.513 So when we do this sort of bivariate analysis, 132 00:10:41.130 --> 00:10:46.130 statistical significance is can we reject that null? 133 00:10:46.980 --> 00:10:49.260 How certain can we be? 134 00:10:49.260 --> 00:10:53.280 How sure are we that we can reject the null? 135 00:10:53.280 --> 00:10:58.280 So the null means there's no relationship. 136 00:10:59.190 --> 00:11:00.810 If we reject our null, 137 00:11:00.810 --> 00:11:04.590 it means that we do see that these two groups 138 00:11:04.590 --> 00:11:09.540 act or believe or behave differently. 139 00:11:09.540 --> 00:11:14.540 So we look at how likely is it that we can reject our null, 140 00:11:18.120 --> 00:11:21.480 how likely is it given our data 141 00:11:21.480 --> 00:11:24.870 that we can say with some degree of certainty 142 00:11:24.870 --> 00:11:29.870 that there's a real difference between these groups? 143 00:11:30.420 --> 00:11:32.610 And we use our P value. 144 00:11:32.610 --> 00:11:37.610 So there's nothing magical or written in stone about 0.05. 145 00:11:41.190 --> 00:11:44.460 That's one that we use very often. 146 00:11:44.460 --> 00:11:48.810 And if P is less than 0.05, 147 00:11:48.810 --> 00:11:51.900 it means that we can reject our null 148 00:11:51.900 --> 00:11:56.070 with 95% confidence. 149 00:11:56.070 --> 00:12:00.930 We're 95% sure that our null is false 150 00:12:00.930 --> 00:12:05.430 and there's only a 0.05 chance that 151 00:12:05.430 --> 00:12:10.430 we're saying our null is false when it's actually true. 152 00:12:11.490 --> 00:12:14.250 So one way of thinking about it again, 153 00:12:14.250 --> 00:12:19.250 are CDAE majors that were born in Vermont taller than?
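As a worked version of the broadband question, here is a minimal sketch in Python. It assumes the poll's margin is plus or minus five percentage points at the 95% confidence level, as described for the Vermonter Poll earlier; the function name `confidence_interval` is made up for illustration and is not from the lecture.

```python
def confidence_interval(sample_pct, margin_pts):
    """Return (low, high), in percentage points, around a poll estimate."""
    return (sample_pct - margin_pts, sample_pct + margin_pts)


# 66.8% of sampled households had broadband; assumed margin is +/- 5 points.
low, high = confidence_interval(66.8, 5.0)
print(f"We are 95% sure the true share is between {low:.1f}% and {high:.1f}%")
```

The point is only that the sample statistic plus and minus the interval brackets the population parameter at the stated confidence level.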
154 00:12:21.270 --> 00:12:23.910 And note that we can see one of the ways 155 00:12:23.910 --> 00:12:28.910 that we can state our null is there's no, 156 00:12:30.104 --> 00:12:34.152 the difference in their height, 157 00:12:34.152 --> 00:12:37.620 there's no statistical meaningful difference 158 00:12:37.620 --> 00:12:40.803 in the height of these two groups. 159 00:12:42.210 --> 00:12:45.490 So there are four tests that we use 160 00:12:46.440 --> 00:12:50.340 and it depends on what is our dependent variable. 161 00:12:50.340 --> 00:12:53.290 So that's why I've been spending a lot of time 162 00:12:54.210 --> 00:12:57.988 to help you to think about what kind of variable 163 00:12:57.988 --> 00:13:02.988 nominal, ordinal, interval ratio is our dependent. 164 00:13:03.270 --> 00:13:05.460 And that's going to tell you 165 00:13:05.460 --> 00:13:08.433 what kind of test that we use. 166 00:13:11.730 --> 00:13:15.270 So all of these tests, and I think, 167 00:13:15.270 --> 00:13:18.420 I don't know of any statistical test 168 00:13:18.420 --> 00:13:19.980 that doesn't work like this. 169 00:13:19.980 --> 00:13:22.980 There may be some, but none that I use, 170 00:13:22.980 --> 00:13:26.640 probably none that you will commonly use in your work, 171 00:13:26.640 --> 00:13:30.270 certainly none that we use in this class. 172 00:13:30.270 --> 00:13:35.010 So again, we begin with this null hypothesis. 173 00:13:35.010 --> 00:13:40.010 There's no relationship between the two variables. 174 00:13:42.030 --> 00:13:47.030 And each of these tests gives us a test stat. 175 00:13:48.960 --> 00:13:51.460 And this test stat looks at 176 00:13:55.830 --> 00:13:59.820 how would the data look if the null were true 177 00:13:59.820 --> 00:14:03.300 versus how do they actually look. 178 00:14:03.300 --> 00:14:06.909 and note that we can almost always write our null 179 00:14:06.909 --> 00:14:10.380 as something equals zero. 
180 00:14:10.380 --> 00:14:15.150 So think thinking back to the heights 181 00:14:15.150 --> 00:14:18.340 of these two groups that one way that 182 00:14:19.380 --> 00:14:22.560 you could express a null hypothesis is 183 00:14:22.560 --> 00:14:27.480 the mean height of these two groups are the same. 184 00:14:27.480 --> 00:14:30.090 And another way of saying that is, 185 00:14:30.090 --> 00:14:31.950 the mean height of group A 186 00:14:31.950 --> 00:14:36.573 minus the mean height of group B equals zero. 187 00:14:37.470 --> 00:14:42.470 And that is the null hypothesis that we test. 188 00:14:42.750 --> 00:14:47.130 And the this test stat looks at 189 00:14:47.130 --> 00:14:51.900 how far away from zero is this, 190 00:14:51.900 --> 00:14:56.900 that if you have a large test stat in absolute value, 191 00:14:57.120 --> 00:15:02.120 not close to zero, a large positive or negative number, 192 00:15:03.420 --> 00:15:08.420 that means that much more likely that our null is not true 193 00:15:09.270 --> 00:15:12.420 and we can with greater confidence 194 00:15:12.420 --> 00:15:15.393 reject our null hypothesis. 195 00:15:16.779 --> 00:15:20.932 A very small test stat means that 196 00:15:20.932 --> 00:15:25.932 there is not a lot of difference 197 00:15:26.250 --> 00:15:30.030 between these two groups that in the sense, 198 00:15:30.030 --> 00:15:32.720 A minus B is very close to the zero, 199 00:15:32.720 --> 00:15:36.210 our test stat is very close to zero. 200 00:15:36.210 --> 00:15:39.630 And in this case we failed to reject our null 201 00:15:39.630 --> 00:15:42.443 and conclude that there's no difference 202 00:15:42.443 --> 00:15:44.313 between these groups. 203 00:15:48.150 --> 00:15:50.458 So one way of thinking about this 204 00:15:50.458 --> 00:15:55.458 is looking at two sets of collegiate athletes. 205 00:15:57.840 --> 00:16:02.340 One is football players, varsity football players, 206 00:16:02.340 --> 00:16:07.340 and two is varsity gymnasts. 
207 00:16:07.590 --> 00:16:10.140 Do they have the same body weight? 208 00:16:10.140 --> 00:16:15.140 How could you express this as a null hypothesis? 209 00:16:15.390 --> 00:16:17.010 Think about that. 210 00:16:17.010 --> 00:16:20.400 This is something that you should be able to do. 211 00:16:20.400 --> 00:16:22.164 And then we would test this 212 00:16:22.164 --> 00:16:26.740 and my guess is that we would have a big test stat 213 00:16:27.630 --> 00:16:29.632 a very small P value 214 00:16:29.632 --> 00:16:34.500 and we could say with a lot of confidence that 215 00:16:34.500 --> 00:16:36.744 these two groups are not the same. 216 00:16:36.744 --> 00:16:39.629 We reject our null and we say that 217 00:16:39.629 --> 00:16:43.620 there's a statistically significant difference 218 00:16:43.620 --> 00:16:48.003 in body weight between football players and gymnasts. 219 00:16:52.170 --> 00:16:55.503 So to conclude on this question, 220 00:16:57.150 --> 00:17:01.710 our null hypothesis is that college football players 221 00:17:01.710 --> 00:17:04.476 and college gymnasts have the same average body 222 00:17:04.476 --> 00:17:06.540 weight or mean weight. 223 00:17:06.540 --> 00:17:10.616 We can express this null as the mean weight of footballers 224 00:17:10.616 --> 00:17:14.610 minus the mean weight of gymnast equals zero. 225 00:17:14.610 --> 00:17:17.100 And I think that that if we got the data 226 00:17:17.100 --> 00:17:22.050 and ran the analysis, we would get a great big test stat 227 00:17:22.050 --> 00:17:27.050 and we would with great confidence reject this null 228 00:17:27.955 --> 00:17:30.464 and say that the mean weight 229 00:17:30.464 --> 00:17:33.213 of these two groups are not the same. 230 00:17:38.430 --> 00:17:42.223 This is a good way of thinking about the P value. 231 00:17:46.320 --> 00:17:51.320 So the P value is the probability that you will 232 00:17:57.990 --> 00:18:01.710 reject the null when you shouldn't. 
233 00:18:01.710 --> 00:18:05.550 And think about this graph is, 234 00:18:05.550 --> 00:18:09.300 the probability of getting a test stat 235 00:18:09.300 --> 00:18:14.010 of that size if the null is true. 236 00:18:14.010 --> 00:18:17.310 So you can see, so with this bell curve, 237 00:18:17.310 --> 00:18:21.880 if you get a test stat very near zero, 238 00:18:21.880 --> 00:18:26.880 it's fairly likely that the null is true. 239 00:18:27.390 --> 00:18:32.390 But if you get a test stat far away from zero 240 00:18:32.790 --> 00:18:36.960 way out on the tails of this graph 241 00:18:36.960 --> 00:18:40.980 such as the area shaded in blue here, 242 00:18:40.980 --> 00:18:44.610 it is very unlikely that you would get 243 00:18:44.610 --> 00:18:48.420 that test stat if the null were true. 244 00:18:48.420 --> 00:18:50.286 And the P value there, 245 00:18:50.286 --> 00:18:55.286 that where you see t and minus t on this graph, 246 00:18:56.190 --> 00:19:01.190 the area under this curve to the right of t 247 00:19:05.400 --> 00:19:10.400 and to the left of minus t is 0.05 of the entire area. 248 00:19:15.930 --> 00:19:20.483 So to the right of t is 2 1/2 % 249 00:19:21.600 --> 00:19:26.600 and to the left of minus t is 2 1/2. 250 00:19:26.700 --> 00:19:29.853 So the area under the curve, 251 00:19:31.530 --> 00:19:34.020 the left and right of those points, 252 00:19:34.020 --> 00:19:38.370 these blue shaded area makes 5% 253 00:19:38.370 --> 00:19:42.930 of the total area of the graph. 254 00:19:42.930 --> 00:19:47.930 So if you get a test stat that's to the right of t 255 00:19:48.125 --> 00:19:50.910 or to the left of minus t, 256 00:19:50.910 --> 00:19:53.370 we would reject our null 257 00:19:53.370 --> 00:19:56.183 and conclude that the groups 258 00:20:00.240 --> 00:20:04.290 that we are studying don't respond to the question, 259 00:20:04.290 --> 00:20:09.290 don't have the same value of the dependent variable. 
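The shaded-tail picture can be sketched in a few lines of Python. This is an illustration under a stated assumption: it uses the standard normal curve from Python's `statistics.NormalDist` as a stand-in for the t curve on the slide, which is only a reasonable approximation for large samples. The helper names `two_tailed_p` and `critical_value` are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_tailed_p(test_stat):
    """Area under the curve to the right of |test_stat| plus the area to the left of -|test_stat|."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(test_stat)))


def critical_value(alpha=0.05):
    """Cutoff t such that the two tails beyond -t and +t together hold `alpha` of the area."""
    return STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


cutoff = critical_value(0.05)
for stat in (0.3, 3.0):
    decision = "reject the null" if abs(stat) > cutoff else "fail to reject the null"
    print(f"test stat {stat}: p = {two_tailed_p(stat):.4f}, {decision}")
```

A test stat near zero sits in the fat middle of the curve and gives a large P value; one far out in either tail gives a small P value.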
260 00:20:09.688 --> 00:20:14.043 And again, all test stats basically work like this. 261 00:20:15.990 --> 00:20:19.410 So what kind of tests do you use 262 00:20:19.410 --> 00:20:22.830 based on what kind of variable do you have? 263 00:20:22.830 --> 00:20:27.830 And this, that it's always the dependent variable, 264 00:20:29.760 --> 00:20:34.533 the attitude, the behavior, sort of the outcome. 265 00:20:35.520 --> 00:20:40.520 So if your dependent is nominal, 266 00:20:42.780 --> 00:20:45.210 you use a Chi-Square test. 267 00:20:45.210 --> 00:20:48.180 So our Vermont born students 268 00:20:48.180 --> 00:20:51.783 are more likely to have blue eyes than non Vermont. 269 00:20:52.800 --> 00:20:57.630 Are they are more likely to be Boston Red Sox fans? 270 00:20:57.630 --> 00:21:01.050 So if think about a nominal variable. 271 00:21:01.050 --> 00:21:04.380 And if your dependent is nominal, 272 00:21:04.380 --> 00:21:06.360 you use a Chi-Square test. 273 00:21:06.360 --> 00:21:08.460 And think about, 274 00:21:08.460 --> 00:21:12.930 come up with another example of this on your own. 275 00:21:12.930 --> 00:21:14.880 That's something that you'll have to be able 276 00:21:14.880 --> 00:21:19.880 to do for class is, both look at a question like this 277 00:21:20.880 --> 00:21:25.530 and say this is the right test or if I say, 278 00:21:25.530 --> 00:21:29.400 give an example of a question 279 00:21:29.400 --> 00:21:32.040 where you would use a Chi-Square test, 280 00:21:32.040 --> 00:21:34.650 both of those are absolutely fair game, 281 00:21:34.650 --> 00:21:37.080 something that you should know how to do 282 00:21:37.080 --> 00:21:41.913 for this class on an exam, on an assignment, et cetera. 283 00:21:44.760 --> 00:21:49.760 If you have ordinal, you use a Kruskal-Wallis test. 
284 00:21:50.430 --> 00:21:54.240 So if the dependent is, 285 00:21:54.240 --> 00:21:56.460 how likely are you on a five-point scale 286 00:21:56.460 --> 00:21:58.893 to buy a car in the next year? 287 00:22:00.477 --> 00:22:04.290 And our independent is marital status, 288 00:22:04.290 --> 00:22:08.070 but it could also be what state you were born in, 289 00:22:08.070 --> 00:22:12.753 your major, your class rank, any of those things. 290 00:22:14.340 --> 00:22:17.460 If your dependent variable is ordinal, 291 00:22:17.460 --> 00:22:19.803 you use a Kruskal-Wallis test. 292 00:22:22.590 --> 00:22:27.000 T-tests, so the next few tests are all 293 00:22:27.000 --> 00:22:30.120 if you have interval or ratio. 294 00:22:30.120 --> 00:22:35.120 And now it matters what is the nature of your independent, 295 00:22:36.000 --> 00:22:40.233 like what is the nature of the demographic groups? 296 00:22:41.580 --> 00:22:44.597 So in this example, a T-test 297 00:22:50.580 --> 00:22:55.580 is when you have two groups only: group A, group B. 298 00:22:56.880 --> 00:23:00.690 So here it's CENT majors versus PComm majors, 299 00:23:00.690 --> 00:23:05.460 folks born in Vermont versus not born in Vermont, 300 00:23:05.460 --> 00:23:09.000 Red Sox fans versus non-Red Sox fans. 301 00:23:09.000 --> 00:23:09.833 I don't know. 302 00:23:09.833 --> 00:23:14.040 Think about two groups that you are comparing. 303 00:23:14.040 --> 00:23:18.660 And if your dependent is interval or ratio, 304 00:23:18.660 --> 00:23:22.140 how much do you spend, how much do you earn, 305 00:23:22.140 --> 00:23:26.370 how many miles do you go on your bike or drive? 306 00:23:26.370 --> 00:23:30.180 Anything with a ratio or interval variable, 307 00:23:30.180 --> 00:23:34.683 you use a T-test when you're comparing two groups.
308 00:23:38.460 --> 00:23:40.804 And just to do a little bit more 309 00:23:40.804 --> 00:23:44.520 of a deep dive into a T-test, 310 00:23:44.520 --> 00:23:48.499 a T stat is the difference in the means 311 00:23:48.499 --> 00:23:53.499 between the two groups divided by the standard error. 312 00:23:56.160 --> 00:23:59.709 So you can think about it as the numerator is, 313 00:23:59.709 --> 00:24:04.709 what is the difference between these two groups' means, 314 00:24:07.320 --> 00:24:10.920 and the denominator is how spread out, 315 00:24:10.920 --> 00:24:15.920 how precise or imprecise are the data? 316 00:24:16.410 --> 00:24:20.070 So you see in case one and case two here, 317 00:24:20.070 --> 00:24:22.131 these two groups, 318 00:24:22.131 --> 00:24:27.131 the green group and the blue group, have the same means in both cases. 319 00:24:27.510 --> 00:24:32.100 But in case one, the top set of graphs, 320 00:24:32.100 --> 00:24:35.430 the standard error is much smaller, 321 00:24:35.430 --> 00:24:40.260 it's a much more precise measurement. 322 00:24:40.260 --> 00:24:43.590 So the T stat in case one 323 00:24:43.590 --> 00:24:48.590 would be greater than the T stat in case two. 324 00:24:48.810 --> 00:24:50.340 We would get a bigger T stat 325 00:24:50.340 --> 00:24:54.810 and we would be much more likely to reject the null 326 00:24:54.810 --> 00:24:57.243 in case one than in case two. 327 00:25:02.280 --> 00:25:07.280 We are staying with ratio and interval variables here 328 00:25:07.350 --> 00:25:10.980 and now we look at a test called ANOVA which stands 329 00:25:10.980 --> 00:25:14.103 for analysis of variance. 330 00:25:15.810 --> 00:25:17.940 This is still for ratio and interval, 331 00:25:17.940 --> 00:25:21.750 but now there are more than two groups.
332 00:25:21.750 --> 00:25:26.750 So if you look at CDAE majors, PComm, CENT, CID, 333 00:25:32.850 --> 00:25:36.363 and then measuring their income, 334 00:25:38.040 --> 00:25:38.873 that would be, 335 00:25:38.873 --> 00:25:43.503 since there are three groups, that would be an ANOVA. 336 00:25:45.090 --> 00:25:47.220 And a one-way ANOVA means that 337 00:25:47.220 --> 00:25:51.360 there is one independent variable. 338 00:25:51.360 --> 00:25:55.257 A two-way ANOVA can have two independents, 339 00:25:55.257 --> 00:25:58.140 your major and your year of admission 340 00:25:58.140 --> 00:26:02.370 and the dependent is still income. 341 00:26:02.370 --> 00:26:05.591 And note that a T-test is just sort 342 00:26:05.591 --> 00:26:10.591 of a special case of ANOVA where there are only two groups, 343 00:26:14.160 --> 00:26:18.450 but one-way ANOVA has one independent, 344 00:26:18.450 --> 00:26:21.120 two-way ANOVA has two independents, 345 00:26:21.120 --> 00:26:24.330 but in both cases your dependent is a ratio 346 00:26:24.330 --> 00:26:27.444 or interval variable like income, expenditure, 347 00:26:27.444 --> 00:26:30.183 mileage and so on. 348 00:26:33.210 --> 00:26:35.463 Now let's look at regression. 349 00:26:36.330 --> 00:26:40.320 How certain are we that one of our betas, 350 00:26:40.320 --> 00:26:44.760 that one of our independent variables like X1 351 00:26:44.760 --> 00:26:48.390 has a significant effect on our Y? 352 00:26:48.390 --> 00:26:52.530 How can we know that? 353 00:26:52.530 --> 00:26:54.540 Is there a measurable, 354 00:26:54.540 --> 00:26:59.250 a significant change in Y when we change X? 355 00:26:59.250 --> 00:27:03.480 So if Y is expenditure and X1 is income, 356 00:27:03.480 --> 00:27:07.680 does an increase in income on average 357 00:27:07.680 --> 00:27:12.360 result in an increase of expenditure on some good 358 00:27:12.360 --> 00:27:15.960 or a decrease or no effect?
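The rule of thumb so far, that the level of measurement of the dependent variable picks the test, can be written as a small lookup. This is a sketch for study purposes, not course software: the function name `choose_test` and its arguments are made up here, and the "anova" branch anticipates the more-than-two-groups case that the lecture turns to after the T-test.

```python
def choose_test(dependent_level, n_groups=2):
    """Pick a bivariate test from the dependent variable's level of measurement.

    dependent_level: "nominal", "ordinal", "interval" or "ratio".
    n_groups: number of groups defined by the independent variable.
    """
    level = dependent_level.strip().lower()
    if level == "nominal":
        return "chi-square"
    if level == "ordinal":
        return "kruskal-wallis"
    if level in ("interval", "ratio"):
        if n_groups < 2:
            raise ValueError("need at least two groups to compare")
        # Exactly two groups: T-test. More than two: ANOVA.
        return "t-test" if n_groups == 2 else "anova"
    raise ValueError(f"unknown level of measurement: {dependent_level!r}")


print(choose_test("nominal"))            # e.g. Red Sox fan or not
print(choose_test("ordinal"))            # e.g. five-point likelihood scale
print(choose_test("ratio", n_groups=2))  # e.g. spending, two groups
```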
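Going back to the T stat for a moment, the case one versus case two comparison can be reproduced with made-up numbers. The sketch below assumes the unequal-variance (Welch) form of the standard error, since the lecture does not say which form is used, and the data are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def t_stat(group_a, group_b):
    """Difference in group means divided by the standard error of that difference."""
    diff = mean(group_a) - mean(group_b)
    # Welch-style standard error (an assumption; the lecture does not specify the form).
    std_err = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return diff / std_err


# Case one: tightly clustered data. Case two: same group means, much more spread.
case_one = ([9.0, 10.0, 11.0], [19.0, 20.0, 21.0])
case_two = ([0.0, 10.0, 20.0], [10.0, 20.0, 30.0])

print("case one T stat:", round(t_stat(*case_one), 2))
print("case two T stat:", round(t_stat(*case_two), 2))
```

Both cases have the same difference in means, so the numerator is identical. Only the denominator changes, and the tighter data give the T stat that is larger in absolute value.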
359 00:27:15.960 --> 00:27:20.780 So again, in a regression Y is a function of X, 360 00:27:22.710 --> 00:27:24.120 Y is our dependent, 361 00:27:24.120 --> 00:27:28.170 X is a series of independence. 362 00:27:28.170 --> 00:27:32.982 It might be things like age, income, 363 00:27:32.982 --> 00:27:37.710 experience, education and so on. 364 00:27:37.710 --> 00:27:40.650 And we look at holding all else equal, 365 00:27:40.650 --> 00:27:45.650 what is the effect of a unit change of one of the Xs on Y? 366 00:27:47.790 --> 00:27:50.700 And then our error term is that part of the model 367 00:27:50.700 --> 00:27:53.443 that's not explained by the Xs. 368 00:28:58.290 --> 00:29:03.290 And the regression software calculates this Beta0 369 00:29:05.370 --> 00:29:08.460 and Beta1 such that 370 00:29:08.460 --> 00:29:12.897 the distance between the real observation and the model, 371 00:29:17.520 --> 00:29:20.760 the real observation and the estimated value 372 00:29:20.760 --> 00:29:22.800 is as small as it can, 373 00:29:22.800 --> 00:29:25.740 it makes the error as small as it can. 374 00:29:25.740 --> 00:29:27.210 And if there are, 375 00:29:27.210 --> 00:29:32.210 you can run a model with a large number of regressors like K 376 00:29:33.360 --> 00:29:37.860 and that there will Beta0, Beta1, 377 00:29:37.860 --> 00:29:42.860 where each Beta is, Beta1 is what is the effect 378 00:29:45.184 --> 00:29:50.184 of a one unit change in X1 holding all else equal? 379 00:29:54.120 --> 00:29:56.850 When we look at statistical significance, 380 00:29:56.850 --> 00:30:01.850 our null hypothesis is that beta equals zero for some beta. 381 00:30:03.240 --> 00:30:05.610 And we look at them basically, 382 00:30:05.610 --> 00:30:09.180 one at a time. 383 00:30:09.180 --> 00:30:14.180 And that would be like, the regression line is horizontal. 384 00:30:15.475 --> 00:30:20.190 A change in X result in no change at all. 
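To make "the software picks Beta0 and Beta1 so the error is as small as it can be" concrete, here is a minimal sketch for the one-regressor case, using the textbook least-squares formulas. The function name `fit_line` and the data are hypothetical; a real package would also report standard errors and P values, which this sketch leaves out.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (beta0, beta1) minimizing the sum of squared errors for y = beta0 + beta1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx              # slope: change in Y per one-unit change in X
    beta0 = y_bar - beta1 * x_bar  # intercept
    return beta0, beta1


xs = [1.0, 2.0, 3.0, 4.0]
print(fit_line(xs, [3.0, 5.0, 7.0, 9.0]))  # Y rises with X
print(fit_line(xs, [4.0, 4.0, 4.0, 4.0]))  # flat line: what the null (beta1 = 0) looks like
```

The second call shows the situation the null hypothesis describes: a horizontal regression line, where changing X changes nothing in Y.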
385 00:30:20.190 --> 00:30:25.190 And y, and again, for a P value less than 0.05, 386 00:30:27.450 --> 00:30:28.792 we reject our null. 387 00:30:28.792 --> 00:30:33.792 We say that there's a 95% chance that X has a real effect. 388 00:30:35.400 --> 00:30:40.260 For P-values greater than that we failed to reject our null 389 00:30:40.260 --> 00:30:44.900 and we cannot say when we failed to reject our null, 390 00:30:44.900 --> 00:30:48.886 we're saying that we are very uncertain 391 00:30:48.886 --> 00:30:52.203 that X has any effect on Y. 392 00:30:56.400 --> 00:31:00.870 The most common kind of regression is 393 00:31:00.870 --> 00:31:04.530 called Ordinary Least Squares or OLS. 394 00:31:04.530 --> 00:31:08.727 And this is a linear regression 395 00:31:13.620 --> 00:31:18.620 where the effect of Y is modeled as the sum of 396 00:31:18.930 --> 00:31:21.933 the effect of all of these Xs. 397 00:31:23.550 --> 00:31:28.357 That Y is sort of what each individual says 398 00:31:30.120 --> 00:31:33.030 for the dependent value on their survey. 399 00:31:33.030 --> 00:31:38.030 X is what each individual says for each independent. 400 00:31:38.700 --> 00:31:42.553 These betas are the estimated slope coefficient. 401 00:31:44.370 --> 00:31:46.350 And again, our error term is that part 402 00:31:46.350 --> 00:31:49.473 of Y that is not explained by X. 403 00:31:51.600 --> 00:31:54.960 So here's an example, 404 00:31:54.960 --> 00:31:59.960 money spent on food where X1 is income, X2 is age 405 00:32:00.660 --> 00:32:02.940 and X3 is being female. 406 00:32:02.940 --> 00:32:04.580 And note that, 407 00:32:04.580 --> 00:32:08.880 so Beta1 is what is the effect of Y 408 00:32:08.880 --> 00:32:13.880 when we increase their income by $1. 409 00:32:16.230 --> 00:32:20.640 If Beta1 is significant and greater than zero, 410 00:32:20.640 --> 00:32:22.770 it means give them more money, 411 00:32:22.770 --> 00:32:24.420 they'll spend more on food. 
412 00:32:24.420 --> 00:32:28.530 If it's significant and less than zero, 413 00:32:28.530 --> 00:32:32.490 it's give 'em more money and they'll spend less on food. 414 00:32:32.490 --> 00:32:37.490 And the benefit, if we cannot reject our null, 415 00:32:37.560 --> 00:32:42.560 we say that give them more money and it will have no effect, 416 00:32:44.760 --> 00:32:48.660 and the same for the age and for being female 417 00:32:48.660 --> 00:32:51.543 as opposed to being non female. 418 00:32:54.240 --> 00:32:57.003 There are many kinds of regression. 419 00:33:00.030 --> 00:33:02.280 And note that, so first of all, 420 00:33:02.280 --> 00:33:06.480 the most common one is ordinary least squares 421 00:33:06.480 --> 00:33:11.480 and that is when Y is a ratio or an interval variable. 422 00:33:13.530 --> 00:33:18.420 It's not binary, it's not ordinal and it's not nominal. 423 00:33:18.420 --> 00:33:23.276 So think about what are some dependent variables 424 00:33:23.276 --> 00:33:24.810 that we could measure. 425 00:33:24.810 --> 00:33:26.812 I gave a bunch of examples, expenditure, 426 00:33:26.812 --> 00:33:29.763 mileage, things like that. 427 00:33:32.400 --> 00:33:36.843 A person's carbon footprint and CO2 equivalent, 428 00:33:38.250 --> 00:33:39.453 many examples. 429 00:33:41.970 --> 00:33:46.890 If you have a dependent variable that is binary, 430 00:33:46.890 --> 00:33:49.650 do you own a refillable water bottle? 431 00:33:49.650 --> 00:33:51.930 Do you earn a degree? Do you have a bike? 432 00:33:51.930 --> 00:33:54.480 Do you smoke or vape? 433 00:33:54.480 --> 00:33:58.530 I hope that you don't and don't because it's very bad. 434 00:33:58.530 --> 00:34:02.130 But that you can model this, 435 00:34:02.130 --> 00:34:07.130 is where 1 is a yes, when y equals 1 or well, 436 00:34:09.240 --> 00:34:13.500 if they say yes, they own a bottle, Y equals 1, 437 00:34:13.500 --> 00:34:17.430 if they say no, I don't add a bottle, Y equals 0. 
438 00:34:17.430 --> 00:34:22.430 And a positive sign on X such as class rank say, 439 00:34:23.460 --> 00:34:28.460 means that those with a larger class rank, 440 00:34:29.340 --> 00:34:32.280 seniors as opposed to the first years, 441 00:34:32.280 --> 00:34:35.754 are more likely to own a bottle. 442 00:34:35.754 --> 00:34:40.754 And if your dependent is ordinal, how likely are you? 443 00:34:44.400 --> 00:34:47.016 How much do you agree with? 444 00:34:47.016 --> 00:34:48.570 You use what's called 445 00:34:48.570 --> 00:34:52.410 either ordered probit or ordered logit. 446 00:34:52.410 --> 00:34:57.410 And here again, a positive and significant beta on X 447 00:34:57.540 --> 00:35:02.540 means that a larger response in X 448 00:35:03.780 --> 00:35:05.280 makes it more like, sort of, 449 00:35:05.280 --> 00:35:10.280 more agreement or more likely to, basically, 450 00:35:11.220 --> 00:35:16.083 it corresponds to a higher number on the scale. 451 00:35:19.530 --> 00:35:24.530 So how do you interpret regression results? 452 00:35:24.750 --> 00:35:27.903 And we will look at this in class. 453 00:35:28.770 --> 00:35:33.270 First you look at which of the variables, 454 00:35:33.270 --> 00:35:35.280 the Xs, are significant. 455 00:35:35.280 --> 00:35:37.230 And many times, 456 00:35:37.230 --> 00:35:40.710 this is a very common thing that you'll see. 457 00:35:40.710 --> 00:35:43.650 If P is less than 0.10, 458 00:35:43.650 --> 00:35:47.250 it will have one asterisk or one star, 459 00:35:47.250 --> 00:35:50.010 if P is less than 0.05 it would have two, 460 00:35:50.010 --> 00:35:53.700 and if P is less than 0.01, it'll have three. 461 00:35:53.700 --> 00:35:56.160 So we look at which ones are significant.
462 00:35:56.160 --> 00:35:58.860 If they're not, if there's no asterisk, 463 00:35:58.860 --> 00:36:03.300 if their P is greater than 0.10 464 00:36:03.300 --> 00:36:07.560 the convention is that we don't reject our null 465 00:36:07.560 --> 00:36:12.560 and we say thus, this X has no significant effect on Y. 466 00:36:13.350 --> 00:36:16.773 But if they are, then look at the sign. 467 00:36:18.750 --> 00:36:23.750 So if we look at our example, 468 00:36:24.690 --> 00:36:28.350 Y equals local food expenditure, 469 00:36:28.350 --> 00:36:32.343 X1 equals income X2 equals age and X3 equals female, 470 00:36:34.200 --> 00:36:37.620 how would you interpret this? 471 00:36:37.620 --> 00:36:41.220 What is a Beta1 of 0.01 mean? 472 00:36:41.220 --> 00:36:45.900 What does a Beta3 of 5.25 mean? 473 00:36:45.900 --> 00:36:48.330 How would you interpret that? 474 00:36:48.330 --> 00:36:52.083 And we will learn how to do that in class. 475 00:36:54.270 --> 00:36:56.733 All right. So this is what we did. 476 00:36:58.860 --> 00:37:03.860 We talked about the various tests that we do 477 00:37:04.080 --> 00:37:07.413 and we talked about statistical significance. 478 00:37:09.090 --> 00:37:13.680 And this is what you should know, the descriptive stat. 479 00:37:13.680 --> 00:37:18.663 What does inference mean with confidence levels? 480 00:37:19.500 --> 00:37:23.643 What tests go with what type of variable? 481 00:37:24.750 --> 00:37:27.123 How do you interpret regression? 482 00:37:28.500 --> 00:37:30.273 Which types of regression go 483 00:37:30.273 --> 00:37:34.350 with which types of dependent variables? 484 00:37:34.350 --> 00:37:37.620 And what does statistical significance mean? 485 00:37:37.620 --> 00:37:39.727 So this is all things that you should know 486 00:37:39.727 --> 00:37:44.727 based on this topic. 487 00:37:45.390 --> 00:37:46.233 Thank you.