WEBVTT 1 00:00:03.540 --> 00:00:05.250 Hello and welcome to the video lecture 2 00:00:05.250 --> 00:00:08.370 on Quantitative Data Analysis. 3 00:00:08.370 --> 00:00:12.430 So this is the first of two parts of how do you analyze data 4 00:00:13.530 --> 00:00:15.910 such as that you would get in a survey 5 00:00:18.240 --> 00:00:23.240 and that's what we'll be doing as well with our research project. 6 00:00:27.150 --> 00:00:28.390 So we're gonna look at 7 00:00:29.340 --> 00:00:34.340 how to analyze data one variable at a time, univariate. 8 00:00:36.210 --> 00:00:38.100 Bivariate, which is two at a time. 9 00:00:38.100 --> 00:00:42.870 Multivariate, which is many variables at the same time. 10 00:00:42.870 --> 00:00:44.430 And we're gonna do a review 11 00:00:44.430 --> 00:00:48.780 of the four types of variables that we talked about 12 00:00:48.780 --> 00:00:51.723 just because I wanna make sure that you understand that. 13 00:00:55.050 --> 00:01:00.050 So univariate analysis is looking at a single variable. 14 00:01:03.690 --> 00:01:06.540 Usually we look at frequencies 15 00:01:06.540 --> 00:01:08.970 and if it's a ratio or interval, 16 00:01:08.970 --> 00:01:13.080 we can also look at central tendency and dispersion, 17 00:01:13.080 --> 00:01:18.080 which tells us valuable information about that variable. 18 00:01:22.260 --> 00:01:25.500 So a frequency analysis 19 00:01:25.500 --> 00:01:30.500 is how many respondents gave that answer. 20 00:01:31.440 --> 00:01:33.780 It can be done by the number of answers, 21 00:01:33.780 --> 00:01:36.273 it could be done by the percentages. 22 00:01:37.440 --> 00:01:40.200 And especially for nominal, 23 00:01:40.200 --> 00:01:43.980 this would be really the only way to do it, 24 00:01:43.980 --> 00:01:48.980 since they don't have a unit, they're not numerical. 25 00:01:50.760 --> 00:01:55.760 Usually you report ordinal variables in this way as well.
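As a small illustration of the frequency analysis just described, the sketch below counts how many respondents gave each answer and converts the counts to percentages. The list `responses` (state of birth) is made up for illustration and is not data from the lecture.

```python
from collections import Counter

# Hypothetical nominal survey responses (state of birth); not from the lecture.
responses = ["VT", "NY", "VT", "MA", "VT", "NY", "NH", "VT"]

counts = Counter(responses)   # number of respondents per answer
total = len(responses)        # total number of respondents

# Percentage of respondents giving each answer, rounded to one decimal place.
percentages = {answer: round(100 * n / total, 1) for answer, n in counts.items()}

# Report each answer as a count and a percentage, most common first.
for answer, n in counts.most_common():
    print(f"{answer}: {n} ({percentages[answer]}%)")
```

Because the answers are only labels, counts and percentages are the whole summary here; nothing in the code averages the values.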
26 00:02:00.330 --> 00:02:04.113 So if you have a ratio or interval, 27 00:02:06.270 --> 00:02:09.300 you can talk about the Central Tendency. 28 00:02:09.300 --> 00:02:14.300 And the three of those are Mean, Median and Mode. 29 00:02:15.270 --> 00:02:18.600 And you can think about which one 30 00:02:18.600 --> 00:02:23.490 gives the best information about a given variable, 31 00:02:23.490 --> 00:02:25.350 about the data. 32 00:02:25.350 --> 00:02:28.770 Which one can best sort of tell the story 33 00:02:28.770 --> 00:02:30.063 of what's happening. 34 00:02:34.410 --> 00:02:37.860 Mean is when you simply add up all the values 35 00:02:37.860 --> 00:02:42.090 and divide by the number of responses. 36 00:02:42.090 --> 00:02:47.090 The weakness of this is that if there are extreme values, 37 00:02:48.390 --> 00:02:50.220 they will skew it. 38 00:02:50.220 --> 00:02:55.220 So an example here where the mean age is 21, 39 00:02:56.160 --> 00:03:00.910 but that really doesn't do a very good job of describing it. 40 00:03:05.850 --> 00:03:10.850 Median is when you rank them from high to low 41 00:03:11.400 --> 00:03:14.793 and take the one in the middle. 42 00:03:16.650 --> 00:03:19.780 If it's an odd number, it's the middle one. 43 00:03:19.780 --> 00:03:21.900 If it's an even number, 44 00:03:21.900 --> 00:03:24.180 you take the two middle ones 45 00:03:24.180 --> 00:03:27.374 and take the mean of those two. 46 00:03:27.374 --> 00:03:31.500 In many cases, things like home prices or income, 47 00:03:31.500 --> 00:03:36.157 things that have some very high, extreme values can... 48 00:03:39.060 --> 00:03:41.913 That you'll often see them expressed as median. 49 00:03:43.830 --> 00:03:46.980 And finally, mode is how many... 50 00:03:46.980 --> 00:03:51.090 Or what was the one response that the greatest number 51 00:03:51.090 --> 00:03:53.935 of individuals gave.
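To make the three measures concrete, here is a minimal sketch using Python's standard `statistics` module. The list `ages` is invented for illustration (it is not the data on the lecture slide); it is chosen so that one extreme value pulls the mean up to 21 while the median and mode stay low.

```python
import statistics

# Hypothetical ages: four young children and one much older person.
ages = [2, 2, 3, 4, 94]

mean_age = statistics.mean(ages)      # add up all values, divide by the count
median_age = statistics.median(ages)  # middle value after sorting
mode_age = statistics.mode(ages)      # most frequently given value

print(mean_age, median_age, mode_age)

# With an even number of responses, the median is the mean of the two middle values.
even_median = statistics.median([1, 2, 3, 4])
print(even_median)
```

In this made-up list nobody is anywhere near 21, which is the sense in which a mean can fail to describe the data when extreme values are present.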
52 00:03:53.935 --> 00:03:57.150 And that can be good if there's sort of 53 00:03:57.150 --> 00:04:02.150 a cluster of responses, it can help to tell that. 54 00:04:03.840 --> 00:04:07.170 So in class we'll think a bit more about when is mean, 55 00:04:07.170 --> 00:04:09.423 median and mode the best, 56 00:04:10.860 --> 00:04:13.620 under which circumstances is each one 57 00:04:13.620 --> 00:04:16.380 sort of the best way to tell what is happening 58 00:04:16.380 --> 00:04:17.283 with our data. 59 00:04:23.760 --> 00:04:27.600 Another thing that you can do with numerical data 60 00:04:27.600 --> 00:04:32.600 like ratio or interval that has a unit 61 00:04:32.730 --> 00:04:35.820 is to measure the dispersion. 62 00:04:35.820 --> 00:04:40.620 How spread out are the data. 63 00:04:42.960 --> 00:04:44.670 So you can see here, 64 00:04:44.670 --> 00:04:48.900 there's two bell curves here that have the same mean, 65 00:04:48.900 --> 00:04:53.070 but the one in blue has a much small... 66 00:04:53.070 --> 00:04:57.360 The data are much more sort of huddled around the middle. 67 00:04:57.360 --> 00:05:00.600 That you get like a tall thin peak there, 68 00:05:00.600 --> 00:05:03.360 as opposed to the one in red, 69 00:05:03.360 --> 00:05:07.320 which is much more sort of spread out 70 00:05:07.320 --> 00:05:10.770 even though they have the same mean. 71 00:05:10.770 --> 00:05:13.830 This is a measure of precision 72 00:05:13.830 --> 00:05:18.270 and it tends to be expressed in standard deviation 73 00:05:18.270 --> 00:05:20.640 or in variance. 74 00:05:20.640 --> 00:05:25.640 And basically the standard deviation measures on average, 75 00:05:27.309 --> 00:05:32.193 how far away from the mean is the average response. 76 00:05:34.470 --> 00:05:39.333 And it's the positive square root of the variance. 77 00:05:46.890 --> 00:05:48.003 So as I said, 78 00:05:49.140 --> 00:05:54.140 for a nominal variable you wanna report the frequency.
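The dispersion idea from the lecture (same mean, different spread) can be sketched with two made-up data sets. The names `tight` and `wide` and their values are assumptions for illustration only. The sketch uses the population formulas (`pvariance`, `pstdev`, which divide by N); for a sample you would more likely use `statistics.variance` and `statistics.stdev`, which divide by N - 1.

```python
import math
import statistics

# Two hypothetical data sets with the same mean but different spread.
tight = [48, 49, 50, 51, 52]   # huddled around the middle (tall, thin peak)
wide = [30, 40, 50, 60, 70]    # much more spread out

for name, data in (("tight", tight), ("wide", wide)):
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)   # mean squared distance from the mean
    std_dev = statistics.pstdev(data)       # positive square root of the variance
    print(f"{name}: mean={mean}, variance={variance}, std dev={std_dev:.2f}")

# The standard deviation is the square root of the variance.
assert math.isclose(statistics.pstdev(wide), math.sqrt(statistics.pvariance(wide)))
```

Strictly speaking, the standard deviation is a root-mean-square distance from the mean rather than a simple average distance, but the reading "typical distance from the mean" is usually close enough for interpretation.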
79 00:05:54.480 --> 00:05:56.430 And we'll talk about why... 80 00:05:56.430 --> 00:05:57.263 We'll talk in class, 81 00:05:57.263 --> 00:06:01.623 why does the mean of the nominal variable not make sense. 82 00:06:03.540 --> 00:06:07.560 Ordinal, some people say use frequencies. 83 00:06:07.560 --> 00:06:11.490 Sometimes you will report the mean. 84 00:06:11.490 --> 00:06:12.820 But note that 85 00:06:14.550 --> 00:06:17.343 because there are no units, 86 00:06:18.690 --> 00:06:20.340 frequency might be best. 87 00:06:20.340 --> 00:06:23.760 Like you could say on a five point scale 88 00:06:23.760 --> 00:06:27.420 that the mean response was 3.8. 89 00:06:27.420 --> 00:06:31.770 The sort of danger there is that your four 90 00:06:31.770 --> 00:06:36.160 and my four aren't the same but many still do that. 91 00:06:38.310 --> 00:06:41.130 For interval and ratio, 92 00:06:41.130 --> 00:06:46.130 we very often report the mean. 93 00:06:46.530 --> 00:06:49.290 You could do a frequency, 94 00:06:49.290 --> 00:06:53.610 but especially if you have a lot of subjects 95 00:06:53.610 --> 00:06:55.500 and a lot of responses, 96 00:06:55.500 --> 00:06:59.340 it's just going to be a really long data table 97 00:06:59.340 --> 00:07:03.660 that may not provide that much valuable information, 98 00:07:03.660 --> 00:07:06.450 that's not in the mean, median, mode 99 00:07:06.450 --> 00:07:09.723 and standard deviation. 100 00:07:13.080 --> 00:07:16.440 Next, we'll jump to bivariate analysis. 101 00:07:16.440 --> 00:07:19.950 The relationship between two variables. 102 00:07:19.950 --> 00:07:24.950 And this is very often done as a cross tab. 103 00:07:25.410 --> 00:07:30.120 So you can see here the rows are where folks live 104 00:07:30.120 --> 00:07:33.660 and the columns are what's their favorite baseball team. 105 00:07:33.660 --> 00:07:38.230 And you can see it has it reported both 106 00:07:40.999 --> 00:07:44.333 in numbers and in percentages.
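A cross tab like the one described (rows for where people live, columns for favorite baseball team, cells reported as counts and percentages) can be sketched in plain Python. The slide's actual numbers are not in the transcript, so the `survey` pairs below are invented for illustration, and the use of row percentages is an assumption about how the percentages were computed.

```python
from collections import Counter

# Hypothetical (residence, favorite team) pairs; not the data from the slide.
survey = [
    ("Boston", "Red Sox"), ("Boston", "Red Sox"), ("Boston", "Yankees"),
    ("New York", "Yankees"), ("New York", "Yankees"), ("New York", "Red Sox"),
    ("New York", "Yankees"),
]

cell_counts = Counter(survey)                       # count per (row, column) cell
row_totals = Counter(place for place, _ in survey)  # respondents per row

rows = sorted(row_totals)
cols = sorted({team for _, team in survey})

# Row percentages: within each place of residence, the share favoring each team.
row_pct = {
    (place, team): round(100 * cell_counts[(place, team)] / row_totals[place], 1)
    for place in rows
    for team in cols
}

# Print one line per row, each cell as "count (percent)".
for place in rows:
    cells = ", ".join(
        f"{team}: {cell_counts[(place, team)]} ({row_pct[(place, team)]}%)"
        for team in cols
    )
    print(f"{place} -> {cells}")
```

Percentaging within rows makes the subgroups directly comparable even when they contain different numbers of respondents.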
107 00:07:53.160 --> 00:07:57.030 This allows for subgroup comparisons. 108 00:07:57.030 --> 00:08:02.030 Very often, it's thought of as you think of the demographic 109 00:08:03.270 --> 00:08:06.990 as the independent and some sort of attitude 110 00:08:06.990 --> 00:08:11.070 or behavior as the dependent. 111 00:08:11.070 --> 00:08:15.150 So you could think about how different age groups, 112 00:08:15.150 --> 00:08:17.040 their views on gay marriage, 113 00:08:17.040 --> 00:08:20.910 student versus faculty use of tobacco, 114 00:08:20.910 --> 00:08:24.240 where the demographic is age or occupation, 115 00:08:24.240 --> 00:08:25.440 which we just said. 116 00:08:25.440 --> 00:08:29.980 And the dependent is some sort of behavior or 117 00:08:32.520 --> 00:08:34.833 belief, attitude, things like that. 118 00:08:37.860 --> 00:08:42.860 Multivariate is where we have several independents. 119 00:08:44.040 --> 00:08:46.233 So you might wanna think about, 120 00:08:50.010 --> 00:08:52.980 like your class year and your major 121 00:08:52.980 --> 00:08:57.980 and its effect on some sort of dependent variable, 122 00:08:58.200 --> 00:09:01.080 like income or again like where you work, 123 00:09:01.080 --> 00:09:06.080 or your support for some view and things like that. 124 00:09:10.230 --> 00:09:13.410 The last example of multivariate 125 00:09:13.410 --> 00:09:17.340 that I wanted to talk to you about is Regression. 126 00:09:17.340 --> 00:09:22.293 And here, you have one dependent, 127 00:09:24.540 --> 00:09:28.393 and a whole lot of independents. 128 00:09:29.760 --> 00:09:33.990 Here is a case where we only have one independent 129 00:09:33.990 --> 00:09:35.580 and you can think about 130 00:09:35.580 --> 00:09:40.580 if Y is how much was spent on local food in a month, 131 00:09:41.070 --> 00:09:42.813 and X is income. 132 00:09:44.850 --> 00:09:49.320 This B1, this beta one measures how...
133 00:09:49.320 --> 00:09:54.320 What is the predicted change in expenditure on local food 134 00:09:55.650 --> 00:09:59.190 for a unit change in income. 135 00:09:59.190 --> 00:10:02.220 That on average, if you put one more dollar 136 00:10:02.220 --> 00:10:04.860 into somebody's pocket, 137 00:10:04.860 --> 00:10:09.860 how much of it are they going to spend on local food? 138 00:10:10.110 --> 00:10:13.620 And then the error term is the part of the model 139 00:10:13.620 --> 00:10:16.740 that's not explained by the Xs. 140 00:10:16.740 --> 00:10:21.183 And we're gonna talk a lot more about regression. 141 00:10:23.280 --> 00:10:27.093 And here's a video about it if you wanna watch that. 142 00:10:30.690 --> 00:10:32.460 So here's an example, 143 00:10:32.460 --> 00:10:36.090 where there's two regressors, two Xs, 144 00:10:36.090 --> 00:10:40.200 two independents and it might be something like, 145 00:10:40.200 --> 00:10:42.210 age and income. 146 00:10:42.210 --> 00:10:46.170 And again, you look at each of these Bs 147 00:10:46.170 --> 00:10:51.170 as the change in Y for a given change in X. 148 00:10:52.560 --> 00:10:56.250 And you can think of it as the slope of that line, 149 00:10:56.250 --> 00:11:01.250 so if you have Y your dependent on your Y axis, 150 00:11:02.190 --> 00:11:06.988 and one of your independents on your X axis, 151 00:11:06.988 --> 00:11:11.430 and you see what is the slope of that line, 152 00:11:11.430 --> 00:11:13.650 of that best fit line? 153 00:11:13.650 --> 00:11:16.860 And that is the average change in Y 154 00:11:16.860 --> 00:11:18.993 for a given amount in X. 155 00:11:22.860 --> 00:11:27.090 We can add a whole lot of regressors, 156 00:11:27.090 --> 00:11:28.470 some number K, 157 00:11:28.470 --> 00:11:31.320 many models might have 10 or more. 158 00:11:31.320 --> 00:11:34.740 And well what this does is holding all else equal.
159 00:11:34.740 --> 00:11:39.570 So holding say age and household size constant, 160 00:11:39.570 --> 00:11:43.020 it looks just at the effect of income. 161 00:11:43.020 --> 00:11:45.884 And again, the goal of it is to find 162 00:11:45.884 --> 00:11:50.580 the marginal effect, these betas, or these Bs 163 00:11:50.580 --> 00:11:55.580 of a change in X1 holding all else equal. 164 00:11:57.180 --> 00:12:00.990 Well, what is the expected or predicted, 165 00:12:00.990 --> 00:12:04.233 or average change in Y? 166 00:12:09.390 --> 00:12:14.390 So just a quick reminder of the four types of data. 167 00:12:16.770 --> 00:12:21.770 Nominal is where it has no rank, no high or low, 168 00:12:22.890 --> 00:12:26.730 it's just a name for things, like what state you were born in. 169 00:12:26.730 --> 00:12:31.140 Ordinal tends to be a scale, a rating, or a ranking. 170 00:12:31.140 --> 00:12:34.143 How likely are you to do something? 171 00:12:35.730 --> 00:12:39.300 Interval has no true zero, 172 00:12:39.300 --> 00:12:41.100 like what year you were born. 173 00:12:41.100 --> 00:12:46.100 And ratio can be expressed as you know, 174 00:12:46.110 --> 00:12:50.460 a response can be twice as big as another 175 00:12:50.460 --> 00:12:52.350 and this has a true zero. 176 00:12:52.350 --> 00:12:54.120 And that can be something like number 177 00:12:54.120 --> 00:12:55.953 of people in your household. 178 00:12:56.910 --> 00:13:01.910 Also, dollars or height, or miles away, or things like that. 179 00:13:02.070 --> 00:13:07.070 And these also have units, both interval and ratio, 180 00:13:09.420 --> 00:13:11.850 variables have units. 181 00:13:11.850 --> 00:13:13.860 Years, 182 00:13:13.860 --> 00:13:16.230 pounds, dollars, 183 00:13:16.230 --> 00:13:19.233 whereas nominal and ordinal have none. 184 00:13:21.990 --> 00:13:25.530 So see if you can get this quiz right. 185 00:13:25.530 --> 00:13:29.103 What kind of variables are these?
186 00:13:30.085 --> 00:13:33.660 And if you want to pause it and then go to the next one 187 00:13:33.660 --> 00:13:38.660 and well, when I'm done speaking and change the slide. 188 00:13:39.870 --> 00:13:42.423 It will tell you which is which. 189 00:13:43.410 --> 00:13:48.410 So pause now and see if you can get it right. 190 00:13:54.281 --> 00:13:55.131 And here you are. 191 00:13:57.260 --> 00:13:58.410 So the first one, 192 00:13:58.410 --> 00:14:03.410 expenditure is ratio; you can spend zero. 193 00:14:03.480 --> 00:14:08.480 If you spend 20 and I spend 10, you spend twice as much. 194 00:14:10.860 --> 00:14:15.300 The sort of a scale here is ordinal. 195 00:14:15.300 --> 00:14:17.820 Degrees Fahrenheit are interval. 196 00:14:17.820 --> 00:14:21.153 And then what's your major is nominal. 197 00:14:23.100 --> 00:14:24.630 So this is what we did. 198 00:14:24.630 --> 00:14:28.140 We talked about univariate, bivariate, 199 00:14:28.140 --> 00:14:29.790 multivariate and regression. 200 00:14:29.790 --> 00:14:32.403 And we talked more about variable types. 201 00:14:34.170 --> 00:14:36.033 And this is what you should know. 202 00:14:37.740 --> 00:14:38.573 Thank you.
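As a closing illustration of the single-regressor model discussed in the lecture (Y is monthly spending on local food, X is income), here is a minimal sketch of fitting the best-fit line by ordinary least squares in plain Python. The numbers in `income` and `local_food` are invented for illustration, and the names `b0` and `b1` stand in for the intercept and the slope (beta one); none of this is taken from the lecture's slides.

```python
# Hypothetical monthly income (X) and local food spending (Y), in dollars.
income = [1000, 2000, 3000, 4000, 5000]
local_food = [60, 110, 150, 210, 250]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(local_food) / n

# Least-squares slope: co-variation of X and Y divided by the variation in X.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, local_food))
s_xx = sum((x - mean_x) ** 2 for x in income)

b1 = s_xy / s_xx            # predicted change in Y for a one-unit change in X
b0 = mean_y - b1 * mean_x   # intercept: predicted Y when X is zero

# Residuals are the part of Y that the fitted line does not explain (the error term).
residuals = [y - (b0 + b1 * x) for x, y in zip(income, local_food)]

print(f"slope b1 = {b1:.3f}, intercept b0 = {b0:.1f}")
```

With these made-up numbers the slope is 0.048, which would be read as roughly five cents of each additional dollar of income going to local food. With several regressors the same idea applies to each coefficient while holding the others constant, which in practice is usually done with a statistics package rather than by hand.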