All right, before we start talking about data analytics, let's talk a little bit about data literacy, because in order to do data analytics, you need to be data literate. So what does data literacy mean? Before I define it in explicit terms, this image requires data literacy even to understand. It's not just marks on a page and a burnt match. I mean, it is, but it means something: it looks like a chart, a chart-like thing. The match is not just a match; it represents data, a line. Only a data literate person would even be aware of that idea.

So what is a specific definition of data literacy? It's the ability to read, work with, analyze, and argue with data. That's a good definition, and it comes from MIT. Another way of thinking about it, a little simpler I would argue, is that it's the ability to think about and do stuff with data. Think about: look at data, hear data, understand data, think about what it means. Do stuff: actually do data analytics and everything that goes with it.

We all have varying degrees of data literacy. Data scientists are up here, and a first grader is down here. We want to live somewhere on that spectrum, hopefully closer to the top than the bottom, but you don't have to be a data scientist to work with data. Data literacy essentially requires some data fluency: you speak the language of data. You also need some statistics and analytical skills; again, not all the way up here, but somewhere in there. And data literacy does require visualization as well, which is one of the reasons we're talking about it here.

Now, data literacy on an organizational level also requires the democratization and ubiquity of data, and the tools to access data.
It implies that an organization has made data central to what it does, and therefore everybody has access to it.

Now, one thing that's interesting: a survey came out maybe three or four years ago from Qlik. They make data analytics and data visualization software, essentially business intelligence software. It found that 24% of business decision makers consider themselves data literate. My data literacy hat essentially makes my body tingle when I read this, because even this simple thing, this one number, is, I would argue, not the way to talk about it from a data literacy standpoint. It's not that 24% are data literate. The shocking statistic here is that three out of four business decision makers are not data literate. They're data illiterate. So how are they making data-driven decisions if they're not data literate? It makes absolutely no sense.

So data literacy, yes, is practical, tangible, real skills and outcomes, but it's more than that. There's other stuff involved in data literacy. It's understanding things like: what do these numbers actually mean? The website dollarstreet.org is a place where you can look up, in this case, income, and get a very visceral understanding of what different levels of income mean by looking at, in this case, toothbrushes. What does dental care, brushing your teeth, look like in Burundi versus Ukraine, based on, essentially, income level, or at least something correlated with income level? You can look at other things as well: transportation, homes, all kinds of things, using photography as a way of understanding what these numbers mean. That's data literacy.

There's also this example from the book "Factfulness" by Hans Rosling; you'll hear his name again. Another aspect of data literacy is asking: do the numbers make sense?
So, as an example here, and by the way, only 9% of Norwegian teachers get this question correct, and only one in four World Economic Forum participants, which is kind of shocking, I would argue. But here's the thing. Look at the visual here: world population from 8,000 BC to today. Thousands and thousands of years ago, there were essentially zero people on Earth. Very, very few, especially compared to today; a few million. And over thousands of years, that remained essentially static. Then eventually came the industrial revolution, and then boom, skyrocketing population. Now, if you zoom in on that timeline, not looking at thousands of years but maybe just the last several hundred years, it's fairly steady, and then it goes up, certainly, not quite the same hockey stick but still pretty dramatic.

And the question that Norwegian teachers get wrong is: what is the world population going to look like in the year 2100? This is an example where, of course, we assume the line is just going to continue to go up the way it's been going. We tend to do that. That's one of the mistakes we make as humans: we extrapolate numbers and assume the trend will continue unchanged. It turns out, based on demography and statistical analysis and all kinds of good stuff, that the population actually looks likely to peak by 2100 at around 11 billion people. So essentially it's going to keep going up, but it's already going to be curving, and then eventually start coming back down soon after that. You need data literacy to understand that. You need to literally know the data to really know that number.

But we just make mistakes and assumptions. A good example is this: yesterday I had zero spouses. Today, I got married, so today I have one spouse. How many spouses will I have by the end of the year? Oh my goodness, right? It doesn't work that way.
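Just to make that extrapolation trap concrete, here is a minimal sketch of what blindly extending the "trend" from the spouse joke would predict. The wedding date is made up purely for illustration.

```python
# A minimal sketch of the blind-extrapolation trap from the spouse example.
# The "today" date is hypothetical, chosen only for illustration.
from datetime import date

yesterday_spouses = 0
today_spouses = 1
daily_change = today_spouses - yesterday_spouses   # the observed "trend": +1 spouse per day

today = date(2024, 6, 1)                           # hypothetical date of the wedding
year_end = date(2024, 12, 31)
days_remaining = (year_end - today).days

# Naive linear extrapolation: assume the daily change continues unchanged.
projected = today_spouses + daily_change * days_remaining
print(projected)   # 214 spouses by New Year's Eve -- obviously absurd
```

The point isn't the arithmetic; it's that the same blind line-extension feels much more reasonable when the subject is world population, and it's just as wrong.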
So in some cases it's obvious; in other cases, it isn't. You have to be data literate, even if you can't predict the actual answer, to know the questions to ask so you don't make silly mistakes.

Data literacy also requires understanding things like data quality issues and sample quality. Is it a big enough sample? What is the makeup of the sample? What about methodology: what method did you use? A Twitter poll to make your decisions? Or did you actually do a real statistical analysis, a randomized, double-blind study, et cetera? And by the way, is this even a useful analysis? A national poll to predict U.S. presidential elections? Not so useful.

Data literacy also requires understanding the difference between correlation and causation. You've probably heard the phrase "correlation doesn't equal causation." All right, well, what does that mean exactly? Let's talk about what correlation is. There's something called the correlation coefficient, or r, which is measured on a scale from negative one to one. It's a statistical measure of whether variables are correlated, meaning they're moving in the same direction at the same time as each other. Closer to one or negative one means a stronger correlation, either positive or negative: as one number goes up, the other goes up, or as one number goes up, the other comes down. That may be a little too much detail; don't worry about it except to understand this. You see a scatter plot, a bunch of random dots? Clearly, these are not correlated. I don't need the r value to understand whether these are correlated. No, they're not; it looks like a random collection of dots. These, on the other hand, look pretty nicely correlated: as one number goes up to the right, the other goes up, meaning up on the y-axis. Clearly they're clustered and moving in tandem. This dataset is very correlated, with maybe a couple of outliers. So I imagine the r value is going to be much higher, much closer to one, on this one than on the random dots.

But the thing is, correlation does not equal causation, even in that previous example. It's very heavily correlated, but that doesn't prove that one causes the other. A classic example: the divorce rate in Maine is statistically correlated with the consumption of margarine. It's a kind of weird one, but it's true. I don't think we can prove, though, that the consumption of margarine increases divorce, or vice versa. That might be true, but we haven't proven it in the data.
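If you want to see the correlation coefficient in action, here is a minimal sketch using NumPy and synthetic data; none of these numbers come from the charts in the video. It compares a random cloud of dots with a cloud where y rises with x.

```python
# A small sketch of Pearson's r on two toy datasets: one pure noise,
# one where y moves up with x. The data are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0, 10, 200)

# "Random collection of dots": y has nothing to do with x.
y_noise = rng.uniform(0, 10, 200)

# "Nicely correlated": y follows x with a bit of scatter.
y_linear = 2 * x + rng.normal(0, 1, 200)

r_noise = np.corrcoef(x, y_noise)[0, 1]
r_linear = np.corrcoef(x, y_linear)[0, 1]

print(f"random dots:      r = {r_noise:+.2f}")   # close to 0
print(f"correlated cloud: r = {r_linear:+.2f}")  # close to +1
```

And even an r close to one here says nothing about causation, which is exactly the margarine-and-divorce point.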
Another thing data literate people understand is the difference between signal and noise: what is useful information, and what's just a distraction? What's important versus merely interesting? This comes in all kinds of different forms. Essentially, the only way to detect it, to really think about it, is to understand what question you're actually asking of your data. Data literate people know what they're asking of their data. Data illiterate people just look at the numbers and evaluate them without that context, which isn't very helpful.

It's also very, very important to leave bias behind, and we're going to talk more about bias in a moment. But before we do that, it's worth knowing that data literacy is also about context. Numbers are meaningless without context. Take this example: if we look at this guy, he has good form, he looks like he knows what he's doing. But once we have the context, we see that things aren't quite what they seem.
The numbers themselves may be meaningless, or may mean the opposite of what they appear to, once we introduce context. So what kinds of context do we need for data analytics? Answers to questions like: are these numbers good or bad? And good or bad compared to what, by the way? Should we show a rate instead of the actual values? I'm going to throw a number out there: 87. Is that good or bad? And out of how many? 87 on its own means nothing. But 87 out of 10 million? That's a tiny number. 87 out of 90 is a huge number. Understanding a value as a rate, whether it's a percentage, or per capita, or per 1,000, or per 100,000, is one of the most important things we do in data analytics: we convert values into rates and we evaluate the rates instead of, or at least in addition to, the actual values themselves.

Other questions we ask ourselves as data literate people: compared to what, which I already said. Does time matter? Does geography matter? Questions like: someone's showing me the average, but maybe I should be looking at the median, or the mode? Anyone remember what the mode is? Most people forget that one. It's a good one, something to keep in mind; we almost never refer to it by that name. Long story short, you have to be data literate to even know to ask that question.
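Here is a small sketch of both of those ideas: turning a raw count into a rate, and the difference between mean, median, and mode. The 87 comes from the example above; the other values are made up for illustration.

```python
# Converting a raw value into rates, and comparing mean / median / mode.
import statistics

# The same raw value, 87, against two very different denominators.
for denominator in (10_000_000, 90):
    pct = 87 / denominator * 100
    per_100k = 87 / denominator * 100_000
    print(f"87 out of {denominator:,}: {pct:.4f}% ({per_100k:,.1f} per 100,000)")

# Mean vs. median vs. mode on a small, skewed set of made-up values.
values = [1, 2, 2, 2, 3, 4, 40]
print("mean:  ", statistics.mean(values))    # about 7.7 -- dragged up by the outlier
print("median:", statistics.median(values))  # 2
print("mode:  ", statistics.mode(values))    # 2 -- the most frequent value
```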
So, rates: using ratios, using indexing, other ways of converting raw values into something else by essentially dividing them by other numbers. An example: in 2015, there were 9.9 million people living in extreme poverty around the world. That's a big number, right? Until you learn that it was 35.9 million people in 1990. In other words, it decreased by 72% in those, what, 25 years or so? That's pretty interesting. That's a big deal. But we didn't factor in population. In 2015, the world population was 7.34 billion people, versus 5.28 billion in 1990. So it wasn't really a 72% drop: if we look at the per capita rate, it's actually more than an 80% decrease. That's a big deal, and only a data literate person could have expressed that difference.
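As a quick check of that arithmetic, here is the same comparison using the figures quoted above, with both the poverty counts and the populations expressed in millions.

```python
# Raw-count drop vs. per-capita drop, using the figures quoted in the talk (millions).
poverty_2015, pop_2015 = 9.9, 7_340      # 2015: 9.9 million of 7.34 billion
poverty_1990, pop_1990 = 35.9, 5_280     # 1990: 35.9 million of 5.28 billion

# Raw values: roughly a 72% drop.
raw_drop = 1 - poverty_2015 / poverty_1990
print(f"drop in raw count:       {raw_drop:.0%}")          # 72%

# Per-capita rates: the drop is even larger, a bit over 80%.
rate_2015 = poverty_2015 / pop_2015
rate_1990 = poverty_1990 / pop_1990
per_capita_drop = 1 - rate_2015 / rate_1990
print(f"drop in per-capita rate: {per_capita_drop:.0%}")   # 80%
```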
So let's talk about bias for a moment. It's a very important topic. We're not going to talk about it much at all in this course, but I want to touch on it briefly here. There are a lot of cognitive biases. Humans have deeply flawed psychologies; we make mistakes all the time. And there are a bunch of biases that apply to data. I'm just going to talk about a few of them.

Number one is something called selection bias: I make a mistake in who I select to study. As an example, say I did a poll asking, do you love ice hockey, yes or no? And I asked five people. Sample size: that's not enough people. And I asked people coming out of a hockey game: that's my selection bias. So I tell the world that 100% of people love ice hockey, because I asked five people coming out of a hockey game. Maybe not quite so accurate.

Then there's recall bias: we think we remember things a certain way, and it turns out we're wrong. There was a study that was done, and this is a real result: women with breast cancer ate higher-fat diets when they were young. That was the result of the study. Maybe that's true. It turned out it wasn't, or at least they couldn't prove it was true. What they realized was happening is that women with breast cancer, answering a survey question about what they did when they were younger, had bad recall. Essentially, they assumed that because they had cancer, they must have done something wrong. So they were much more likely to say, yes, I ate a lot of high-fat food when I was young, compared to women who didn't have breast cancer. Purely a recall bias problem. My personal example: when I was in college, many, many years ago, I distinctly remember watching Seinfeld in my freshman dorm room with my roommates. I was wrong. Seinfeld didn't launch until I was a senior in college. So recall bias is a big one.

Another one is called survivor bias. This is the most famous visual used to describe survivor bias. If you look at it, this is a World War II plane. The story is, and I think it's a true story, that the planes would come back, the maintenance crews would look at them, and the planes had bullet holes all over them, as you can see. This was the overall pattern of bullet holes across however many planes they looked at. And so they said, hey, you know what we should do? We should add armor to the tips of the wings, to that place behind the cockpit, and to the tail, where all those bullet holes are, and we'll be in good shape. Exactly wrong. Why? This is survivor bias: these are the planes that made it back. These are the planes essentially proving the point that bullets can hit the tips of the wings and the tail, et cetera, and you can still fly; otherwise they wouldn't have made it back. So the answer is: don't put the armor there. Put it where you don't see the bullet holes, because those are the planes that didn't make it home. Cockpit, engines, et cetera.

Key biases also include confirmation bias. My favorite quote, and I use this one all the time: if you torture your data long enough, it will confess.
That's a classic problem in data analytics: I didn't find anything interesting in this data, so I kept going until I found something interesting. Now, that one isn't necessarily a bad thing, because if you find nothing in your data, sometimes the answer is that there's nothing going on here, and you have to be ready for that and be okay with it. Sometimes you do have to keep digging; you have to go down that rabbit hole and keep going, and eventually maybe you'll find something. But you have to find a balance there. More importantly: I found the opposite of what I was looking for, so I'm going to keep looking until I find what my boss asked me to find. You always want to avoid that.

And last but not least, not because this is the most important bias, or because these are the only biases that affect data, but just because I like it, there's one called Simpson's paradox. I love this one; it's a very good one to be aware of. And of course, it's not The Simpsons, it's this. The basic idea is that in some datasets, the overall trend or pattern may look one way. In this case, the overall trend line goes down and to the right for the entire dataset. But if you look at the dataset in segments, you may find that the segments clearly go in very different directions. In this case, both of the segments go in the opposite direction: they have completely opposite trends from the overall combined trend. Something to be very thoughtful about. You can find interesting insights in your data if you segment it in certain ways; not to mention, you may find your segments tell literally the opposite story of what you see overall.
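Here is a minimal sketch of Simpson's paradox with synthetic data: each segment trends up, but the pooled dataset trends down, just like the chart described above.

```python
# Simpson's paradox on synthetic data: positive slopes within each segment,
# negative slope when the segments are pooled.
import numpy as np

rng = np.random.default_rng(1)

# Segment A: low x values, high baseline, y rises with x.
x_a = rng.uniform(0, 5, 100)
y_a = 10 + 1.0 * x_a + rng.normal(0, 0.5, 100)

# Segment B: high x values, low baseline, y also rises with x.
x_b = rng.uniform(5, 10, 100)
y_b = 1.0 * x_b + rng.normal(0, 0.5, 100)

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])

print(f"segment A slope: {slope(x_a, y_a):+.2f}")     # about +1: up and to the right
print(f"segment B slope: {slope(x_b, y_b):+.2f}")     # about +1: up and to the right
print(f"combined slope:  {slope(x_all, y_all):+.2f}")  # negative: the pooled trend points down
```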
Okay, data literacy: it's a mindset as much as anything else. It's about critical thinking. It's about being a little bit paranoid, being aware of certain things, and just asking a million questions. If you do that, you'll find stories in your data.