WEBVTT 1 00:00:00.357 --> 00:00:03.390 All right, so we're feeling data literate, hopefully, 2 00:00:03.390 --> 00:00:04.680 and it's time to start thinking 3 00:00:04.680 --> 00:00:05.940 about doing some data analytics. 4 00:00:05.940 --> 00:00:07.830 And there's a little bit more prep work, 5 00:00:07.830 --> 00:00:09.660 a little bit more thinking to go through 6 00:00:09.660 --> 00:00:11.160 before we do that. 7 00:00:11.160 --> 00:00:11.993 Number one, 8 00:00:11.993 --> 00:00:12.826 you gotta think about 9 00:00:12.826 --> 00:00:15.930 whether you're doing descriptive, predictive, 10 00:00:15.930 --> 00:00:18.060 or prescriptive analytics. 11 00:00:18.060 --> 00:00:19.980 So let's define those. 12 00:00:19.980 --> 00:00:24.240 Descriptive analytics is describing what has happened. 13 00:00:24.240 --> 00:00:25.860 What happens, okay? 14 00:00:25.860 --> 00:00:30.000 In other words, it's looking at the past existing data. 15 00:00:30.000 --> 00:00:32.730 99% of the time, that's probably what you do. 16 00:00:32.730 --> 00:00:35.100 You look at existing data to figure out what happened. 17 00:00:35.100 --> 00:00:38.880 Okay, predictive analytics, as the word would imply 18 00:00:38.880 --> 00:00:42.210 is making predictions about the future. 19 00:00:42.210 --> 00:00:43.470 Now, you can only do that 20 00:00:43.470 --> 00:00:45.540 based on that descriptive analytics, right? 21 00:00:45.540 --> 00:00:46.560 So we understand the past 22 00:00:46.560 --> 00:00:48.810 and the patterns of what has happened, 23 00:00:48.810 --> 00:00:50.400 then we can predict the future. 24 00:00:50.400 --> 00:00:54.870 Now, it's worth noting predictive analytics is specific. 25 00:00:54.870 --> 00:00:56.490 It's not this happened in the past, 26 00:00:56.490 --> 00:00:58.050 therefore it's gonna keep happening. 
27 00:00:58.050 --> 00:00:59.550 It's more like this happened, 28 00:00:59.550 --> 00:01:00.383 but you know, 29 00:01:00.383 --> 00:01:03.000 maybe it's gonna go up by 4.7% next year, right? 30 00:01:03.000 --> 00:01:06.960 So specific ideas, otherwise it's not really analytics. 31 00:01:06.960 --> 00:01:09.330 Then there's also prescriptive analytics, 32 00:01:09.330 --> 00:01:11.490 which is when we make recommendations, 33 00:01:11.490 --> 00:01:15.780 prescriptions to alter that future. 34 00:01:15.780 --> 00:01:17.010 So if we understand the past, 35 00:01:17.010 --> 00:01:18.720 we make predictions about the future, 36 00:01:18.720 --> 00:01:20.970 we recommend changes, 37 00:01:20.970 --> 00:01:22.350 and then we can also, by the way, 38 00:01:22.350 --> 00:01:24.600 re-predict what's gonna happen: 39 00:01:24.600 --> 00:01:25.980 based on these changes, we expect it 40 00:01:25.980 --> 00:01:29.310 to change by 47.2, blah, blah, blah, right? 41 00:01:29.310 --> 00:01:30.720 So these are the three types 42 00:01:30.720 --> 00:01:32.280 of analytics we're doing, 43 00:01:32.280 --> 00:01:33.420 generally, like I said, 44 00:01:33.420 --> 00:01:35.160 you're probably doing descriptive analytics, 45 00:01:35.160 --> 00:01:35.993 that's all good, 46 00:01:35.993 --> 00:01:38.790 but I wanna share a really interesting story 47 00:01:38.790 --> 00:01:42.000 about predictive analytics. 48 00:01:42.000 --> 00:01:44.940 Target, years ago, famous case, 49 00:01:44.940 --> 00:01:46.830 they noticed looking at their data, 50 00:01:46.830 --> 00:01:48.060 like this is like data science 51 00:01:48.060 --> 00:01:51.990 digging deep into like bazillions of transactions, 52 00:01:51.990 --> 00:01:56.340 and they noticed that women were buying, 53 00:01:56.340 --> 00:01:58.560 all of a sudden, outta nowhere, 54 00:01:58.560 --> 00:02:01.020 a bunch of unscented lotion.
55 00:02:01.020 --> 00:02:02.280 And this isn't like universally, 56 00:02:02.280 --> 00:02:03.690 but certain women would buy 57 00:02:03.690 --> 00:02:05.740 a lot of unscented lotion all of a sudden 58 00:02:06.690 --> 00:02:09.420 and then a couple, few months later, 59 00:02:09.420 --> 00:02:12.720 they would start buying diapers, and baby bottles, 60 00:02:12.720 --> 00:02:14.700 and all kinds of things like this. 61 00:02:14.700 --> 00:02:18.150 So, logical conclusion was, ah, interesting, 62 00:02:18.150 --> 00:02:21.240 if we notice women buying a bunch of unscented lotion, 63 00:02:21.240 --> 00:02:23.100 which they haven't done before, 64 00:02:23.100 --> 00:02:24.240 that means that they're pregnant, 65 00:02:24.240 --> 00:02:26.700 and specifically that they were 66 00:02:26.700 --> 00:02:29.880 in their third trimester of pregnancy. 67 00:02:29.880 --> 00:02:32.280 And lo and behold, then what they started to do, 68 00:02:32.280 --> 00:02:36.120 was they started to market baby products to these women 69 00:02:36.120 --> 00:02:38.640 during those couple, few months. 70 00:02:38.640 --> 00:02:40.890 So they made this predictive analytics 71 00:02:40.890 --> 00:02:44.640 understanding, and then the prescriptive analytics: 72 00:02:44.640 --> 00:02:45.990 let's market baby stuff to them. 73 00:02:45.990 --> 00:02:47.850 And of course, I'm sure they predicted 74 00:02:47.850 --> 00:02:49.050 how much profit they would make. 75 00:02:49.050 --> 00:02:50.970 But here's where it got interesting. 76 00:02:50.970 --> 00:02:55.970 There was a 16 year old girl in Minnesota or somewhere 77 00:02:55.980 --> 00:02:58.530 and these flyers started coming to the house 78 00:02:58.530 --> 00:03:01.920 with her name on them, selling her baby products.
79 00:03:01.920 --> 00:03:03.990 And the girl's father freaked out, 80 00:03:03.990 --> 00:03:06.090 called up Target, screaming at people, 81 00:03:06.090 --> 00:03:08.340 why are you sending this stuff to my daughter, 82 00:03:08.340 --> 00:03:09.933 what the hell's going on here? 83 00:03:10.950 --> 00:03:14.340 Turns out she was pregnant, she hadn't told him, okay, 84 00:03:14.340 --> 00:03:16.620 so the predictive analytics got it right, 85 00:03:16.620 --> 00:03:19.470 but of course it introduces some, you know, 86 00:03:19.470 --> 00:03:22.080 good conversation to have about ethics 87 00:03:22.080 --> 00:03:24.180 and all kinds of things we might want to do 88 00:03:24.180 --> 00:03:26.673 with our data, privacy issues, et cetera. 89 00:03:27.570 --> 00:03:30.390 So great example, although troubling at the same time. 90 00:03:30.390 --> 00:03:35.100 So data analytics is steps, methodologies, 91 00:03:35.100 --> 00:03:36.690 things you need to do, 92 00:03:36.690 --> 00:03:38.190 but just like data literacy, 93 00:03:38.190 --> 00:03:40.110 it's about critical thinking, 94 00:03:40.110 --> 00:03:41.100 like I talked about before, 95 00:03:41.100 --> 00:03:44.160 it's about thinking very carefully about what you're doing, 96 00:03:44.160 --> 00:03:47.610 how you're gonna do it, why you're even doing it. 97 00:03:47.610 --> 00:03:48.990 And what it's really about, 98 00:03:48.990 --> 00:03:50.850 I think I've already said this before, 99 00:03:50.850 --> 00:03:53.580 it's about asking questions. 100 00:03:53.580 --> 00:03:57.990 What questions do you want your data to answer? 101 00:03:57.990 --> 00:03:59.790 What questions do you need to answer 102 00:03:59.790 --> 00:04:02.910 to do what you need to do with your data, okay? 103 00:04:02.910 --> 00:04:07.910 And it's often about forming explicit hypotheses.
104 00:04:08.070 --> 00:04:10.530 Not always, sometimes you're just exploring a data set 105 00:04:10.530 --> 00:04:12.180 and wondering what's gonna pop out of it. 106 00:04:12.180 --> 00:04:14.790 Data science often is like that 107 00:04:14.790 --> 00:04:17.760 but very frequently you have specific hypotheses. 108 00:04:17.760 --> 00:04:19.680 Hey, we notice this is happening in the business, 109 00:04:19.680 --> 00:04:22.140 or in science, or whatever the case may be. 110 00:04:22.140 --> 00:04:24.150 We think that this might be causing it. 111 00:04:24.150 --> 00:04:25.530 Then we test that hypothesis. 112 00:04:25.530 --> 00:04:28.440 Either way, you're testing your data, 113 00:04:28.440 --> 00:04:30.750 you're answering questions with your data. 114 00:04:30.750 --> 00:04:33.990 So you gotta form questions to ask of your data. 115 00:04:33.990 --> 00:04:34.830 And you wanna start, 116 00:04:34.830 --> 00:04:37.770 we wanna start with very broad questions, okay? 117 00:04:37.770 --> 00:04:41.670 Really high level 30,000 foot view questions, 118 00:04:41.670 --> 00:04:43.560 but you narrow in on those details, 119 00:04:43.560 --> 00:04:45.900 and every question might go a little bit deeper, 120 00:04:45.900 --> 00:04:47.550 a little bit deeper, a little bit deeper, 121 00:04:47.550 --> 00:04:49.620 and the deeper you go, 122 00:04:49.620 --> 00:04:52.770 the easier it is to identify specific variables, 123 00:04:52.770 --> 00:04:55.140 specific things you're looking at, 124 00:04:55.140 --> 00:04:58.380 and more nuanced answers can come out of your data 125 00:04:58.380 --> 00:05:00.120 as part of that process. 126 00:05:00.120 --> 00:05:03.060 You're continuing to ask who, what, when, where, why, 127 00:05:03.060 --> 00:05:04.170 all along the way 128 00:05:04.170 --> 00:05:06.330 of these different fields in your data.
129 00:05:06.330 --> 00:05:07.800 And you should also be asking yourself, 130 00:05:07.800 --> 00:05:09.960 this is more the data literacy hat, 131 00:05:09.960 --> 00:05:11.130 what am I missing? 132 00:05:11.130 --> 00:05:12.390 What's wrong here? 133 00:05:12.390 --> 00:05:14.730 Be a little bit paranoid, like I mentioned earlier. 134 00:05:14.730 --> 00:05:17.100 But we talked about this earlier, 135 00:05:17.100 --> 00:05:19.530 you ask questions like, is it good or bad, 136 00:05:19.530 --> 00:05:20.700 compared to what, 137 00:05:20.700 --> 00:05:24.060 should the rate be used here versus the values, 138 00:05:24.060 --> 00:05:26.910 et cetera, et cetera, et cetera. 139 00:05:26.910 --> 00:05:30.780 Now, as you're doing this, as you're analyzing data, 140 00:05:30.780 --> 00:05:31.830 what I recommend you do, 141 00:05:31.830 --> 00:05:36.690 is you think about using the Toyota 5 Whys process. 142 00:05:36.690 --> 00:05:40.110 So Toyota, many, many, many decades ago, 143 00:05:40.110 --> 00:05:40.943 and I'm talking about 144 00:05:40.943 --> 00:05:42.690 the founder of what became the Toyota group, 145 00:05:42.690 --> 00:05:45.810 whose name was Sakichi Toyoda, came up with the 5 Whys. 146 00:05:45.810 --> 00:05:49.830 The basic idea is you should ask why is this happening, 147 00:05:49.830 --> 00:05:50.880 of any problem, 148 00:05:50.880 --> 00:05:52.830 anything you're investigating. 149 00:05:52.830 --> 00:05:55.200 The answer to that why 150 00:05:55.200 --> 00:05:58.170 usually should lead to other questions, right? 151 00:05:58.170 --> 00:06:00.870 And you should ask why again, why is that happening? 152 00:06:00.870 --> 00:06:02.520 Okay, why is it this way? 153 00:06:02.520 --> 00:06:05.070 Well, why, tell me why in more detail.
154 00:06:05.070 --> 00:06:06.930 By the time you get to the fifth why, 155 00:06:06.930 --> 00:06:08.700 the idea was, 156 00:06:08.700 --> 00:06:11.040 you'll get to the actual underlying cause 157 00:06:11.040 --> 00:06:13.380 rather than that surface answer. 158 00:06:13.380 --> 00:06:14.340 And so you can think about it 159 00:06:14.340 --> 00:06:16.560 as the 5 Whys or the 5 FUs. 160 00:06:16.560 --> 00:06:20.040 And I don't mean that F you, I mean follow-ups. 161 00:06:20.040 --> 00:06:21.750 You should be continually 162 00:06:21.750 --> 00:06:24.480 following up with your questions 163 00:06:24.480 --> 00:06:26.310 and eventually you get to the ground level 164 00:06:26.310 --> 00:06:29.190 where you really find the great insights in your data. 165 00:06:29.190 --> 00:06:32.460 So we're gonna do an exercise, okay? 166 00:06:32.460 --> 00:06:34.290 And the exercise is this. 167 00:06:34.290 --> 00:06:38.468 We have a data set, which is an IMDB data set. 168 00:06:38.468 --> 00:06:40.050 So IMDB is a website, 169 00:06:40.050 --> 00:06:41.790 Internet Movie Database, 170 00:06:41.790 --> 00:06:43.230 I think is what it stands for. 171 00:06:43.230 --> 00:06:44.310 I'm sure you've all seen it. 172 00:06:44.310 --> 00:06:47.610 It's, you know, essentially any movie, 173 00:06:47.610 --> 00:06:50.400 or any actor, or director, or whatever, 174 00:06:50.400 --> 00:06:51.330 you can look it up there, 175 00:06:51.330 --> 00:06:53.160 and you'll find out who was in the movie, 176 00:06:53.160 --> 00:06:54.660 what other movies that person was in, 177 00:06:54.660 --> 00:06:55.891 et cetera, et cetera, et cetera. 178 00:06:55.891 --> 00:06:58.800 This is a massive database of movie data. 179 00:06:58.800 --> 00:07:00.090 And I found a dataset 180 00:07:00.090 --> 00:07:05.090 on Kaggle of the top 1000 movies from IMDB, 181 00:07:06.840 --> 00:07:08.820 and we're gonna take a look at it in a second.
182 00:07:08.820 --> 00:07:11.790 And the basic idea is using this dataset, 183 00:07:11.790 --> 00:07:13.920 and by the way this dataset, 184 00:07:13.920 --> 00:07:15.750 undergrads taking this course, 185 00:07:15.750 --> 00:07:17.250 can use this dataset. 186 00:07:17.250 --> 00:07:18.990 That's the one that I mentioned earlier. 187 00:07:18.990 --> 00:07:22.260 I've done some pre-analysis with this data, okay? 188 00:07:22.260 --> 00:07:24.870 So if you wanna do your class project using this dataset, 189 00:07:24.870 --> 00:07:28.290 if you're an undergrad, go for it, all good. 190 00:07:28.290 --> 00:07:29.220 For grad students, 191 00:07:29.220 --> 00:07:31.413 you need to use your own research. 192 00:07:32.280 --> 00:07:35.850 So this dataset, we're gonna look at this data in a second, 193 00:07:35.850 --> 00:07:37.530 but the basic idea is this, 194 00:07:37.530 --> 00:07:39.660 I want you to think about how, 195 00:07:39.660 --> 00:07:42.780 what questions can we ask of this data? 196 00:07:42.780 --> 00:07:45.750 And then how can we turn those questions 197 00:07:45.750 --> 00:07:48.540 into specific metrics, variables, 198 00:07:48.540 --> 00:07:51.210 and fields in the data set, 199 00:07:51.210 --> 00:07:54.570 to find the answers to our questions, okay? 200 00:07:54.570 --> 00:07:56.520 That's what data analytics is all about. 201 00:07:56.520 --> 00:07:59.460 So you also don't wanna forget the 5 FUs. 202 00:07:59.460 --> 00:08:00.750 You're gonna notice things in the data 203 00:08:00.750 --> 00:08:02.340 as you're analyzing it, 204 00:08:02.340 --> 00:08:03.930 and you should be constantly asking, okay, 205 00:08:03.930 --> 00:08:04.763 well that's interesting, 206 00:08:04.763 --> 00:08:06.780 but what does that mean going deeper? 207 00:08:06.780 --> 00:08:11.100 Okay, now let's do a couple few together to start. 208 00:08:11.100 --> 00:08:13.650 All right, so let's look at the data set. 
209 00:08:13.650 --> 00:08:17.670 So we have, as you see here, the name of the movie. 210 00:08:17.670 --> 00:08:20.910 And by the way, this is the IMDB top 1000, 211 00:08:20.910 --> 00:08:23.400 which is the top 1000 movies, I think, by the, 212 00:08:23.400 --> 00:08:25.320 what's called the META score here. 213 00:08:25.320 --> 00:08:27.840 Okay, so the META score is like a, 214 00:08:27.840 --> 00:08:32.310 merged together, aggregated, averaged out score 215 00:08:32.310 --> 00:08:36.210 based on critics ratings of the movie, I believe. 216 00:08:36.210 --> 00:08:38.370 So anyways, we have the name of the movie, 217 00:08:38.370 --> 00:08:41.280 we have the year it was released, runtime, 218 00:08:41.280 --> 00:08:42.270 we have the genre. 219 00:08:42.270 --> 00:08:43.890 And in fact, you'll notice that some of these are, 220 00:08:43.890 --> 00:08:46.260 it's more than one comma delimited. 221 00:08:46.260 --> 00:08:47.760 We'll talk more about that. 222 00:08:47.760 --> 00:08:49.410 We have the IMDB rating, 223 00:08:49.410 --> 00:08:52.560 which is what people visiting the IMDB website 224 00:08:52.560 --> 00:08:54.450 had given it as their rating 225 00:08:54.450 --> 00:08:57.000 up to a score of 10, I believe. 226 00:08:57.000 --> 00:08:57.833 We have an overview, 227 00:08:57.833 --> 00:09:00.390 which is just a short text description of the movie. 228 00:09:00.390 --> 00:09:01.890 We have that META score that I mentioned. 229 00:09:01.890 --> 00:09:06.240 Then we have the director and then four of the actors. 230 00:09:06.240 --> 00:09:08.370 And I don't know if these are in order, 231 00:09:08.370 --> 00:09:11.580 I don't know why it's star 1, 2, 3, 4 exactly. 232 00:09:11.580 --> 00:09:13.110 We also have the number of votes, 233 00:09:13.110 --> 00:09:14.910 the number of those people visiting the website 234 00:09:14.910 --> 00:09:16.980 giving it that IMDB rating. 
235 00:09:16.980 --> 00:09:18.780 And we also have the gross, 236 00:09:18.780 --> 00:09:23.400 which is essentially the gross revenues that the movie made. 237 00:09:23.400 --> 00:09:25.860 So this is our data set. 238 00:09:25.860 --> 00:09:30.000 What questions might we ask of this data? 239 00:09:30.000 --> 00:09:31.050 I could think of a bunch 240 00:09:31.050 --> 00:09:33.420 of questions off the top of my head. 241 00:09:33.420 --> 00:09:38.310 I could ask questions like, which are the better movies, 242 00:09:38.310 --> 00:09:40.320 longer movies, or shorter movies? 243 00:09:40.320 --> 00:09:41.520 What does runtime have to do 244 00:09:41.520 --> 00:09:43.890 with quality in terms of the score? 245 00:09:43.890 --> 00:09:45.060 I could ask, 246 00:09:45.060 --> 00:09:48.210 are movies from different years or decades 247 00:09:48.210 --> 00:09:50.400 scored higher or lower? 248 00:09:50.400 --> 00:09:51.450 I could ask questions like 249 00:09:51.450 --> 00:09:54.900 which genres lead to higher or lower scores? 250 00:09:54.900 --> 00:09:57.540 I could ask, are the META score 251 00:09:57.540 --> 00:09:59.880 and the IMDB rating two different ways 252 00:09:59.880 --> 00:10:02.220 of looking at the quality of this movie? 253 00:10:02.220 --> 00:10:03.330 Are those correlated? 254 00:10:03.330 --> 00:10:05.070 Or is there a difference between those? 255 00:10:05.070 --> 00:10:06.360 How about what directors and/or 256 00:10:06.360 --> 00:10:09.030 actors appear in the most popular movies? 257 00:10:09.030 --> 00:10:11.280 Or how many of them appear more than once? 258 00:10:11.280 --> 00:10:12.570 I could ask questions like 259 00:10:12.570 --> 00:10:14.460 does the number of votes correlate 260 00:10:14.460 --> 00:10:16.680 with that IMDB score?
261 00:10:16.680 --> 00:10:18.420 I could ask questions about profitability, 262 00:10:18.420 --> 00:10:20.190 well, gross, not profitability, 263 00:10:20.190 --> 00:10:21.120 'cause I don't know what the costs are, 264 00:10:21.120 --> 00:10:22.620 but at least gross revenues, 265 00:10:22.620 --> 00:10:26.220 are those correlated with quality in any which way? 266 00:10:26.220 --> 00:10:28.470 Those are just basic questions off the top of my head. 267 00:10:28.470 --> 00:10:31.620 So easy to come up with. 268 00:10:31.620 --> 00:10:33.180 And I'm sure there are more nuanced questions 269 00:10:33.180 --> 00:10:35.250 we could explore as well, right? 270 00:10:35.250 --> 00:10:39.480 So this is what we need to do when we're analyzing data, 271 00:10:39.480 --> 00:10:42.870 is ask questions of our data like those. 272 00:10:42.870 --> 00:10:43.703 And by the way 273 00:10:43.703 --> 00:10:46.440 we may find once we've answered those questions 274 00:10:46.440 --> 00:10:48.420 that we have further questions. 275 00:10:48.420 --> 00:10:50.700 And in fact, when I introduce the analysis 276 00:10:50.700 --> 00:10:52.500 that I have done with this dataset, 277 00:10:52.500 --> 00:10:55.530 later on you'll notice I discovered an answer 278 00:10:55.530 --> 00:10:57.870 to one of those questions that I just mentioned 279 00:10:57.870 --> 00:11:00.780 and it led me to further questions, okay? 280 00:11:00.780 --> 00:11:03.900 So that's getting ready to do data analytics. 281 00:11:03.900 --> 00:11:06.750 And before we can move on, 282 00:11:06.750 --> 00:11:09.690 and we're gonna do that as another video in a second, 283 00:11:09.690 --> 00:11:13.260 you have to realize that, you know what, data is garbage. 284 00:11:13.260 --> 00:11:14.093 It sucks. 285 00:11:14.093 --> 00:11:15.930 It's always a mess, okay? 286 00:11:15.930 --> 00:11:18.840 Which means we always have to clean it up.
287 00:11:18.840 --> 00:11:20.970 We always, I guess shouldn't say always, 288 00:11:20.970 --> 00:11:24.957 99.9999999% of the time, 289 00:11:24.957 --> 00:11:26.220 you have to mess with your data 290 00:11:26.220 --> 00:11:27.930 before you can actually analyze it. 291 00:11:27.930 --> 00:11:30.093 So that's what we're gonna do next.
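Editor's sketch (not part of the recording): the exercise above asks how questions about the movie data turn into specific fields and metrics. The Python below is one possible way to do that for two of the questions from the lecture, "what does runtime have to do with the score?" and "which genres lead to higher or lower scores?". It assumes toy rows invented purely for illustration; the field names (runtime_min, genre, imdb_rating, meta_score) and helper names are hypothetical, not the real dataset's column names, and the Pearson correlation is just one reasonable metric, not necessarily the one used in the course.

```python
from math import sqrt

# Toy rows invented for illustration only; they are NOT taken from the real
# Kaggle file. Field names are assumptions modeled on the columns described
# in the lecture (runtime, comma-delimited genre, IMDB rating, META score).
MOVIES = [
    {"title": "Movie A", "runtime_min": 100, "genre": "Drama", "imdb_rating": 8.0, "meta_score": 70},
    {"title": "Movie B", "runtime_min": 120, "genre": "Drama, Crime", "imdb_rating": 8.5, "meta_score": 80},
    {"title": "Movie C", "runtime_min": 140, "genre": "Action, Crime", "imdb_rating": 9.0, "meta_score": 90},
    {"title": "Movie D", "runtime_min": 160, "genre": "Action", "imdb_rating": 8.5, "meta_score": 85},
]


def column(rows, field):
    """Pull one field out of every row, giving a plain list of values."""
    return [row[field] for row in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_rating_by_genre(rows):
    """Average IMDB rating per genre, splitting comma-delimited genre strings.

    A movie tagged "Drama, Crime" counts once toward each of its genres.
    """
    totals = {}
    for row in rows:
        for genre in row["genre"].split(","):
            genre = genre.strip()
            total, count = totals.get(genre, (0.0, 0))
            totals[genre] = (total + row["imdb_rating"], count + 1)
    return {genre: total / count for genre, (total, count) in totals.items()}


if __name__ == "__main__":
    # Question: what does runtime have to do with the score?
    print("runtime vs. rating:", pearson(column(MOVIES, "runtime_min"),
                                         column(MOVIES, "imdb_rating")))
    # Question: do the META score and the IMDB rating move together?
    print("META vs. rating:", pearson(column(MOVIES, "meta_score"),
                                      column(MOVIES, "imdb_rating")))
    # Question: which genres lead to higher or lower scores?
    print("mean rating by genre:", mean_rating_by_genre(MOVIES))
```

Each answer here should prompt a follow-up in the spirit of the 5 FUs: a correlation on four made-up rows says nothing by itself, and the real file would first need the cleanup discussed next.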