WEBVTT 1 00:00:01.140 --> 00:00:03.540 All right, so when we ended the last video 2 00:00:03.540 --> 00:00:04.800 I talked about data cleaning. 3 00:00:04.800 --> 00:00:07.770 And your data is an absolute mess, 4 00:00:07.770 --> 00:00:10.320 and it is, you need to clean it up. 5 00:00:10.320 --> 00:00:14.190 And cleaning, data cleaning comes by many different terms. 6 00:00:14.190 --> 00:00:17.070 Data hygiene, data scrubbing, all kinds of things. 7 00:00:17.070 --> 00:00:18.780 Listen, like I said before, 8 00:00:18.780 --> 00:00:20.190 I'm gonna say it's slightly differently, 9 00:00:20.190 --> 00:00:21.300 your data's a dumpster fire. 10 00:00:21.300 --> 00:00:24.690 It always is a mess, okay, to varying degrees, 11 00:00:24.690 --> 00:00:26.130 but we gotta clean it up. 12 00:00:26.130 --> 00:00:27.930 And what does that mean? 13 00:00:27.930 --> 00:00:30.570 What are these data cleaning things we're gonna do? 14 00:00:30.570 --> 00:00:32.070 Well, as it says here 15 00:00:32.070 --> 00:00:35.880 it's steps to fix incorrect, inconsistent 16 00:00:35.880 --> 00:00:38.670 and/or incomplete data. 17 00:00:38.670 --> 00:00:40.230 So what are those steps? 18 00:00:40.230 --> 00:00:43.980 Well, we gotta start off by confirming the authenticity 19 00:00:43.980 --> 00:00:47.580 and/or the quality of the data source. 20 00:00:47.580 --> 00:00:49.530 Now I got this data that we're looking at, 21 00:00:49.530 --> 00:00:52.350 the IMDB dataset from Kaggle. 22 00:00:52.350 --> 00:00:55.020 I have not confirmed the authenticity 23 00:00:55.020 --> 00:00:57.330 or the quality of that dataset. 24 00:00:57.330 --> 00:01:00.180 Now we're just using it as an example dataset 25 00:01:00.180 --> 00:01:02.160 so we can get away with it, doesn't really matter 26 00:01:02.160 --> 00:01:03.600 if it's accurate or not, but boy, 27 00:01:03.600 --> 00:01:06.510 if I was gonna publish something based on that dataset 28 00:01:06.510 --> 00:01:09.663 I would do a much more careful job validating it. 29 00:01:10.500 --> 00:01:14.370 I also always, always, always 30 00:01:14.370 --> 00:01:17.760 will start by making a backup copy of that dataset 31 00:01:17.760 --> 00:01:20.670 before I do anything 'cause in data cleaning, 32 00:01:20.670 --> 00:01:22.890 I'm gonna be messing with it a lot. 33 00:01:22.890 --> 00:01:25.800 Deleting stuff, changing stuff potentially, 34 00:01:25.800 --> 00:01:28.230 merging, tweaking. 35 00:01:28.230 --> 00:01:30.480 I gotta have the original pure version 36 00:01:30.480 --> 00:01:32.730 in case I need to revert back to it. 37 00:01:32.730 --> 00:01:36.330 So always seriously, trust me, make a backup copy. 38 00:01:36.330 --> 00:01:39.710 It can be as simple as, take your spreadsheet 39 00:01:39.710 --> 00:01:42.990 if you're working in Excel, duplicate the worksheet, 40 00:01:42.990 --> 00:01:45.510 call the original one, pure, original, you know, 41 00:01:45.510 --> 00:01:48.240 untouched data, call the other one working data 42 00:01:48.240 --> 00:01:51.030 and never look at the pure original one ever again. 43 00:01:51.030 --> 00:01:54.720 Only do your analysis using the working data. 44 00:01:54.720 --> 00:01:58.530 Okay, we also, often in data cleaning, 45 00:01:58.530 --> 00:02:01.980 can just throw out extraneous data. 46 00:02:01.980 --> 00:02:04.380 Maybe we have a column that we're not gonna use. 47 00:02:04.380 --> 00:02:08.160 In the IMDB dataset there's that descriptive column, 48 00:02:08.160 --> 00:02:09.570 the overview column. 49 00:02:09.570 --> 00:02:12.240 I'm not gonna use that, I'm not doing any text analysis. 50 00:02:12.240 --> 00:02:14.280 I could throw that out if I wanted to. 51 00:02:14.280 --> 00:02:16.560 So always look to do that as part of your process. 52 00:02:16.560 --> 00:02:19.590 Why have it in your face to deal with it? 53 00:02:19.590 --> 00:02:20.940 We also want to check for things 54 00:02:20.940 --> 00:02:25.560 like standardized syntax and format consistencies. 55 00:02:25.560 --> 00:02:28.740 Things like capitalized cities or zips, 56 00:02:28.740 --> 00:02:33.660 zip codes with unallowed values, 57 00:02:33.660 --> 00:02:35.250 like in the U.S. it's either five digit 58 00:02:35.250 --> 00:02:37.020 or nine digit number, end of discussion. 59 00:02:37.020 --> 00:02:38.100 If you see letters in there, 60 00:02:38.100 --> 00:02:39.660 if you see something other number 61 00:02:39.660 --> 00:02:42.240 like a seven digit zip code, that's bad. 62 00:02:42.240 --> 00:02:44.580 You gotta figure out what you're gonna do about it. 63 00:02:44.580 --> 00:02:46.380 Brings up an important point. 64 00:02:46.380 --> 00:02:49.500 Data cleaning is two parts. 65 00:02:49.500 --> 00:02:52.890 You gotta notice the issue like a seven digit zip code. 66 00:02:52.890 --> 00:02:57.510 Okay, I see that, now we're gonna do about it? 67 00:02:57.510 --> 00:02:59.100 What are you gonna do about it is gonna vary. 68 00:02:59.100 --> 00:03:01.350 You may delete the data, it's just bad, right? 69 00:03:01.350 --> 00:03:03.840 You may change it, you may go research 70 00:03:03.840 --> 00:03:06.630 and look it up and replace it with the right value, right? 71 00:03:06.630 --> 00:03:07.740 Or if it's a numeric thing 72 00:03:07.740 --> 00:03:09.960 maybe you can calculate some fair replacement value 73 00:03:09.960 --> 00:03:11.940 like an average of some kind 74 00:03:11.940 --> 00:03:13.440 versus whatever number you see in there. 75 00:03:13.440 --> 00:03:15.060 Or maybe it's missing a number 76 00:03:15.060 --> 00:03:16.050 et cetera, et cetera, et cetera. 77 00:03:16.050 --> 00:03:20.130 So it depends in a huge way what you do. 78 00:03:20.130 --> 00:03:21.600 But first you gotta notice it 79 00:03:21.600 --> 00:03:23.760 and then you can figure out what to do about it. 80 00:03:23.760 --> 00:03:26.340 You can also look for outliers in your dataset. 81 00:03:26.340 --> 00:03:29.340 Maybe I have in this top 1000 movies 82 00:03:29.340 --> 00:03:32.190 one movie that has a meta score of three 83 00:03:32.190 --> 00:03:34.470 and all the other ones are like seven or above. 84 00:03:34.470 --> 00:03:36.090 Well, that's weird, that's an outlier. 85 00:03:36.090 --> 00:03:39.360 Is that A, a mistake, maybe I just delete it, 86 00:03:39.360 --> 00:03:41.460 B, does that mean like every other movie 87 00:03:41.460 --> 00:03:44.790 aside from the top 1000 has meta scores below three? 88 00:03:44.790 --> 00:03:46.200 Like you're gonna look into it 89 00:03:46.200 --> 00:03:48.840 before you decide what to do about it. 90 00:03:48.840 --> 00:03:50.640 You can also check for completeness 91 00:03:50.640 --> 00:03:53.460 like missing fields, missing values, 92 00:03:53.460 --> 00:03:55.650 and again, figure out what to do about it 93 00:03:55.650 --> 00:03:58.380 after you first notice the problem. 94 00:03:58.380 --> 00:04:01.320 So when we are data cleaning 95 00:04:01.320 --> 00:04:03.570 one of the very first techniques we can use 96 00:04:03.570 --> 00:04:06.540 to do it is to start with sorting and filtering. 97 00:04:06.540 --> 00:04:09.540 Okay, sort every column you have, 98 00:04:09.540 --> 00:04:12.000 you're looking for weird things, right? 99 00:04:12.000 --> 00:04:13.890 Sort from largest to smallest 100 00:04:13.890 --> 00:04:17.550 and smallest to largest and every numeric field 101 00:04:17.550 --> 00:04:19.440 and you may notice, oh there's a bunch of zeros 102 00:04:19.440 --> 00:04:21.240 in that field, but most of the numbers 103 00:04:21.240 --> 00:04:22.890 are supposed to be like a thousand or above. 104 00:04:22.890 --> 00:04:24.450 That's weird, right? 105 00:04:24.450 --> 00:04:27.900 Or a bunch of blanks or a bunch of really high numbers 106 00:04:27.900 --> 00:04:30.180 that don't seem right, et cetera, et cetera. 107 00:04:30.180 --> 00:04:31.890 Sorting, as simple as it is 108 00:04:31.890 --> 00:04:35.790 is a great way to start and definitely filtering. 109 00:04:35.790 --> 00:04:37.770 You can filter, use the filter button. 110 00:04:37.770 --> 00:04:39.060 In fact, one of the first things I do 111 00:04:39.060 --> 00:04:42.240 in any dataset is I enable filtering in Excel 112 00:04:42.240 --> 00:04:45.450 in order to be able to see what things I can turn on 113 00:04:45.450 --> 00:04:47.310 and off, filter the data down 114 00:04:47.310 --> 00:04:50.790 and get a sense of that segmentation, get a sense of, 115 00:04:50.790 --> 00:04:54.060 by the way, how many unique values are in a column. 116 00:04:54.060 --> 00:04:57.000 If I have a column a thousand rows deep 117 00:04:57.000 --> 00:04:59.550 I don't know what the unique values are, 118 00:04:59.550 --> 00:05:00.660 but by enabling filtering 119 00:05:00.660 --> 00:05:02.370 and popping open that little window, 120 00:05:02.370 --> 00:05:04.860 I can see how many options there are in there. 121 00:05:04.860 --> 00:05:07.110 And we'll look at that a little bit later on. 122 00:05:07.110 --> 00:05:12.110 So we start off, we make a backup copy of our data, 123 00:05:12.570 --> 00:05:15.960 then we can actually start to look into it. 124 00:05:15.960 --> 00:05:19.710 And by the way, as we are doing our data cleaning 125 00:05:19.710 --> 00:05:23.820 you want to document every single decision, 126 00:05:23.820 --> 00:05:25.830 every single thing you do. 127 00:05:25.830 --> 00:05:28.590 I deleted this column, you're gonna write that down. 128 00:05:28.590 --> 00:05:32.407 Add a notes tab to your spreadsheet that says, 129 00:05:32.407 --> 00:05:34.320 "Step one deleted column." 130 00:05:34.320 --> 00:05:38.730 You know, because I'm not using those descriptors. 131 00:05:38.730 --> 00:05:42.330 I imputed, I replaced the value for, you know, 132 00:05:42.330 --> 00:05:45.600 the year released, you know, for "Shawshank Redemption" 133 00:05:45.600 --> 00:05:48.210 because it was wrong, things like that. 134 00:05:48.210 --> 00:05:51.240 Write it down, because you need to remember what you did, 135 00:05:51.240 --> 00:05:53.070 you may need to repeat what you did, 136 00:05:53.070 --> 00:05:54.510 and you're gonna need to explain 137 00:05:54.510 --> 00:05:57.930 and defend what you did to other people down the the road. 138 00:05:57.930 --> 00:06:00.030 This is a great resource from Quartz, 139 00:06:00.030 --> 00:06:03.180 a good journalistic publication 140 00:06:03.180 --> 00:06:04.530 about what to do with bad data. 141 00:06:04.530 --> 00:06:07.470 Definitely check it out for more information about this. 142 00:06:07.470 --> 00:06:12.470 Now thinking about the IMDB data, let's ask ourselves 143 00:06:12.810 --> 00:06:16.920 what do we notice, 'cause remember step one, notice it. 144 00:06:16.920 --> 00:06:18.600 Then step two, we'll figure out what to do about it. 145 00:06:18.600 --> 00:06:20.040 So what do we notice 146 00:06:20.040 --> 00:06:23.460 in the IMDB dataset that maybe needs some tweaking? 147 00:06:23.460 --> 00:06:28.080 And we gotta do this with our data literacy hats on okay. 148 00:06:28.080 --> 00:06:29.850 So let's look at the dataset. 149 00:06:29.850 --> 00:06:33.210 Series Title, I probably want the title 150 00:06:33.210 --> 00:06:35.127 even though I may not analyze the title 151 00:06:35.127 --> 00:06:37.620 I'm probably gonna want those titles to look at 152 00:06:37.620 --> 00:06:39.150 'cause I know some of these movies 153 00:06:39.150 --> 00:06:40.920 so I'm gonna hold onto that one. 154 00:06:40.920 --> 00:06:44.040 Released year, now, (claps hands) 155 00:06:44.040 --> 00:06:46.140 Released Year, that seems like a column 156 00:06:46.140 --> 00:06:48.000 I'm gonna want to keep. 157 00:06:48.000 --> 00:06:51.690 But outta curiosity, I said maybe we should sort 158 00:06:51.690 --> 00:06:54.690 let me sort, smallest to, well, 159 00:06:54.690 --> 00:06:57.240 so the oldest movie in this dataset is 1920. 160 00:06:57.240 --> 00:06:59.880 That's interesting, I didn't know that until I sorted, 161 00:06:59.880 --> 00:07:01.533 what's the most recent one? 162 00:07:02.370 --> 00:07:05.010 Oh, PG, what? 163 00:07:05.010 --> 00:07:07.110 That's, that doesn't make any sense. 164 00:07:07.110 --> 00:07:09.300 I already have a data cleaning task, right? 165 00:07:09.300 --> 00:07:14.100 So I need to fix that or do something about that, 166 00:07:14.100 --> 00:07:15.720 we'll see about that. 167 00:07:15.720 --> 00:07:18.840 Runtime, that's good to have. 168 00:07:18.840 --> 00:07:22.080 It's kind of a weird one, data literacy hat. 169 00:07:22.080 --> 00:07:24.600 It's numbers and text. 170 00:07:24.600 --> 00:07:27.930 140 space M-I-N. 171 00:07:27.930 --> 00:07:30.240 Maybe I'm gonna want to do something about that 172 00:07:30.240 --> 00:07:31.470 so that I have numbers in there 173 00:07:31.470 --> 00:07:34.020 so I can do math with those numbers. 174 00:07:34.020 --> 00:07:35.610 So right now I'm just identifying issues, 175 00:07:35.610 --> 00:07:38.040 I'm not gonna fix them and we'll fix those later. 176 00:07:38.040 --> 00:07:42.060 Genre, like I mentioned before, we have three genres 177 00:07:42.060 --> 00:07:43.500 or sometimes it's not three 178 00:07:43.500 --> 00:07:47.250 sometimes it's one or two, separated by columns. 179 00:07:47.250 --> 00:07:50.970 So what if I want to figure out which genres are the best, 180 00:07:50.970 --> 00:07:53.340 have the highest scores, can I do that 181 00:07:53.340 --> 00:07:55.290 if they're all in one column? 182 00:07:55.290 --> 00:07:56.550 There's probably a way to do that. 183 00:07:56.550 --> 00:07:58.740 But I think I'm gonna want to do something else with this, 184 00:07:58.740 --> 00:08:00.000 we'll talk about that. 185 00:08:00.000 --> 00:08:02.730 IMDB Rating, numeric, it's all good. 186 00:08:02.730 --> 00:08:04.650 Let me just sort that one outta curiosity. 187 00:08:04.650 --> 00:08:05.940 I didn't sort the Runtime 188 00:08:05.940 --> 00:08:08.460 but I probably would under normal circumstances. 189 00:08:08.460 --> 00:08:11.100 Sorting, whoops, decreasing. 190 00:08:11.100 --> 00:08:12.690 Yeah, so at least those are all numbers. 191 00:08:12.690 --> 00:08:14.190 Overview, like I said, maybe I'll delete that. 192 00:08:14.190 --> 00:08:15.300 We'll figure it out. 193 00:08:15.300 --> 00:08:18.780 Meta score, so if I sort the Meta score 194 00:08:18.780 --> 00:08:22.140 from smallest to largest, okay, there's a 28, 195 00:08:22.140 --> 00:08:24.630 and how about the largest is a hundred, 196 00:08:24.630 --> 00:08:26.640 but by the way, I didn't enable filtering yet. 197 00:08:26.640 --> 00:08:31.350 Let me actually enable filtering, which brings me this. 198 00:08:31.350 --> 00:08:33.360 And if I pop that open, look what happens. 199 00:08:33.360 --> 00:08:36.600 I can see how many different meta scores there are. 200 00:08:36.600 --> 00:08:38.250 And by the way, what do I notice? 201 00:08:38.250 --> 00:08:42.360 Ooh, blanks, interesting. 202 00:08:42.360 --> 00:08:43.800 So let me filter. 203 00:08:43.800 --> 00:08:48.800 Let me unselect all and let me just select blanks. 204 00:08:49.020 --> 00:08:51.780 I wanna know how many blanks I have. 205 00:08:51.780 --> 00:08:55.210 And so now by doing that, what I can see is 206 00:08:56.670 --> 00:09:00.000 Let's see how many rows I have that are blank. 207 00:09:00.000 --> 00:09:05.000 So if I shift, select down, good to know Excel shortcuts, 208 00:09:06.030 --> 00:09:09.390 I select all the rows that are now not filtered out 209 00:09:09.390 --> 00:09:13.440 and I can see the count right down here, 157 rows. 210 00:09:13.440 --> 00:09:18.060 I have 157 values or rows of data in this dataset 211 00:09:18.060 --> 00:09:20.250 that don't have that meta score. 212 00:09:20.250 --> 00:09:23.130 So how are they the top 1000, maybe the top 1000 213 00:09:23.130 --> 00:09:26.400 is based on the IMDB rating I guess, I don't know. 214 00:09:26.400 --> 00:09:28.200 I'll look into that maybe later on. 215 00:09:28.200 --> 00:09:30.270 I also by the way, data cleaning, may notice that some 216 00:09:30.270 --> 00:09:31.200 of these movie titles 217 00:09:31.200 --> 00:09:33.600 have these weird characters showing up. 218 00:09:33.600 --> 00:09:36.750 This is probably just a character translation issue. 219 00:09:36.750 --> 00:09:39.210 I'm not gonna worry about that today in a normal dataset. 220 00:09:39.210 --> 00:09:41.490 I would definitely be concerned about that. 221 00:09:41.490 --> 00:09:45.390 I have my different actors, directors in different columns. 222 00:09:45.390 --> 00:09:46.590 I'm fine with that for now, 223 00:09:46.590 --> 00:09:48.930 although you'll see later I do something about that. 224 00:09:48.930 --> 00:09:50.640 Number of Votes. 225 00:09:50.640 --> 00:09:55.560 I noticed that if I sort, I have 25,000 votes on the low end 226 00:09:55.560 --> 00:09:58.710 and I have 300,000 something votes on the high end, okay? 227 00:09:58.710 --> 00:10:00.150 I'm not sure what to make about that. 228 00:10:00.150 --> 00:10:02.430 I also noticed by the way, under Gross, 229 00:10:02.430 --> 00:10:03.870 that we have a lot of movies 230 00:10:03.870 --> 00:10:05.760 where I don't have the gross revenues. 231 00:10:05.760 --> 00:10:09.780 Okay, so a quick scan through this data 232 00:10:09.780 --> 00:10:10.920 I learned a couple things. 233 00:10:10.920 --> 00:10:13.530 I certainly noticed a couple of issues. 234 00:10:13.530 --> 00:10:14.580 Long story short, 235 00:10:14.580 --> 00:10:17.310 we're gonna work on cleaning it up together, 236 00:10:17.310 --> 00:10:19.010 which is what we're gonna do next.