WEBVTT

1
00:00:01.140 --> 00:00:03.540
<v ->All right, so when we ended the last video</v>

2
00:00:03.540 --> 00:00:04.800
I talked about data cleaning.

3
00:00:04.800 --> 00:00:07.770
And your data is an absolute mess,

4
00:00:07.770 --> 00:00:10.320
and it is, you need to clean it up.

5
00:00:10.320 --> 00:00:14.190
And cleaning, data cleaning comes by many different terms.

6
00:00:14.190 --> 00:00:17.070
Data hygiene, data scrubbing, all kinds of things.

7
00:00:17.070 --> 00:00:18.780
Listen, like I said before,

8
00:00:18.780 --> 00:00:20.190
I'm gonna say it's slightly differently,

9
00:00:20.190 --> 00:00:21.300
your data's a dumpster fire.

10
00:00:21.300 --> 00:00:24.690
It always is a mess, okay, to varying degrees,

11
00:00:24.690 --> 00:00:26.130
but we gotta clean it up.

12
00:00:26.130 --> 00:00:27.930
And what does that mean?

13
00:00:27.930 --> 00:00:30.570
What are these data cleaning things we're gonna do?

14
00:00:30.570 --> 00:00:32.070
Well, as it says here

15
00:00:32.070 --> 00:00:35.880
it's steps to fix incorrect, inconsistent

16
00:00:35.880 --> 00:00:38.670
and/or incomplete data.

17
00:00:38.670 --> 00:00:40.230
So what are those steps?

18
00:00:40.230 --> 00:00:43.980
Well, we gotta start off by confirming the authenticity

19
00:00:43.980 --> 00:00:47.580
and/or the quality of the data source.

20
00:00:47.580 --> 00:00:49.530
Now I got this data that we're looking at,

21
00:00:49.530 --> 00:00:52.350
the IMDB dataset from Kaggle.

22
00:00:52.350 --> 00:00:55.020
I have not confirmed the authenticity

23
00:00:55.020 --> 00:00:57.330
or the quality of that dataset.

24
00:00:57.330 --> 00:01:00.180
Now we're just using it as an example dataset

25
00:01:00.180 --> 00:01:02.160
so we can get away with it, doesn't really matter

26
00:01:02.160 --> 00:01:03.600
if it's accurate or not, but boy,

27
00:01:03.600 --> 00:01:06.510
if I was gonna publish something based on that dataset

28
00:01:06.510 --> 00:01:09.663
I would do a much more careful job validating it.

29
00:01:10.500 --> 00:01:14.370
I also always, always, always

30
00:01:14.370 --> 00:01:17.760
will start by making a backup copy of that dataset

31
00:01:17.760 --> 00:01:20.670
before I do anything 'cause in data cleaning,

32
00:01:20.670 --> 00:01:22.890
I'm gonna be messing with it a lot.

33
00:01:22.890 --> 00:01:25.800
Deleting stuff, changing stuff potentially,

34
00:01:25.800 --> 00:01:28.230
merging, tweaking.

35
00:01:28.230 --> 00:01:30.480
I gotta have the original pure version

36
00:01:30.480 --> 00:01:32.730
in case I need to revert back to it.

37
00:01:32.730 --> 00:01:36.330
So always seriously, trust me, make a backup copy.

38
00:01:36.330 --> 00:01:39.710
It can be as simple as, take your spreadsheet

39
00:01:39.710 --> 00:01:42.990
if you're working in Excel, duplicate the worksheet,

40
00:01:42.990 --> 00:01:45.510
call the original one, pure, original, you know,

41
00:01:45.510 --> 00:01:48.240
untouched data, call the other one working data

42
00:01:48.240 --> 00:01:51.030
and never look at the pure original one ever again.

43
00:01:51.030 --> 00:01:54.720
Only do your analysis using the working data.

44
00:01:54.720 --> 00:01:58.530
Okay, we also, often in data cleaning,

45
00:01:58.530 --> 00:02:01.980
can just throw out extraneous data.

46
00:02:01.980 --> 00:02:04.380
Maybe we have a column that we're not gonna use.

47
00:02:04.380 --> 00:02:08.160
In the IMDB dataset there's that descriptive column,

48
00:02:08.160 --> 00:02:09.570
the overview column.

49
00:02:09.570 --> 00:02:12.240
I'm not gonna use that, I'm not doing any text analysis.

50
00:02:12.240 --> 00:02:14.280
I could throw that out if I wanted to.

51
00:02:14.280 --> 00:02:16.560
So always look to do that as part of your process.

52
00:02:16.560 --> 00:02:19.590
Why have it in your face to deal with it?

53
00:02:19.590 --> 00:02:20.940
We also want to check for things

54
00:02:20.940 --> 00:02:25.560
like standardized syntax and format consistencies.

55
00:02:25.560 --> 00:02:28.740
Things like capitalized cities or zips,

56
00:02:28.740 --> 00:02:33.660
zip codes with unallowed values,

57
00:02:33.660 --> 00:02:35.250
like in the U.S. it's either five digit

58
00:02:35.250 --> 00:02:37.020
or nine digit number, end of discussion.

59
00:02:37.020 --> 00:02:38.100
If you see letters in there,

60
00:02:38.100 --> 00:02:39.660
if you see something other number

61
00:02:39.660 --> 00:02:42.240
like a seven digit zip code, that's bad.

62
00:02:42.240 --> 00:02:44.580
You gotta figure out what you're gonna do about it.

63
00:02:44.580 --> 00:02:46.380
Brings up an important point.

64
00:02:46.380 --> 00:02:49.500
Data cleaning is two parts.

65
00:02:49.500 --> 00:02:52.890
You gotta notice the issue like a seven digit zip code.

66
00:02:52.890 --> 00:02:57.510
Okay, I see that, now we're gonna do about it?

67
00:02:57.510 --> 00:02:59.100
What are you gonna do about it is gonna vary.

68
00:02:59.100 --> 00:03:01.350
You may delete the data, it's just bad, right?

69
00:03:01.350 --> 00:03:03.840
You may change it, you may go research

70
00:03:03.840 --> 00:03:06.630
and look it up and replace it with the right value, right?

71
00:03:06.630 --> 00:03:07.740
Or if it's a numeric thing

72
00:03:07.740 --> 00:03:09.960
maybe you can calculate some fair replacement value

73
00:03:09.960 --> 00:03:11.940
like an average of some kind

74
00:03:11.940 --> 00:03:13.440
versus whatever number you see in there.

75
00:03:13.440 --> 00:03:15.060
Or maybe it's missing a number

76
00:03:15.060 --> 00:03:16.050
et cetera, et cetera, et cetera.

77
00:03:16.050 --> 00:03:20.130
So it depends in a huge way what you do.

78
00:03:20.130 --> 00:03:21.600
But first you gotta notice it

79
00:03:21.600 --> 00:03:23.760
and then you can figure out what to do about it.

80
00:03:23.760 --> 00:03:26.340
You can also look for outliers in your dataset.

81
00:03:26.340 --> 00:03:29.340
Maybe I have in this top 1000 movies

82
00:03:29.340 --> 00:03:32.190
one movie that has a meta score of three

83
00:03:32.190 --> 00:03:34.470
and all the other ones are like seven or above.

84
00:03:34.470 --> 00:03:36.090
Well, that's weird, that's an outlier.

85
00:03:36.090 --> 00:03:39.360
Is that A, a mistake, maybe I just delete it,

86
00:03:39.360 --> 00:03:41.460
B, does that mean like every other movie

87
00:03:41.460 --> 00:03:44.790
aside from the top 1000 has meta scores below three?

88
00:03:44.790 --> 00:03:46.200
Like you're gonna look into it

89
00:03:46.200 --> 00:03:48.840
before you decide what to do about it.

90
00:03:48.840 --> 00:03:50.640
You can also check for completeness

91
00:03:50.640 --> 00:03:53.460
like missing fields, missing values,

92
00:03:53.460 --> 00:03:55.650
and again, figure out what to do about it

93
00:03:55.650 --> 00:03:58.380
after you first notice the problem.

94
00:03:58.380 --> 00:04:01.320
So when we are data cleaning

95
00:04:01.320 --> 00:04:03.570
one of the very first techniques we can use

96
00:04:03.570 --> 00:04:06.540
to do it is to start with sorting and filtering.

97
00:04:06.540 --> 00:04:09.540
Okay, sort every column you have,

98
00:04:09.540 --> 00:04:12.000
you're looking for weird things, right?

99
00:04:12.000 --> 00:04:13.890
Sort from largest to smallest

100
00:04:13.890 --> 00:04:17.550
and smallest to largest and every numeric field

101
00:04:17.550 --> 00:04:19.440
and you may notice, oh there's a bunch of zeros

102
00:04:19.440 --> 00:04:21.240
in that field, but most of the numbers

103
00:04:21.240 --> 00:04:22.890
are supposed to be like a thousand or above.

104
00:04:22.890 --> 00:04:24.450
That's weird, right?

105
00:04:24.450 --> 00:04:27.900
Or a bunch of blanks or a bunch of really high numbers

106
00:04:27.900 --> 00:04:30.180
that don't seem right, et cetera, et cetera.

107
00:04:30.180 --> 00:04:31.890
Sorting, as simple as it is

108
00:04:31.890 --> 00:04:35.790
is a great way to start and definitely filtering.

109
00:04:35.790 --> 00:04:37.770
You can filter, use the filter button.

110
00:04:37.770 --> 00:04:39.060
In fact, one of the first things I do

111
00:04:39.060 --> 00:04:42.240
in any dataset is I enable filtering in Excel

112
00:04:42.240 --> 00:04:45.450
in order to be able to see what things I can turn on

113
00:04:45.450 --> 00:04:47.310
and off, filter the data down

114
00:04:47.310 --> 00:04:50.790
and get a sense of that segmentation, get a sense of,

115
00:04:50.790 --> 00:04:54.060
by the way, how many unique values are in a column.

116
00:04:54.060 --> 00:04:57.000
If I have a column a thousand rows deep

117
00:04:57.000 --> 00:04:59.550
I don't know what the unique values are,

118
00:04:59.550 --> 00:05:00.660
but by enabling filtering

119
00:05:00.660 --> 00:05:02.370
and popping open that little window,

120
00:05:02.370 --> 00:05:04.860
I can see how many options there are in there.

121
00:05:04.860 --> 00:05:07.110
And we'll look at that a little bit later on.

122
00:05:07.110 --> 00:05:12.110
So we start off, we make a backup copy of our data,

123
00:05:12.570 --> 00:05:15.960
then we can actually start to look into it.

124
00:05:15.960 --> 00:05:19.710
And by the way, as we are doing our data cleaning

125
00:05:19.710 --> 00:05:23.820
you want to document every single decision,

126
00:05:23.820 --> 00:05:25.830
every single thing you do.

127
00:05:25.830 --> 00:05:28.590
I deleted this column, you're gonna write that down.

128
00:05:28.590 --> 00:05:32.407
Add a notes tab to your spreadsheet that says,

129
00:05:32.407 --> 00:05:34.320
"Step one deleted column."

130
00:05:34.320 --> 00:05:38.730
You know, because I'm not using those descriptors.

131
00:05:38.730 --> 00:05:42.330
I imputed, I replaced the value for, you know,

132
00:05:42.330 --> 00:05:45.600
the year released, you know, for "Shawshank Redemption"

133
00:05:45.600 --> 00:05:48.210
because it was wrong, things like that.

134
00:05:48.210 --> 00:05:51.240
Write it down, because you need to remember what you did,

135
00:05:51.240 --> 00:05:53.070
you may need to repeat what you did,

136
00:05:53.070 --> 00:05:54.510
and you're gonna need to explain

137
00:05:54.510 --> 00:05:57.930
and defend what you did to other people down the the road.

138
00:05:57.930 --> 00:06:00.030
This is a great resource from Quartz,

139
00:06:00.030 --> 00:06:03.180
a good journalistic publication

140
00:06:03.180 --> 00:06:04.530
about what to do with bad data.

141
00:06:04.530 --> 00:06:07.470
Definitely check it out for more information about this.

142
00:06:07.470 --> 00:06:12.470
Now thinking about the IMDB data, let's ask ourselves

143
00:06:12.810 --> 00:06:16.920
what do we notice, 'cause remember step one, notice it.

144
00:06:16.920 --> 00:06:18.600
Then step two, we'll figure out what to do about it.

145
00:06:18.600 --> 00:06:20.040
So what do we notice

146
00:06:20.040 --> 00:06:23.460
in the IMDB dataset that maybe needs some tweaking?

147
00:06:23.460 --> 00:06:28.080
And we gotta do this with our data literacy hats on okay.

148
00:06:28.080 --> 00:06:29.850
So let's look at the dataset.

149
00:06:29.850 --> 00:06:33.210
Series Title, I probably want the title

150
00:06:33.210 --> 00:06:35.127
even though I may not analyze the title

151
00:06:35.127 --> 00:06:37.620
I'm probably gonna want those titles to look at

152
00:06:37.620 --> 00:06:39.150
'cause I know some of these movies

153
00:06:39.150 --> 00:06:40.920
so I'm gonna hold onto that one.

154
00:06:40.920 --> 00:06:44.040
Released year, now, (claps hands)

155
00:06:44.040 --> 00:06:46.140
Released Year, that seems like a column

156
00:06:46.140 --> 00:06:48.000
I'm gonna want to keep.

157
00:06:48.000 --> 00:06:51.690
But outta curiosity, I said maybe we should sort

158
00:06:51.690 --> 00:06:54.690
let me sort, smallest to, well,

159
00:06:54.690 --> 00:06:57.240
so the oldest movie in this dataset is 1920.

160
00:06:57.240 --> 00:06:59.880
That's interesting, I didn't know that until I sorted,

161
00:06:59.880 --> 00:07:01.533
what's the most recent one?

162
00:07:02.370 --> 00:07:05.010
Oh, PG, what?

163
00:07:05.010 --> 00:07:07.110
That's, that doesn't make any sense.

164
00:07:07.110 --> 00:07:09.300
I already have a data cleaning task, right?

165
00:07:09.300 --> 00:07:14.100
So I need to fix that or do something about that,

166
00:07:14.100 --> 00:07:15.720
we'll see about that.

167
00:07:15.720 --> 00:07:18.840
Runtime, that's good to have.

168
00:07:18.840 --> 00:07:22.080
It's kind of a weird one, data literacy hat.

169
00:07:22.080 --> 00:07:24.600
It's numbers and text.

170
00:07:24.600 --> 00:07:27.930
140 space M-I-N.

171
00:07:27.930 --> 00:07:30.240
Maybe I'm gonna want to do something about that

172
00:07:30.240 --> 00:07:31.470
so that I have numbers in there

173
00:07:31.470 --> 00:07:34.020
so I can do math with those numbers.

174
00:07:34.020 --> 00:07:35.610
So right now I'm just identifying issues,

175
00:07:35.610 --> 00:07:38.040
I'm not gonna fix them and we'll fix those later.

176
00:07:38.040 --> 00:07:42.060
Genre, like I mentioned before, we have three genres

177
00:07:42.060 --> 00:07:43.500
or sometimes it's not three

178
00:07:43.500 --> 00:07:47.250
sometimes it's one or two, separated by columns.

179
00:07:47.250 --> 00:07:50.970
So what if I want to figure out which genres are the best,

180
00:07:50.970 --> 00:07:53.340
have the highest scores, can I do that

181
00:07:53.340 --> 00:07:55.290
if they're all in one column?

182
00:07:55.290 --> 00:07:56.550
There's probably a way to do that.

183
00:07:56.550 --> 00:07:58.740
But I think I'm gonna want to do something else with this,

184
00:07:58.740 --> 00:08:00.000
we'll talk about that.

185
00:08:00.000 --> 00:08:02.730
IMDB Rating, numeric, it's all good.

186
00:08:02.730 --> 00:08:04.650
Let me just sort that one outta curiosity.

187
00:08:04.650 --> 00:08:05.940
I didn't sort the Runtime

188
00:08:05.940 --> 00:08:08.460
but I probably would under normal circumstances.

189
00:08:08.460 --> 00:08:11.100
Sorting, whoops, decreasing.

190
00:08:11.100 --> 00:08:12.690
Yeah, so at least those are all numbers.

191
00:08:12.690 --> 00:08:14.190
Overview, like I said, maybe I'll delete that.

192
00:08:14.190 --> 00:08:15.300
We'll figure it out.

193
00:08:15.300 --> 00:08:18.780
Meta score, so if I sort the Meta score

194
00:08:18.780 --> 00:08:22.140
from smallest to largest, okay, there's a 28,

195
00:08:22.140 --> 00:08:24.630
and how about the largest is a hundred,

196
00:08:24.630 --> 00:08:26.640
but by the way, I didn't enable filtering yet.

197
00:08:26.640 --> 00:08:31.350
Let me actually enable filtering, which brings me this.

198
00:08:31.350 --> 00:08:33.360
And if I pop that open, look what happens.

199
00:08:33.360 --> 00:08:36.600
I can see how many different meta scores there are.

200
00:08:36.600 --> 00:08:38.250
And by the way, what do I notice?

201
00:08:38.250 --> 00:08:42.360
Ooh, blanks, interesting.

202
00:08:42.360 --> 00:08:43.800
So let me filter.

203
00:08:43.800 --> 00:08:48.800
Let me unselect all and let me just select blanks.

204
00:08:49.020 --> 00:08:51.780
I wanna know how many blanks I have.

205
00:08:51.780 --> 00:08:55.210
And so now by doing that, what I can see is

206
00:08:56.670 --> 00:09:00.000
Let's see how many rows I have that are blank.

207
00:09:00.000 --> 00:09:05.000
So if I shift, select down, good to know Excel shortcuts,

208
00:09:06.030 --> 00:09:09.390
I select all the rows that are now not filtered out

209
00:09:09.390 --> 00:09:13.440
and I can see the count right down here, 157 rows.

210
00:09:13.440 --> 00:09:18.060
I have 157 values or rows of data in this dataset

211
00:09:18.060 --> 00:09:20.250
that don't have that meta score.

212
00:09:20.250 --> 00:09:23.130
So how are they the top 1000, maybe the top 1000

213
00:09:23.130 --> 00:09:26.400
is based on the IMDB rating I guess, I don't know.

214
00:09:26.400 --> 00:09:28.200
I'll look into that maybe later on.

215
00:09:28.200 --> 00:09:30.270
I also by the way, data cleaning, may notice that some

216
00:09:30.270 --> 00:09:31.200
of these movie titles

217
00:09:31.200 --> 00:09:33.600
have these weird characters showing up.

218
00:09:33.600 --> 00:09:36.750
This is probably just a character translation issue.

219
00:09:36.750 --> 00:09:39.210
I'm not gonna worry about that today in a normal dataset.

220
00:09:39.210 --> 00:09:41.490
I would definitely be concerned about that.

221
00:09:41.490 --> 00:09:45.390
I have my different actors, directors in different columns.

222
00:09:45.390 --> 00:09:46.590
I'm fine with that for now,

223
00:09:46.590 --> 00:09:48.930
although you'll see later I do something about that.

224
00:09:48.930 --> 00:09:50.640
Number of Votes.

225
00:09:50.640 --> 00:09:55.560
I noticed that if I sort, I have 25,000 votes on the low end

226
00:09:55.560 --> 00:09:58.710
and I have 300,000 something votes on the high end, okay?

227
00:09:58.710 --> 00:10:00.150
I'm not sure what to make about that.

228
00:10:00.150 --> 00:10:02.430
I also noticed by the way, under Gross,

229
00:10:02.430 --> 00:10:03.870
that we have a lot of movies

230
00:10:03.870 --> 00:10:05.760
where I don't have the gross revenues.

231
00:10:05.760 --> 00:10:09.780
Okay, so a quick scan through this data

232
00:10:09.780 --> 00:10:10.920
I learned a couple things.

233
00:10:10.920 --> 00:10:13.530
I certainly noticed a couple of issues.

234
00:10:13.530 --> 00:10:14.580
Long story short,

235
00:10:14.580 --> 00:10:17.310
we're gonna work on cleaning it up together,

236
00:10:17.310 --> 00:10:19.010
which is what we're gonna do next.