WEBVTT 1 00:00:00.540 --> 00:00:02.790 All right, so once we have our data all cleaned up, 2 00:00:02.790 --> 00:00:05.340 we can actually start to analyze it. 3 00:00:05.340 --> 00:00:07.200 And so let's think about some of the things we can do 4 00:00:07.200 --> 00:00:08.550 when we're doing data analytics. 5 00:00:08.550 --> 00:00:11.370 And to start, you can take the same steps 6 00:00:11.370 --> 00:00:12.600 we started with data cleaning, 7 00:00:12.600 --> 00:00:14.400 which is filtering and sorting. 8 00:00:14.400 --> 00:00:18.210 Like get an initial sense of the minimum and maximum values, 9 00:00:18.210 --> 00:00:20.340 not to identify things to delete or whatever, 10 00:00:20.340 --> 00:00:23.910 but to literally find out the range of our data set, right? 11 00:00:23.910 --> 00:00:26.640 We can also do things that look like summary statistics. 12 00:00:26.640 --> 00:00:29.790 Always a great place to start with data analysis 13 00:00:29.790 --> 00:00:33.240 is to literally get those summary values, 14 00:00:33.240 --> 00:00:37.560 minimum, maximum, average, sum, et cetera, 15 00:00:37.560 --> 00:00:39.960 of any of our numeric columns. 16 00:00:39.960 --> 00:00:41.730 And even for the non-numeric columns, 17 00:00:41.730 --> 00:00:44.100 maybe you're getting a count of things, 18 00:00:44.100 --> 00:00:47.040 a count of like true/falses or things like that. 19 00:00:47.040 --> 00:00:51.583 So, summary statistics guided by your questions, right? 20 00:00:51.583 --> 00:00:55.350 I don't need to count the number of actors 21 00:00:55.350 --> 00:00:58.260 in all these movies unless that's part of my question. 22 00:00:58.260 --> 00:01:00.660 But if one of my questions of the dataset is 23 00:01:00.660 --> 00:01:04.800 how many actors appeared in more than five of these movies? 24 00:01:04.800 --> 00:01:07.560 Then I'm gonna be looking at data like that, right? 25 00:01:07.560 --> 00:01:09.840 Always guided by your questions. 26 00:01:09.840 --> 00:01:13.430 One of the other things that I always, always do 27 00:01:13.430 --> 00:01:15.510 at the very beginning of just about 28 00:01:15.510 --> 00:01:19.560 every data analytics project 29 00:01:19.560 --> 00:01:22.803 is I like to look at distributions of my data set. 30 00:01:23.730 --> 00:01:28.730 Distributions meaning the entire distribution of the values. 31 00:01:28.890 --> 00:01:31.110 So for instance, run time. 32 00:01:31.110 --> 00:01:35.730 I have these run time numbers of these different movies. 33 00:01:35.730 --> 00:01:37.650 What is the distribution of the run times? 34 00:01:37.650 --> 00:01:40.350 I bet most of them are clustered around two hours, 35 00:01:40.350 --> 00:01:42.450 90 minutes to two hours and 15 minutes. 36 00:01:42.450 --> 00:01:44.790 I'm just gonna throw numbers out there. I don't know. 37 00:01:44.790 --> 00:01:47.340 I bet very few or more than three hours, 38 00:01:47.340 --> 00:01:51.180 and I bet very few are under say, 80 minutes. 39 00:01:51.180 --> 00:01:52.290 I'm guessing. 40 00:01:52.290 --> 00:01:54.360 If I do a distribution, 41 00:01:54.360 --> 00:01:57.510 I can actually see how big that spread is, 42 00:01:57.510 --> 00:02:00.150 whether it's clustering at one end or the other, et cetera. 43 00:02:00.150 --> 00:02:02.850 It gives me a sense of the data overall, 44 00:02:02.850 --> 00:02:04.500 which is helpful on its own, 45 00:02:04.500 --> 00:02:06.360 but then when I analyze it and I find, 46 00:02:06.360 --> 00:02:10.680 ooh, this movie has a runtime of three hours and 47 minutes, 47 00:02:10.680 --> 00:02:12.750 I know if that's weird or not, right? 48 00:02:12.750 --> 00:02:15.660 So distributions are very, very, very powerful. 49 00:02:15.660 --> 00:02:18.510 I also then can do deeper dives into all kinds of stuff. 50 00:02:18.510 --> 00:02:20.910 Once I get my summary statistics, I do my distributions, 51 00:02:20.910 --> 00:02:24.480 I do other things, essentially it's those five F-U's, 52 00:02:24.480 --> 00:02:26.610 the Toyota 5 Whys of continuing to sort of dive 53 00:02:26.610 --> 00:02:28.500 deeper and deeper and deeper. 54 00:02:28.500 --> 00:02:33.500 Hopefully I discover insights that lead to new questions, 55 00:02:34.410 --> 00:02:36.960 which I can then look into. 56 00:02:36.960 --> 00:02:39.753 I can look into things like correlations also, right? 57 00:02:40.756 --> 00:02:44.430 Is there a correlation between runtime and movie quality? 58 00:02:44.430 --> 00:02:45.630 As an example. 59 00:02:45.630 --> 00:02:47.130 Does time play a role? 60 00:02:47.130 --> 00:02:48.780 Do the quality of movies change? 61 00:02:48.780 --> 00:02:51.990 Are the newer ones better or worse than the older ones? 62 00:02:51.990 --> 00:02:56.160 Are there any variances from the standards, right? 63 00:02:56.160 --> 00:02:59.430 Any particularly long movies or short movies 64 00:02:59.430 --> 00:03:00.870 over different time periods, 65 00:03:00.870 --> 00:03:02.703 et cetera, et cetera, et cetera. 66 00:03:03.540 --> 00:03:05.280 So, what we're gonna do next, 67 00:03:05.280 --> 00:03:09.960 another sort of thing I want to teach you, is pivot tables. 68 00:03:09.960 --> 00:03:14.490 Pivot tables are another really fantastic tool 69 00:03:14.490 --> 00:03:16.800 that you can play with in order to find 70 00:03:16.800 --> 00:03:18.300 insights in your data. 71 00:03:18.300 --> 00:03:20.340 Again, PivotTables are built into Excel, 72 00:03:20.340 --> 00:03:23.010 both on PC and Mac. 73 00:03:23.010 --> 00:03:24.480 I'm gonna be on my Mac now, 74 00:03:24.480 --> 00:03:25.830 so you'll see a slightly different interface 75 00:03:25.830 --> 00:03:27.420 than we looked at before. 76 00:03:27.420 --> 00:03:30.030 But pivot tables are a great way to bubble up insights 77 00:03:30.030 --> 00:03:31.350 in your data. 78 00:03:31.350 --> 00:03:32.910 And I have to start by saying 79 00:03:32.910 --> 00:03:35.310 there's a secret to pivot tables. 80 00:03:35.310 --> 00:03:37.440 Any of you who have ever tried 81 00:03:37.440 --> 00:03:38.970 to create pivot tables before, 82 00:03:38.970 --> 00:03:41.460 especially if you were trying to learn it on your own, 83 00:03:41.460 --> 00:03:42.840 were probably confused. 84 00:03:42.840 --> 00:03:44.617 I know when I first started using PivotTables, I was like, 85 00:03:44.617 --> 00:03:45.840 "What the heck's going on here?" 86 00:03:45.840 --> 00:03:47.553 It's very weird and confusing. 87 00:03:48.450 --> 00:03:50.400 But I took a course on LinkedIn learning, 88 00:03:50.400 --> 00:03:52.320 actually not one of my courses, somebody else's, 89 00:03:52.320 --> 00:03:54.630 and it taught me the secret. 90 00:03:54.630 --> 00:03:56.880 And the secret is amazing. 91 00:03:56.880 --> 00:04:00.300 The first thing to think about with the pivot table 92 00:04:00.300 --> 00:04:02.460 is what are you counting? 93 00:04:02.460 --> 00:04:04.203 What are you doing math with? 94 00:04:05.160 --> 00:04:09.150 That is your value. 95 00:04:09.150 --> 00:04:12.690 Once you figure out what goes in the values part 96 00:04:12.690 --> 00:04:15.660 of the pivot table, everything else is easy, 97 00:04:15.660 --> 00:04:17.310 but sometimes what you're counting 98 00:04:17.310 --> 00:04:20.310 isn't necessarily obvious. 99 00:04:20.310 --> 00:04:21.690 And I think we'll find an example of that 100 00:04:21.690 --> 00:04:23.460 when we play around with this stuff. 101 00:04:23.460 --> 00:04:26.790 So I'm gonna show you how to answer some questions 102 00:04:26.790 --> 00:04:30.330 in the IMDB dataset using PivotTables. 103 00:04:30.330 --> 00:04:33.120 Starting off with we're gonna try to figure out 104 00:04:33.120 --> 00:04:36.270 the total adjusted gross for the entire dataset. 105 00:04:36.270 --> 00:04:39.960 If we look at all top 1,000 movies, 106 00:04:39.960 --> 00:04:42.990 what is the total adjusted gross? 107 00:04:42.990 --> 00:04:44.580 So, here's the data set. 108 00:04:44.580 --> 00:04:46.080 And by the way, this is a data set 109 00:04:46.080 --> 00:04:47.820 that I've already cleaned up. 110 00:04:47.820 --> 00:04:49.410 And you'll notice that the cleaning that I did 111 00:04:49.410 --> 00:04:51.150 for this final version, which by the way, 112 00:04:51.150 --> 00:04:54.150 you all have access to, and as I mentioned before, 113 00:04:54.150 --> 00:04:58.953 undergrads, you can use this data set for your exercises. 114 00:04:59.880 --> 00:05:01.710 And so you'll notice a few things here. 115 00:05:01.710 --> 00:05:06.710 First of all, let me clear any sorting and filtering 116 00:05:07.110 --> 00:05:10.380 so that we can see all the data. 117 00:05:10.380 --> 00:05:13.860 So, you'll notice that there's an index column. 118 00:05:13.860 --> 00:05:17.310 I did use that feature in Power Query to create an index, 119 00:05:17.310 --> 00:05:20.460 but you'll notice I do only have one row per movie, 120 00:05:20.460 --> 00:05:22.110 because I actually ended up splitting the data 121 00:05:22.110 --> 00:05:23.940 into two different tables. 122 00:05:23.940 --> 00:05:27.900 So each movie in this primary table has, you know, 123 00:05:27.900 --> 00:05:31.380 an index number, just so I can refer back to it numerically, 124 00:05:31.380 --> 00:05:34.170 title of the movie, release year, runtime, IMDB rating, 125 00:05:34.170 --> 00:05:36.270 I kept the overview in there, I could have deleted it. 126 00:05:36.270 --> 00:05:41.270 Meta score, number of votes, gross CPI, adjusted gross, 127 00:05:42.240 --> 00:05:44.970 and I have a duplicate of my year column 128 00:05:44.970 --> 00:05:47.040 for reasons that I'll explain later. 129 00:05:47.040 --> 00:05:48.360 Don't worry about it for now. 130 00:05:48.360 --> 00:05:52.350 I created a separate table, I forgot, for each movie 131 00:05:52.350 --> 00:05:53.820 so I can see the index number. 132 00:05:53.820 --> 00:05:56.130 Let me zoom in on this so you can actually see it 133 00:05:56.130 --> 00:05:57.273 like normal humans. 134 00:05:58.290 --> 00:06:03.290 So movie zero corresponds with one of the movies 135 00:06:04.020 --> 00:06:05.820 in this other list here. 136 00:06:05.820 --> 00:06:10.820 So in fact, I can sort by the ID. 137 00:06:11.490 --> 00:06:16.490 And movie zero is this Indian "Dil Chahta Hai" movie. 138 00:06:16.710 --> 00:06:21.120 And so movies zero, that one, released in 2001, 139 00:06:21.120 --> 00:06:24.120 and I have some of the same data in here. 140 00:06:24.120 --> 00:06:27.660 Meta rating, IMDB rating, meta score, 141 00:06:27.660 --> 00:06:30.753 gross, director, different actors, 142 00:06:31.740 --> 00:06:34.230 and also CPI and the adjusted gross. 143 00:06:34.230 --> 00:06:36.090 The reason I duplicated some of those values here 144 00:06:36.090 --> 00:06:38.310 is cause I wanted to look at correlations 145 00:06:38.310 --> 00:06:40.710 between the people and some of these values. 146 00:06:40.710 --> 00:06:42.900 And so it was easier to do it in the same table. 147 00:06:42.900 --> 00:06:44.040 There are other ways to do that too, 148 00:06:44.040 --> 00:06:46.620 but it was just faster and easier to do it this way. 149 00:06:46.620 --> 00:06:49.353 Long story short, that's the data that I have. 150 00:06:50.520 --> 00:06:52.410 I'll show you this data set later on 151 00:06:52.410 --> 00:06:55.230 to explain the full analysis that I did, 152 00:06:55.230 --> 00:06:56.850 but let's just do some quick pivot tables. 153 00:06:56.850 --> 00:06:59.340 So like I said, what are we trying to do right now? 154 00:06:59.340 --> 00:07:03.150 We're trying to figure out the total adjusted gross. 155 00:07:03.150 --> 00:07:06.720 Now you could say, "Well, if we go to the main dataset here, 156 00:07:06.720 --> 00:07:09.450 and if I go to the adjusted gross column, 157 00:07:09.450 --> 00:07:11.340 I could simply write a formula, 158 00:07:11.340 --> 00:07:13.440 go all the way to the bottom, right, 159 00:07:13.440 --> 00:07:17.370 write a formula equals sum adjusted gross, 160 00:07:17.370 --> 00:07:18.870 and get the answer, right?" 161 00:07:18.870 --> 00:07:20.970 Or I can even just select the whole column 162 00:07:20.970 --> 00:07:22.230 and see the answer right here. 163 00:07:22.230 --> 00:07:27.230 The sum, 115 billion, 677 million, et cetera. 164 00:07:28.680 --> 00:07:30.330 That's true. I could just do it that way. 165 00:07:30.330 --> 00:07:31.980 That'd be a simple, fast way to do it. 166 00:07:31.980 --> 00:07:33.030 But I'm gonna use pivot tables to do it, 167 00:07:33.030 --> 00:07:34.260 'cause we're trying to learn pivot tables, right? 168 00:07:34.260 --> 00:07:37.290 So if I click anywhere inside of this table, 169 00:07:37.290 --> 00:07:38.123 doesn't matter where, 170 00:07:38.123 --> 00:07:39.750 I don't have to select the whole table, 171 00:07:39.750 --> 00:07:42.420 as long as I'm clicked inside of it, 172 00:07:42.420 --> 00:07:47.220 I can go to the insert tab up here and insert a pivot table. 173 00:07:47.220 --> 00:07:50.430 And I'm gonna create it in a new worksheet, 174 00:07:50.430 --> 00:07:54.270 and it's already selected the data that I want to work with. 175 00:07:54.270 --> 00:07:55.110 And so I'm gonna say, OK. 176 00:07:55.110 --> 00:07:56.160 So it creates a new worksheet 177 00:07:56.160 --> 00:07:59.310 with an empty pivot table in there. 178 00:07:59.310 --> 00:08:02.340 So once again, what am I trying to answer? 179 00:08:02.340 --> 00:08:05.610 I want the total adjusted gross. 180 00:08:05.610 --> 00:08:06.870 So what am I counting? 181 00:08:06.870 --> 00:08:08.400 What am I doing math with? 182 00:08:08.400 --> 00:08:11.670 Obviously the adjusted gross column in this case. 183 00:08:11.670 --> 00:08:13.800 So, I can go down here. 184 00:08:13.800 --> 00:08:16.560 The adjusted gross is one of my options. 185 00:08:16.560 --> 00:08:19.080 That's my value. 186 00:08:19.080 --> 00:08:22.680 So in the pivot table dialogue box over here, 187 00:08:22.680 --> 00:08:24.120 there are four different boxes, 188 00:08:24.120 --> 00:08:26.790 filters, columns, rows, and values. 189 00:08:26.790 --> 00:08:29.550 When I figure out what I'm counting, doing math with, 190 00:08:29.550 --> 00:08:30.990 that goes in the values area. 191 00:08:30.990 --> 00:08:33.030 Click and drag, and boom. 192 00:08:33.030 --> 00:08:36.240 Let me zoom in on this one also so you can see it. 193 00:08:36.240 --> 00:08:38.610 I get the sum of the adjusted gross 194 00:08:38.610 --> 00:08:40.950 and it's using scientific notation, 195 00:08:40.950 --> 00:08:45.930 but I can always change that so it displays it 196 00:08:45.930 --> 00:08:50.850 just as a number and I don't need any decimal places. 197 00:08:50.850 --> 00:08:54.300 Say OK, and I can see the number there. 198 00:08:54.300 --> 00:08:56.160 Now, maybe I'll want to change the formatting 199 00:08:56.160 --> 00:08:59.520 so it uses commas as a thousand separator, 200 00:08:59.520 --> 00:09:01.590 so I can actually read the number like a normal human. 201 00:09:01.590 --> 00:09:03.720 All right, 115 billion plus, 202 00:09:03.720 --> 00:09:06.150 which matches the number that I saw earlier. 203 00:09:06.150 --> 00:09:07.440 So in other words, I have a pivot table, 204 00:09:07.440 --> 00:09:09.090 it's a ridiculously simple pivot table 205 00:09:09.090 --> 00:09:11.190 that tells me the sum of the adjusted gross. 206 00:09:11.190 --> 00:09:12.570 Exactly the question. 207 00:09:12.570 --> 00:09:14.250 Which brings up an important point. 208 00:09:14.250 --> 00:09:17.460 Sometimes the phrasing of your questions 209 00:09:17.460 --> 00:09:20.460 will or won't lend itself to exactly 210 00:09:20.460 --> 00:09:23.160 what you see in your tools. 211 00:09:23.160 --> 00:09:25.320 I said I want the total adjusted gross, 212 00:09:25.320 --> 00:09:27.870 but what I'm doing in an Excel standpoint 213 00:09:27.870 --> 00:09:30.840 is a sum of the adjusted gross column. 214 00:09:30.840 --> 00:09:31.673 That's what we do, right? 215 00:09:31.673 --> 00:09:34.770 We ask questions, we figure out which columns, 216 00:09:34.770 --> 00:09:38.310 which variables apply to those questions, 217 00:09:38.310 --> 00:09:40.830 and we figure out the action, the math that we're doing. 218 00:09:40.830 --> 00:09:41.957 Are we adding it up? Are we doing an average? 219 00:09:41.957 --> 00:09:44.700 Are we doing this plus this equals that? 220 00:09:44.700 --> 00:09:46.530 Whatever the case may be. 221 00:09:46.530 --> 00:09:49.710 So, we've already done our first pivot table. 222 00:09:49.710 --> 00:09:51.330 So what else do I want to do? 223 00:09:51.330 --> 00:09:55.500 Next challenge is I want the share 224 00:09:55.500 --> 00:09:59.100 of adjusted gross by year. 225 00:09:59.100 --> 00:10:00.933 So what am I counting now? 226 00:10:01.830 --> 00:10:05.550 The share of adjusted gross by year. 227 00:10:05.550 --> 00:10:08.490 I think I'm still counting the adjusted gross, right? 228 00:10:08.490 --> 00:10:11.010 And I'm gonna translate some of the language here. 229 00:10:11.010 --> 00:10:13.140 What do I mean by this share of? 230 00:10:13.140 --> 00:10:15.300 I guess I mean the percentage, right? 231 00:10:15.300 --> 00:10:17.550 The percentage, the proportional share, 232 00:10:17.550 --> 00:10:21.900 those are the same idea, by year, organized by year. 233 00:10:21.900 --> 00:10:25.830 So it sounds like what I want to do is I want to organize, 234 00:10:25.830 --> 00:10:27.270 and I'm just going to keep building 235 00:10:27.270 --> 00:10:30.420 on the existing pivot table, no reason to replace it. 236 00:10:30.420 --> 00:10:32.730 So what I think I wanna do is I want to pull 237 00:10:32.730 --> 00:10:36.150 the released year and drop that into the rows. 238 00:10:36.150 --> 00:10:40.200 So I'm organizing the adjusted gross by year, 239 00:10:40.200 --> 00:10:42.960 exactly as phrased in the request, right? 240 00:10:42.960 --> 00:10:46.470 But as you'll see, this says the sum of the adjusted gross, 241 00:10:46.470 --> 00:10:50.160 right, this is where I started from the sum, the total, 242 00:10:50.160 --> 00:10:52.200 but the question says I want the share 243 00:10:52.200 --> 00:10:54.810 of the adjusted gross by year. 244 00:10:54.810 --> 00:10:57.570 And so I guess I do want the sum, the number, 245 00:10:57.570 --> 00:10:59.700 but the share of. 246 00:10:59.700 --> 00:11:03.450 So does this really answer my question the share? 247 00:11:03.450 --> 00:11:07.470 I guess 82 million, 15 million, whatever. 248 00:11:07.470 --> 00:11:11.070 I get the share, but usually, like I said before, 249 00:11:11.070 --> 00:11:13.347 I don't really know what 1.6 million, 250 00:11:13.347 --> 00:11:18.347 1.7 million is out of 115, or billion rather. 251 00:11:18.450 --> 00:11:21.810 And so share usually implies percentage. 252 00:11:21.810 --> 00:11:24.240 So how am I gonna show this as a percentage? 253 00:11:24.240 --> 00:11:27.210 Well, it's actually built into the pivot table. 254 00:11:27.210 --> 00:11:28.920 I can go over here with the values, 255 00:11:28.920 --> 00:11:31.260 and if I just click on this little I, 256 00:11:31.260 --> 00:11:33.210 I get this dialogue box. 257 00:11:33.210 --> 00:11:35.850 So PivotTables, all the action happens 258 00:11:35.850 --> 00:11:38.040 in like two or three or four different places. 259 00:11:38.040 --> 00:11:39.870 Number one is this area on the right hand side, 260 00:11:39.870 --> 00:11:42.720 where we drag things into those four boxes. 261 00:11:42.720 --> 00:11:44.250 This is another one, this dialogue box, 262 00:11:44.250 --> 00:11:47.310 the PivotTable Field Setting as it's called. 263 00:11:47.310 --> 00:11:50.460 So I can summarize by the sum, 264 00:11:50.460 --> 00:11:52.620 which is the default if it's a numeric column, 265 00:11:52.620 --> 00:11:55.200 as you can can see, that's what it did automatically. 266 00:11:55.200 --> 00:11:56.760 I can also get the count, the average, 267 00:11:56.760 --> 00:11:59.220 the minimum maximum, et cetera. 268 00:11:59.220 --> 00:12:02.463 I can also show data as, see this other tab over here. 269 00:12:03.300 --> 00:12:05.250 Right now it's not performing a calculation, 270 00:12:05.250 --> 00:12:07.950 but I can change this and have it do 271 00:12:07.950 --> 00:12:10.680 a percentage of grand total. 272 00:12:10.680 --> 00:12:13.590 If I do that and I say OK, 273 00:12:13.590 --> 00:12:16.890 look, now it's broken into percentages. 274 00:12:16.890 --> 00:12:21.300 So in our data set, 1920 represents 0% 275 00:12:21.300 --> 00:12:23.340 of the adjusted gross. 276 00:12:23.340 --> 00:12:26.397 1939 actually is like 3 1/2%, 277 00:12:26.397 --> 00:12:29.010 for some reason a lot of good stuff happened there. 278 00:12:29.010 --> 00:12:30.660 Or maybe one good movie, I'm not sure. 279 00:12:30.660 --> 00:12:33.090 That's a whole question I could ask the data, right? 280 00:12:33.090 --> 00:12:36.060 And more recent movies had a bigger impact 281 00:12:36.060 --> 00:12:39.030 on adjusted gross, but not universally, okay? 282 00:12:39.030 --> 00:12:42.000 So apparently I don't have any movies from 2020 283 00:12:42.000 --> 00:12:43.140 in the data set, maybe, 284 00:12:43.140 --> 00:12:46.500 or at least done with the gross factored in. 285 00:12:46.500 --> 00:12:50.220 Long story short, I think this answers that question, 286 00:12:50.220 --> 00:12:52.650 the share of adjusted gross by year. 287 00:12:52.650 --> 00:12:54.540 Let's look at the next question. 288 00:12:54.540 --> 00:12:58.290 The average adjusted gross by year. 289 00:12:58.290 --> 00:12:59.460 I just mentioned this one. 290 00:12:59.460 --> 00:13:01.020 In the pivot table, 291 00:13:01.020 --> 00:13:06.020 I can look at Summarize By, this tab in here, 292 00:13:06.240 --> 00:13:09.630 and instead of the sum, I can change it to average. 293 00:13:09.630 --> 00:13:14.630 And I can say, boom, now the average, but as a percentage. 294 00:13:15.870 --> 00:13:18.930 What's happening there? That's weird. 295 00:13:18.930 --> 00:13:23.930 1929 is 59%, 1925 is 30%. 296 00:13:24.060 --> 00:13:25.080 It looks like I'm getting, 297 00:13:25.080 --> 00:13:26.610 in fact, this one is 149%. 298 00:13:26.610 --> 00:13:28.260 What does this even mean? 299 00:13:28.260 --> 00:13:30.840 Well, if I'm taking an average number 300 00:13:30.840 --> 00:13:33.660 and dividing it by another average number, 301 00:13:33.660 --> 00:13:34.880 that's a weird way of looking at it. 302 00:13:34.880 --> 00:13:37.470 So I probably don't want to do 303 00:13:37.470 --> 00:13:39.900 the percentage calculation anymore. 304 00:13:39.900 --> 00:13:43.560 So let me turn that back to no calculation. 305 00:13:43.560 --> 00:13:48.240 And now I can see the average of adjusted gross as numbers. 306 00:13:48.240 --> 00:13:51.000 And yeah, now I can see it's just a number. 307 00:13:51.000 --> 00:13:54.420 The average adjusted gross for this year is some number. 308 00:13:54.420 --> 00:13:56.160 For another year, it's a different number, 309 00:13:56.160 --> 00:13:57.540 et cetera, et cetera. 310 00:13:57.540 --> 00:14:00.750 So sometimes percentages don't really make sense, 311 00:14:00.750 --> 00:14:03.810 other times they do, fair enough. 312 00:14:03.810 --> 00:14:07.350 How about the highest adjusted gross by year? 313 00:14:07.350 --> 00:14:10.980 Once again, just go back into the field value settings, 314 00:14:10.980 --> 00:14:14.040 instead of an average, look at the max. 315 00:14:14.040 --> 00:14:15.180 Easy. 316 00:14:15.180 --> 00:14:16.680 So you can see how this works. 317 00:14:16.680 --> 00:14:18.090 PivotTables make it very easy 318 00:14:18.090 --> 00:14:21.420 to quickly get some basic statistics 319 00:14:21.420 --> 00:14:23.310 and even organize those statistics, 320 00:14:23.310 --> 00:14:26.283 as we're doing here in this example by year. 321 00:14:27.150 --> 00:14:31.053 How about the most common run time? 322 00:14:32.670 --> 00:14:34.680 So this one's different. 323 00:14:34.680 --> 00:14:39.453 This one, the question is what are we counting? 324 00:14:40.560 --> 00:14:43.170 What are we doing math with? 325 00:14:43.170 --> 00:14:44.910 And again, if this is a live session, 326 00:14:44.910 --> 00:14:47.520 I would ask you all to answer this. 327 00:14:47.520 --> 00:14:49.860 And some of you would answer, "Well, I guess runtime, right? 328 00:14:49.860 --> 00:14:53.250 That's a numeric column. We must be counting that." 329 00:14:53.250 --> 00:14:54.810 And you know what the answer is? 330 00:14:54.810 --> 00:14:56.610 Nope. That's not what you're counting. 331 00:14:56.610 --> 00:14:58.920 That's not what we're doing math with. 332 00:14:58.920 --> 00:15:01.560 That's what we're organizing by. 333 00:15:01.560 --> 00:15:05.880 The question is what is the most common run time? 334 00:15:05.880 --> 00:15:09.750 Which run time happens most frequently? 335 00:15:09.750 --> 00:15:14.100 In other words, what we're counting in this case is movies. 336 00:15:14.100 --> 00:15:16.860 So, again, in data analytics it's really important 337 00:15:16.860 --> 00:15:20.130 to translate what you're asking into different words 338 00:15:20.130 --> 00:15:23.220 sometimes to understand what to do about it. 339 00:15:23.220 --> 00:15:26.130 So another way of saying what's the most common runtime 340 00:15:26.130 --> 00:15:31.130 is to ask the question, how many movies have each runtime? 341 00:15:33.240 --> 00:15:35.010 Right? How many movies are an hour long? 342 00:15:35.010 --> 00:15:36.360 How many movies are an hour and a half long? 343 00:15:36.360 --> 00:15:37.320 Et cetera, et cetera, et cetera. 344 00:15:37.320 --> 00:15:39.000 And then which one is the most common? 345 00:15:39.000 --> 00:15:41.850 So we're counting movies. 346 00:15:41.850 --> 00:15:44.430 So, I'm gonna leave this existing pivot table where it is. 347 00:15:44.430 --> 00:15:45.450 I'm gonna create a new pivot table. 348 00:15:45.450 --> 00:15:47.670 So I'm gonna go back to the source data. 349 00:15:47.670 --> 00:15:49.740 I'm gonna click into it. 350 00:15:49.740 --> 00:15:53.250 Oh, sorry, not the people one, I'm gonna go into this one. 351 00:15:53.250 --> 00:15:55.846 And I'm gonna say Insert, PivotTable. 352 00:15:55.846 --> 00:15:58.260 And I'm gonna drop it into an existing worksheet, 353 00:15:58.260 --> 00:16:00.030 but I'm gonna click into it so it drops it 354 00:16:00.030 --> 00:16:02.880 sort of next to this other one over here. 355 00:16:02.880 --> 00:16:04.200 And so it creates a new pivot table 356 00:16:04.200 --> 00:16:05.730 right next to the other one. 357 00:16:05.730 --> 00:16:08.130 And so I'm counting movies. 358 00:16:08.130 --> 00:16:10.800 So how about I count the index column? 359 00:16:10.800 --> 00:16:15.750 That's my unique numeric representation of different movies. 360 00:16:15.750 --> 00:16:17.580 Movie zero is that Indian movie, 361 00:16:17.580 --> 00:16:18.810 movie one is a different movie, 362 00:16:18.810 --> 00:16:20.400 et cetera, et cetera, et cetera. 363 00:16:20.400 --> 00:16:21.840 But look what it's doing. 364 00:16:21.840 --> 00:16:25.410 Because it's a numeric column, it's trying to add 'em up. 365 00:16:25.410 --> 00:16:28.800 The sum of the index values is a useless number. 366 00:16:28.800 --> 00:16:30.630 What I want is the count. 367 00:16:30.630 --> 00:16:34.320 So once again, field value settings, 368 00:16:34.320 --> 00:16:35.970 or field settings rather, 369 00:16:35.970 --> 00:16:38.580 instead of the sum, let me change it to account. 370 00:16:38.580 --> 00:16:43.580 And by the way, by default there's 999 movies, right? 371 00:16:43.710 --> 00:16:45.930 Because it's the IMDB top 1,000, and don't forget, 372 00:16:45.930 --> 00:16:50.490 we deleted "Apollo 13" cause it had that weird year. 373 00:16:50.490 --> 00:16:52.170 It had that year of PG, 374 00:16:52.170 --> 00:16:54.513 so I just deleted it for simplicity's sake. 375 00:16:55.710 --> 00:16:59.010 So this is the total number of movies in the data set. 376 00:16:59.010 --> 00:17:02.420 But what I want is the most common run time. 377 00:17:02.420 --> 00:17:04.950 So I guess I want to organize by run time. 378 00:17:04.950 --> 00:17:07.020 So lemme put the run time into the rows. 379 00:17:07.020 --> 00:17:08.220 Okay. 380 00:17:08.220 --> 00:17:10.560 Now I can see I have one movie 381 00:17:10.560 --> 00:17:12.480 that had a run time of 45 minutes. 382 00:17:12.480 --> 00:17:15.390 One movie with a runtime of 64 minutes, 383 00:17:15.390 --> 00:17:16.680 et cetera, et cetera. 384 00:17:16.680 --> 00:17:18.150 But what's the most common one? 385 00:17:18.150 --> 00:17:19.440 Well, this is a really long list. 386 00:17:19.440 --> 00:17:21.870 I could manually scroll through it, 387 00:17:21.870 --> 00:17:23.640 look for double digit numbers, 388 00:17:23.640 --> 00:17:25.050 and manually look for the biggest number, 389 00:17:25.050 --> 00:17:27.510 but you know there's a better way to do that, right? 390 00:17:27.510 --> 00:17:29.550 So maybe what I can do instead, 391 00:17:29.550 --> 00:17:33.630 is sort by this column of course. 392 00:17:33.630 --> 00:17:36.570 I can sort by clicking this little arrow over here, 393 00:17:36.570 --> 00:17:39.480 right at the top of this Row Labels column. 394 00:17:39.480 --> 00:17:44.310 And if I click that open, I can sort by not runtime, 395 00:17:44.310 --> 00:17:46.320 but by the count of the index. 396 00:17:46.320 --> 00:17:49.770 And I wanna sort in descending order. 397 00:17:49.770 --> 00:17:54.770 And boom, there are 23 movies with a runtime of 130 minutes. 398 00:17:57.150 --> 00:18:01.140 The most common runtime is 130 minutes. 399 00:18:01.140 --> 00:18:04.020 End of discussion. Tied with this one. 400 00:18:04.020 --> 00:18:07.740 There's also a lot of movies with a hundred minutes. 401 00:18:07.740 --> 00:18:09.300 And there's some that are close behind, 402 00:18:09.300 --> 00:18:10.830 and some others that are close behind that, 403 00:18:10.830 --> 00:18:12.600 et cetera, et cetera, et cetera. 404 00:18:12.600 --> 00:18:13.950 So long story short, 405 00:18:13.950 --> 00:18:17.073 we've answered the question with a nice simple pivot table. 406 00:18:18.060 --> 00:18:21.570 Now what about filtering? How do we filter our data? 407 00:18:21.570 --> 00:18:23.940 There are a whole bunch of different ways. 408 00:18:23.940 --> 00:18:26.760 So we're gonna look at the average adjusted gross 409 00:18:26.760 --> 00:18:28.680 by year pivot table. 410 00:18:28.680 --> 00:18:29.790 Do I still have that one? 411 00:18:29.790 --> 00:18:32.400 This is a max. Let me go back and change that to an average. 412 00:18:32.400 --> 00:18:34.980 'Cause that's what we said we're gonna look at. 413 00:18:34.980 --> 00:18:37.023 Average of adjusted gross by year. 414 00:18:37.980 --> 00:18:40.830 And I want to add a filter to it to find the average 415 00:18:40.830 --> 00:18:45.540 adjusted gross for movies released in the 2000s. 416 00:18:45.540 --> 00:18:47.450 Okay, so 2000 to 2009. 417 00:18:47.450 --> 00:18:48.330 So how can I do that? 418 00:18:48.330 --> 00:18:49.350 There's a whole bunch of different ways. 419 00:18:49.350 --> 00:18:53.580 One way, see this box over here, Filters? 420 00:18:53.580 --> 00:18:56.190 So I'm gonna drop the Dupe year into the filters. 421 00:18:56.190 --> 00:18:58.170 This is one of the reasons I wanted this extra column, 422 00:18:58.170 --> 00:19:00.450 'cause I can't have the year in both the rows 423 00:19:00.450 --> 00:19:03.390 and the filters, for whatever weird reason. 424 00:19:03.390 --> 00:19:07.260 So when I add it to the filter in the pivot table, 425 00:19:07.260 --> 00:19:10.290 it added this little stuff up here in row one. 426 00:19:10.290 --> 00:19:12.300 And so it shows me the name of the column, 427 00:19:12.300 --> 00:19:14.490 and then it shows me this little dropdown, 428 00:19:14.490 --> 00:19:17.040 and by default all of them are selected. 429 00:19:17.040 --> 00:19:20.430 But I could just go in here and deselect all, 430 00:19:20.430 --> 00:19:24.030 and then just manually select the 2000s. 431 00:19:24.030 --> 00:19:26.733 So 2000, 2001, 2, 3, 432 00:19:27.960 --> 00:19:32.133 4, 5, 6, 7, 8, and 9. 433 00:19:33.000 --> 00:19:34.413 And so the question. 434 00:19:35.880 --> 00:19:37.470 And so the question is answered, right? 435 00:19:37.470 --> 00:19:41.940 The question was looking for the average adjusted gross 436 00:19:41.940 --> 00:19:44.820 for the 2000s and there it is, 437 00:19:44.820 --> 00:19:48.840 the average adjusted gross for the 2000s. 438 00:19:48.840 --> 00:19:50.790 And of course I can see each individual year 439 00:19:50.790 --> 00:19:53.070 if I do it this way as well. 440 00:19:53.070 --> 00:19:55.140 So fine, that's nice. 441 00:19:55.140 --> 00:19:56.580 That's one way to do it. 442 00:19:56.580 --> 00:19:59.073 Adding the filter in the pivot fields. 443 00:20:00.000 --> 00:20:04.350 I can also filter using the dropdown itself. 444 00:20:04.350 --> 00:20:05.547 So let's say I get rid of this filter over here. 445 00:20:05.547 --> 00:20:06.380 And by the way, 446 00:20:06.380 --> 00:20:08.370 anything in the pivot table you want to get rid of, 447 00:20:08.370 --> 00:20:12.810 you can just click it and drag it off to the side, 448 00:20:12.810 --> 00:20:15.150 and let go of it, and poof it disappears. 449 00:20:15.150 --> 00:20:18.180 So you can see now it's not filtered by year anymore. 450 00:20:18.180 --> 00:20:20.760 The nice thing about PivotTables, 451 00:20:20.760 --> 00:20:22.470 is I can filter right in here. 452 00:20:22.470 --> 00:20:24.930 See this little dropdown where it says Row Labels? 453 00:20:24.930 --> 00:20:26.940 Drop that open and I have the same filter. 454 00:20:26.940 --> 00:20:29.460 So once again I can just deselect 455 00:20:29.460 --> 00:20:32.430 and then manually select just the ones that I want. 456 00:20:32.430 --> 00:20:35.280 2000, whoops, 1, 2, 3, et cetera. 457 00:20:35.280 --> 00:20:38.040 I'm not gonna do all of it, 'cause it just takes time. 458 00:20:38.040 --> 00:20:40.500 So that's nice. I can do it that way. 459 00:20:40.500 --> 00:20:42.240 There's another way to do it, 460 00:20:42.240 --> 00:20:46.230 which is I can actually use the between functionality, 461 00:20:46.230 --> 00:20:47.670 which is kind of of sweet. 462 00:20:47.670 --> 00:20:49.323 Instead of manually selecting, 463 00:20:50.160 --> 00:20:55.080 what if I say I want to filter by label or value? 464 00:20:55.080 --> 00:20:55.913 See these two things? 465 00:20:55.913 --> 00:20:57.720 Now this gets a little weird. 466 00:20:57.720 --> 00:21:01.440 By label means by the row labels, 467 00:21:01.440 --> 00:21:05.700 by value means it'll filter by the adjusted gross column. 468 00:21:05.700 --> 00:21:06.690 So keep that in mind. 469 00:21:06.690 --> 00:21:08.550 I can actually filter from this little dropdown, 470 00:21:08.550 --> 00:21:10.140 either one of those columns. 471 00:21:10.140 --> 00:21:13.260 So I wanna filter by label, the actual row labels, 472 00:21:13.260 --> 00:21:15.220 and I want to filter between 473 00:21:16.859 --> 00:21:20.310 2000 and 2009. 474 00:21:20.310 --> 00:21:22.560 And I say OK, and boom. 475 00:21:22.560 --> 00:21:24.150 Once again it did the same thing. 476 00:21:24.150 --> 00:21:29.150 So I can manually select, I can group, I can do between. 477 00:21:29.280 --> 00:21:30.870 And as you saw in there, by the way, 478 00:21:30.870 --> 00:21:32.790 I probably showed it too quickly. 479 00:21:32.790 --> 00:21:34.560 Between isn't the only option. 480 00:21:34.560 --> 00:21:37.350 I can filter less than, greater than, less than or equal to, 481 00:21:37.350 --> 00:21:40.530 contains, like essentially a word search, 482 00:21:40.530 --> 00:21:42.630 begins with, all kinds of different ways 483 00:21:42.630 --> 00:21:45.993 of actually doing smarter and faster filtering. 484 00:21:47.220 --> 00:21:48.210 Last but not least, 485 00:21:48.210 --> 00:21:51.123 and I'm gonna clear this filter just to show it to you, 486 00:21:52.350 --> 00:21:53.670 and maybe this isn't last but not least, 487 00:21:53.670 --> 00:21:55.440 but certainly one of the last ones, 488 00:21:55.440 --> 00:21:58.140 is I can filter, or not exactly filter, 489 00:21:58.140 --> 00:22:00.900 but I can essentially use what's called group buy. 490 00:22:00.900 --> 00:22:04.953 In this PivotTable Analyze tab, 491 00:22:05.970 --> 00:22:08.580 there's this group selection button. 492 00:22:08.580 --> 00:22:11.070 And if I have a year down here selected, 493 00:22:11.070 --> 00:22:16.070 so the active component of the pivot table is the years, 494 00:22:16.710 --> 00:22:20.550 and I click Group Selection and actually select it, 495 00:22:20.550 --> 00:22:21.990 it's gonna group and by default 496 00:22:21.990 --> 00:22:24.480 it's gonna say starting at the earliest year 497 00:22:24.480 --> 00:22:26.220 and ending at the most recent year. 498 00:22:26.220 --> 00:22:29.460 And it's gonna group, by default, by 10. 499 00:22:29.460 --> 00:22:30.870 So if I just leave the defaults in place, 500 00:22:30.870 --> 00:22:32.010 look what happened. 501 00:22:32.010 --> 00:22:35.100 It's grouping by 1920 to 29, 30 to 39, 502 00:22:35.100 --> 00:22:36.480 et cetera, et cetera, et cetera. 503 00:22:36.480 --> 00:22:40.200 And so the question was by the thousands what's the average? 504 00:22:40.200 --> 00:22:42.090 Boom, there's the number. 505 00:22:42.090 --> 00:22:44.490 So, this is kind of nice, 506 00:22:44.490 --> 00:22:46.980 But what if I wanted to compare the thousands, 507 00:22:46.980 --> 00:22:51.120 the 2000s rather, to all the movies before 508 00:22:51.120 --> 00:22:52.650 and all the movies after? 509 00:22:52.650 --> 00:22:55.080 Now I have to like manually add these ones together. 510 00:22:55.080 --> 00:22:56.610 That's kind of a pain. 511 00:22:56.610 --> 00:22:57.720 Well the nice thing is that Group By 512 00:22:57.720 --> 00:22:59.280 is a little bit smarter than that. 513 00:22:59.280 --> 00:23:01.860 I can go back in here, Group Selection, 514 00:23:01.860 --> 00:23:04.170 instead of starting at 1920, 515 00:23:04.170 --> 00:23:08.460 if I say start a 2000 and end at 2009. 516 00:23:10.110 --> 00:23:12.600 So I'm literally saying just group by that one section, 517 00:23:12.600 --> 00:23:13.500 look what it does. 518 00:23:14.340 --> 00:23:16.800 It actually said, "Okay, this is the one you care about, 519 00:23:16.800 --> 00:23:18.210 but there's earlier stuff here. 520 00:23:18.210 --> 00:23:19.043 What should we do with those? 521 00:23:19.043 --> 00:23:20.220 I guess we should group it all together 522 00:23:20.220 --> 00:23:21.300 and give you that number. 523 00:23:21.300 --> 00:23:22.620 And what about the stuff after that? 524 00:23:22.620 --> 00:23:25.260 We'll group all those together and give you that number." 525 00:23:25.260 --> 00:23:28.890 So group I is a very powerful and useful feature, 526 00:23:28.890 --> 00:23:31.320 definitely worth getting used to using. 527 00:23:31.320 --> 00:23:34.200 So there are different ways to do stuff in PivotTables 528 00:23:34.200 --> 00:23:37.740 and it's super fast and easy, click and drag, et cetera. 529 00:23:37.740 --> 00:23:39.180 Three more things I want to talk about. 530 00:23:39.180 --> 00:23:41.280 One is dates. 531 00:23:41.280 --> 00:23:44.010 Now, in this particular data set, we don't have dates. 532 00:23:44.010 --> 00:23:46.410 All we have is like the year. 533 00:23:46.410 --> 00:23:49.293 So I can't demonstrate this, but suffice to say this, 534 00:23:50.430 --> 00:23:53.910 if I'm in a pivot table, and we just did it, 535 00:23:53.910 --> 00:23:57.720 and if I had an actual fully defined date in here, 536 00:23:57.720 --> 00:24:00.900 like January 3rd, you know, 1992, 537 00:24:00.900 --> 00:24:04.200 and that was what all the values were in the column. 538 00:24:04.200 --> 00:24:07.260 if I dragged that into the rows as an example, 539 00:24:07.260 --> 00:24:11.850 it would automatically, not only show me by date, 540 00:24:11.850 --> 00:24:14.850 but I could even look at the values by year, 541 00:24:14.850 --> 00:24:18.000 by quarter, and by month. 542 00:24:18.000 --> 00:24:21.480 So years are treated very intelligently 543 00:24:21.480 --> 00:24:24.600 in Microsoft software generally, PivotTables specifically. 544 00:24:24.600 --> 00:24:27.240 So, check that out, play with that. 545 00:24:27.240 --> 00:24:30.630 There's also things like the top X. 546 00:24:30.630 --> 00:24:35.310 Let's say. Let me clear this grouping over here. 547 00:24:35.310 --> 00:24:37.830 Let me go back to PivotTable Analyze group, 548 00:24:37.830 --> 00:24:39.130 and I'm gonna ungroup. 549 00:24:40.140 --> 00:24:43.410 And let's say that what I wanted to do 550 00:24:43.410 --> 00:24:47.223 is to look at the top X. (coughs) Excuse me. 551 00:24:48.150 --> 00:24:53.150 What I can go in here and actually, by value in this case, 552 00:24:53.580 --> 00:24:58.380 I'm gonna say I wanna look at the top X of adjusted gross. 553 00:24:58.380 --> 00:25:01.260 So I'm gonna go into By Value, adjusted gross. 554 00:25:01.260 --> 00:25:03.900 Top 10, but really it's the top X, 555 00:25:03.900 --> 00:25:08.220 and I wanna look at the top 10 items. 556 00:25:08.220 --> 00:25:10.050 So in other words, it'll automatically give me 557 00:25:10.050 --> 00:25:15.050 the 10 years that have the highest average adjusted gross. 558 00:25:15.210 --> 00:25:17.790 So this is 10 items, the 10 years, 559 00:25:17.790 --> 00:25:19.230 'cause that's the organizing principle here, 560 00:25:19.230 --> 00:25:20.670 by adjusted gross. 561 00:25:20.670 --> 00:25:23.730 I can also change it to the top percentage. 562 00:25:23.730 --> 00:25:27.780 What are the top, the 10% of adjusted gross 563 00:25:27.780 --> 00:25:30.330 you can account for in just these two years? 564 00:25:30.330 --> 00:25:31.380 That's interesting. 565 00:25:31.380 --> 00:25:33.330 So different ways of looking at your data, 566 00:25:33.330 --> 00:25:36.210 and a bunch of different sort of functions, 567 00:25:36.210 --> 00:25:40.500 all using built-in functionality in PivotTables. 568 00:25:40.500 --> 00:25:43.140 Super cool, really powerful. 569 00:25:43.140 --> 00:25:46.140 Last but not least, I do wanna show one other thing. 570 00:25:46.140 --> 00:25:51.030 There's also the ability to create calculated fields. 571 00:25:51.030 --> 00:25:53.040 So let's say I had a pivot table like this one, 572 00:25:53.040 --> 00:25:55.500 the number of movies and the runtime. 573 00:25:55.500 --> 00:25:58.170 And let's say I had, I wanted to create a calculated field, 574 00:25:58.170 --> 00:26:00.843 a new field that somehow did some weird math. 575 00:26:01.830 --> 00:26:05.970 For instance the runtime, I'm gonna make something up here. 576 00:26:05.970 --> 00:26:09.150 The runtime, I wanna calculate the number of seconds 577 00:26:09.150 --> 00:26:11.700 for runtime for some weird reason. 578 00:26:11.700 --> 00:26:12.840 I can create a new field. 579 00:26:12.840 --> 00:26:15.150 So I go into PivotTable Analyze, 580 00:26:15.150 --> 00:26:19.050 and I go into Fields, Items, & Sets, 581 00:26:19.050 --> 00:26:21.840 and I click on Calculated Field. 582 00:26:21.840 --> 00:26:24.600 And this is sort of like we added a column 583 00:26:24.600 --> 00:26:26.940 in Power BI when we were doing data cleaning. 584 00:26:26.940 --> 00:26:29.400 I can create a calculated field and I can name it. 585 00:26:29.400 --> 00:26:34.400 So I'm gonna call it run seconds, for whatever reason. 586 00:26:34.650 --> 00:26:37.050 And the formula is going to be 587 00:26:37.050 --> 00:26:41.310 take the run time 588 00:26:41.310 --> 00:26:45.090 and multiply it by 60. 589 00:26:45.090 --> 00:26:46.020 'Cause right now it's in minutes. 590 00:26:46.020 --> 00:26:48.540 To turn minutes into seconds you multiply by 60. 591 00:26:48.540 --> 00:26:51.270 And I say, OK, and look, 592 00:26:51.270 --> 00:26:53.910 it creates run seconds, right? 593 00:26:53.910 --> 00:26:57.960 So it actually turned, you know, the minutes into seconds 594 00:26:57.960 --> 00:26:59.520 and I can see that number there. 595 00:26:59.520 --> 00:27:02.280 So, what's cool about this too is that that field 596 00:27:02.280 --> 00:27:05.430 then is in this PivotTable Fields box, 597 00:27:05.430 --> 00:27:08.280 so I can use this field in other ways 598 00:27:08.280 --> 00:27:09.430 within the pivot table. 599 00:27:10.290 --> 00:27:12.360 Powerful stuff. Definitely worth checking out. 600 00:27:12.360 --> 00:27:15.000 There's more stuff you can do in PivotTables too, 601 00:27:15.000 --> 00:27:16.410 but definitely use something you should play with 602 00:27:16.410 --> 00:27:17.340 and get comfortable doing. 603 00:27:17.340 --> 00:27:19.530 You'll be able to do data analytics much more quickly 604 00:27:19.530 --> 00:27:21.330 and easily with PivotTables. 605 00:27:21.330 --> 00:27:23.493 All right, don't forget, 606 00:27:24.840 --> 00:27:29.580 data analysis is about asking questions of your data. 607 00:27:29.580 --> 00:27:31.950 It's not about jumping in and starting analyzing stuff. 608 00:27:31.950 --> 00:27:34.500 Why are you looking at the numbers that way? 609 00:27:34.500 --> 00:27:36.540 What question are you trying to answer? 610 00:27:36.540 --> 00:27:38.400 And it's also about, like I keep saying, 611 00:27:38.400 --> 00:27:41.520 being able to translate those questions into fields, 612 00:27:41.520 --> 00:27:45.690 metrics, values that you have and the words 613 00:27:45.690 --> 00:27:47.520 that help you understand what you're really trying to do. 614 00:27:47.520 --> 00:27:50.970 Share. Ah, you mean percentage, right? 615 00:27:50.970 --> 00:27:55.050 The average. Okay, I want the true average or the typical. 616 00:27:55.050 --> 00:27:56.430 Ooh, how do we define typical? 617 00:27:56.430 --> 00:27:57.900 Could be the average, could be the median, 618 00:27:57.900 --> 00:27:59.640 it could be the mode, right? 619 00:27:59.640 --> 00:28:03.030 So translation is a big, big, big part of this. 620 00:28:03.030 --> 00:28:04.650 And like I said before, sometimes you're forming 621 00:28:04.650 --> 00:28:08.940 explicit hypotheses which you're testing, but not always. 622 00:28:08.940 --> 00:28:11.520 Very important in data analytics. 623 00:28:11.520 --> 00:28:14.460 Sometimes you are neck deep in the data, 624 00:28:14.460 --> 00:28:18.300 you're so focused on the data that you forget 625 00:28:18.300 --> 00:28:19.980 to step back and ask those questions. 626 00:28:19.980 --> 00:28:21.420 Really ask yourself, what am I trying to do? 627 00:28:21.420 --> 00:28:23.130 Why am I trying to look at it this way? 628 00:28:23.130 --> 00:28:24.960 You don't just explore data to explore data, 629 00:28:24.960 --> 00:28:27.450 you're trying to really answer specific questions. 630 00:28:27.450 --> 00:28:30.690 Sometimes you are exploring to find where the insights are, 631 00:28:30.690 --> 00:28:32.910 but even then you're essentially answering questions 632 00:28:32.910 --> 00:28:34.740 even if you're not defining it that way. 633 00:28:34.740 --> 00:28:35.880 And don't forget, there are a million and one 634 00:28:35.880 --> 00:28:37.590 questions to ask of your data, 635 00:28:37.590 --> 00:28:38.790 but some of the key ones 636 00:28:38.790 --> 00:28:40.350 are the ones that we talked about earlier. 637 00:28:40.350 --> 00:28:42.840 What's good or bad? Should I be looking at the rate? 638 00:28:42.840 --> 00:28:45.930 Et cetera, et cetera, et cetera. 639 00:28:45.930 --> 00:28:50.373 So, what did I learn from the IMDB data? 640 00:28:51.690 --> 00:28:52.860 Well here's the data set, 641 00:28:52.860 --> 00:28:54.300 and I just wanna point out a couple things 642 00:28:54.300 --> 00:28:55.980 before I even talk about what I found, 643 00:28:55.980 --> 00:29:00.980 which is, you'll notice that I have the IMDB top 1,000 data 644 00:29:01.140 --> 00:29:04.020 that you can see pulled in from Power Query. 645 00:29:04.020 --> 00:29:06.120 And you'll notice I have one row per movie. 646 00:29:06.120 --> 00:29:08.250 I ended up splitting up the data in a different way 647 00:29:08.250 --> 00:29:12.720 than as I demonstrated in Power Query earlier. 648 00:29:12.720 --> 00:29:14.820 I have one row per movie, I have the index. 649 00:29:14.820 --> 00:29:17.130 So I actually did end up adding an index column. 650 00:29:17.130 --> 00:29:19.980 So now each movie has a number associated with it, 651 00:29:19.980 --> 00:29:21.990 and you'll see how I use that later. 652 00:29:21.990 --> 00:29:25.080 I have the other fields released year, runtime, 653 00:29:25.080 --> 00:29:26.760 IMDB rating, overview. 654 00:29:26.760 --> 00:29:28.230 I didn't bother deleting that. 655 00:29:28.230 --> 00:29:32.280 Meta score, votes, gross, CPI, adjusted gross. 656 00:29:32.280 --> 00:29:35.130 And I have a column in here called Dupe Year, 657 00:29:35.130 --> 00:29:37.530 which you'll see how I use that later on. 658 00:29:37.530 --> 00:29:39.060 I also have the CPI table, again, 659 00:29:39.060 --> 00:29:41.520 because it was in Power Query so it got pulled in. 660 00:29:41.520 --> 00:29:45.240 And then I have these two relational tables, 661 00:29:45.240 --> 00:29:47.760 which are essentially where I pulled out, 662 00:29:47.760 --> 00:29:52.760 in this case the people, so the actors and directors, 663 00:29:52.920 --> 00:29:56.610 and each of them is associated with one movie. 664 00:29:56.610 --> 00:30:00.750 So movie zero is movie number zero. 665 00:30:00.750 --> 00:30:04.770 And so if I sort this table by the index column. 666 00:30:04.770 --> 00:30:06.810 Let me just show you that. 667 00:30:06.810 --> 00:30:10.717 Movie zero is that Bollywood movie that we saw earlier, 668 00:30:10.717 --> 00:30:12.930 "Dil Chahta Hai." 669 00:30:12.930 --> 00:30:15.630 And so these people in this other table 670 00:30:15.630 --> 00:30:19.410 are the four actors and the director for that movie. 671 00:30:19.410 --> 00:30:21.390 You'll also notice in this people table 672 00:30:21.390 --> 00:30:22.410 I have some of the other fields, 673 00:30:22.410 --> 00:30:25.740 the release year, IMDB rating, et cetera. 674 00:30:25.740 --> 00:30:29.340 The reason is because I wanted to do like correlations 675 00:30:29.340 --> 00:30:31.140 between people and some of this data 676 00:30:31.140 --> 00:30:33.930 and it was easier to have it in the table 677 00:30:33.930 --> 00:30:35.040 rather than have it separately. 678 00:30:35.040 --> 00:30:36.720 You could do it multiple ways, 679 00:30:36.720 --> 00:30:38.670 it was easier just to pull those fields in here 680 00:30:38.670 --> 00:30:40.380 and sort of duplicate 'em. 681 00:30:40.380 --> 00:30:41.670 Long story short, I also have 682 00:30:41.670 --> 00:30:43.620 a separate table for my genres. 683 00:30:43.620 --> 00:30:47.010 I know when we're in Power Query, I essentially demonstrated 684 00:30:47.010 --> 00:30:49.290 a way of unpivoting those columns. 685 00:30:49.290 --> 00:30:52.290 I did unpivot them, but in a slightly different way. 686 00:30:52.290 --> 00:30:54.990 I unpivoted them into this table, 687 00:30:54.990 --> 00:30:57.810 where I once again have the index column 688 00:30:57.810 --> 00:30:59.220 and then some of those other fields again. 689 00:30:59.220 --> 00:31:03.720 But then I have, again, our genres in our multiple rows. 690 00:31:03.720 --> 00:31:05.820 So similar idea, but the data does look different 691 00:31:05.820 --> 00:31:07.350 from what I showed you earlier. 692 00:31:07.350 --> 00:31:09.480 Okay, so what did I find in the data, was the question. 693 00:31:09.480 --> 00:31:11.070 And the first thing I did 694 00:31:11.070 --> 00:31:13.260 is what I've recommended you think about doing, 695 00:31:13.260 --> 00:31:16.200 which is I ran a bunch of summary statistics. 696 00:31:16.200 --> 00:31:21.097 So I looked at things like the runtime average, 697 00:31:21.097 --> 00:31:26.040 IMDB rating, meta score, votes, adjusted gross, 698 00:31:26.040 --> 00:31:27.840 all of these being average numbers. 699 00:31:27.840 --> 00:31:30.360 I also did like min/max for some of these things, 700 00:31:30.360 --> 00:31:33.030 and maybe even all of these things, I can't remember, 701 00:31:33.030 --> 00:31:34.140 playing around with a lot of this stuff. 702 00:31:34.140 --> 00:31:39.140 And, as you'll notice, my rows here are decades. 703 00:31:39.960 --> 00:31:43.230 And you're seeing this after I did a bunch of work 704 00:31:43.230 --> 00:31:44.250 and sort of played around with things. 705 00:31:44.250 --> 00:31:46.260 Eventually I was looking at it I think over time, 706 00:31:46.260 --> 00:31:48.420 generally year by year. 707 00:31:48.420 --> 00:31:51.090 And I decided at one point to look at it by decade. 708 00:31:51.090 --> 00:31:53.580 And that was sort of one of my first ah-has, 709 00:31:53.580 --> 00:31:55.500 one of my first insights, 710 00:31:55.500 --> 00:32:00.500 was that the adjusted gross revenues for the early 2000s, 711 00:32:02.310 --> 00:32:05.220 not the early 2000s, but 2000 to 2009, 712 00:32:05.220 --> 00:32:08.760 was a much lower number than for all of the other years. 713 00:32:08.760 --> 00:32:10.740 So that became sort of the spark 714 00:32:10.740 --> 00:32:14.250 that guided me through a bunch of the rest of my analysis. 715 00:32:14.250 --> 00:32:15.960 Not everything, 716 00:32:15.960 --> 00:32:17.790 but you'll notice I have this little hypothesis up here. 717 00:32:17.790 --> 00:32:20.190 Something strange going on with the movies in that era. 718 00:32:20.190 --> 00:32:21.300 Was it the recessions? 719 00:32:21.300 --> 00:32:23.280 Or maybe people were on the webs? 720 00:32:23.280 --> 00:32:25.230 It was early web years instead of going to the movies. 721 00:32:25.230 --> 00:32:28.050 Or something else? Dot, dot, dot question mark. 722 00:32:28.050 --> 00:32:29.610 I never figured out what it was. 723 00:32:29.610 --> 00:32:30.930 I didn't do the full analysis here. 724 00:32:30.930 --> 00:32:33.630 I started the analysis, 'cause you all can play with this 725 00:32:33.630 --> 00:32:35.880 and run with the analysis that I've done, 726 00:32:35.880 --> 00:32:37.530 or take it further. 727 00:32:37.530 --> 00:32:38.400 It's entirely up to you. 728 00:32:38.400 --> 00:32:40.350 And by the way, when when I say you all, 729 00:32:40.350 --> 00:32:43.680 undergrads, again, are able to use this for your project, 730 00:32:43.680 --> 00:32:47.370 graduate students need to use your own research. 731 00:32:47.370 --> 00:32:52.370 Okay, so summary statistics, min/max, number of votes. 732 00:32:52.380 --> 00:32:55.170 I noticed that there were a lot more IMDB votes 733 00:32:55.170 --> 00:32:59.250 for that same time period, but again, not sure why. 734 00:32:59.250 --> 00:33:03.180 Sometimes you'll notice I broke out just that decade 735 00:33:03.180 --> 00:33:05.640 year by year to see if any interesting patterns 736 00:33:05.640 --> 00:33:06.810 popped out at me because hey, 737 00:33:06.810 --> 00:33:08.820 I'm noticing something happening. 738 00:33:08.820 --> 00:33:10.230 Maybe if I look at that decade, 739 00:33:10.230 --> 00:33:11.640 maybe I'll find some anomalies. 740 00:33:11.640 --> 00:33:16.260 For instance, many fewer votes during this one year, 2003, 741 00:33:16.260 --> 00:33:19.320 and, by the way, some really high number of votes 742 00:33:19.320 --> 00:33:21.060 a few years after that. 743 00:33:21.060 --> 00:33:21.893 What does all this mean? 744 00:33:21.893 --> 00:33:23.670 I have no idea, but I'm just noticing things 745 00:33:23.670 --> 00:33:25.830 when I'm running summary statistics. 746 00:33:25.830 --> 00:33:30.240 Min/max, meta score, also much lower minimum meta score 747 00:33:30.240 --> 00:33:31.530 for that decade. 748 00:33:31.530 --> 00:33:33.780 The highest one doesn't even hit a hundred for some reason. 749 00:33:33.780 --> 00:33:34.710 I dunno if that's significant, 750 00:33:34.710 --> 00:33:36.570 but this certainly feels significant. 751 00:33:36.570 --> 00:33:38.310 Again, looked at it by those years 752 00:33:38.310 --> 00:33:39.660 to see if anything popped out. 753 00:33:39.660 --> 00:33:41.370 And sure there was some at the bottom of the range, 754 00:33:41.370 --> 00:33:42.930 some at the top of the range. 755 00:33:42.930 --> 00:33:45.330 You know, nothing too earth shattering here. 756 00:33:45.330 --> 00:33:47.490 I'm not gonna explain every finding here, 757 00:33:47.490 --> 00:33:49.320 but feel free to poke around this. 758 00:33:49.320 --> 00:33:51.750 You'll also notice that I did do some visualizations. 759 00:33:51.750 --> 00:33:54.630 In this case, just a couple of box and whisker plots 760 00:33:54.630 --> 00:33:57.690 to look at if we're looking at all of the votes cast, 761 00:33:57.690 --> 00:33:59.160 counting how many votes were cast 762 00:33:59.160 --> 00:34:01.590 for all the movies in the data set, 763 00:34:01.590 --> 00:34:03.690 the median is a fairly low number, 764 00:34:03.690 --> 00:34:05.790 a hundred thousand maybe ish. 765 00:34:05.790 --> 00:34:08.700 And then you know the mid quartile range 766 00:34:08.700 --> 00:34:12.690 is a decent but fairly small spread all below 500,000, 767 00:34:12.690 --> 00:34:14.670 and then some real big outliers. 768 00:34:14.670 --> 00:34:15.990 Adjusted gross, same thing. 769 00:34:15.990 --> 00:34:19.680 Most movies, the median movie makes very little money, 770 00:34:19.680 --> 00:34:22.740 and then some outliers way higher than that, 771 00:34:22.740 --> 00:34:24.480 especially in terms of adjusted gross, 772 00:34:24.480 --> 00:34:26.160 adjusted for inflation. 773 00:34:26.160 --> 00:34:27.540 So that's my summary statistics. 774 00:34:27.540 --> 00:34:29.880 That sort of get led me to my first hypothesis 775 00:34:29.880 --> 00:34:31.860 that I really wanted to explore more. 776 00:34:31.860 --> 00:34:35.490 So I also have this other tab here called 2000s. 777 00:34:35.490 --> 00:34:39.570 And in this tab I just explored that decade for a while 778 00:34:39.570 --> 00:34:41.700 and I looked at it from a bunch of different directions. 779 00:34:41.700 --> 00:34:44.100 And again, I'm not gonna tell you everything that I found, 780 00:34:44.100 --> 00:34:47.040 you can poke around this file and look at it yourself. 781 00:34:47.040 --> 00:34:49.620 One thing I'll point out is I copied and pasted 782 00:34:49.620 --> 00:34:52.530 some of the data over here to make it easier 783 00:34:52.530 --> 00:34:54.693 to generate things like scatter plots. 784 00:34:56.040 --> 00:34:57.540 You know, I'm not gonna go into why I did that, 785 00:34:57.540 --> 00:34:58.980 but you know, essentially you just made it easier 786 00:34:58.980 --> 00:35:00.810 to do it this way. 787 00:35:00.810 --> 00:35:03.290 I did this scatterplot adjusted gross over time 788 00:35:03.290 --> 00:35:05.880 to see if I could visually see any of those patterns. 789 00:35:05.880 --> 00:35:08.820 And, you know, adjusted gross is sort of going up 790 00:35:08.820 --> 00:35:10.080 when you look at the outliers, 791 00:35:10.080 --> 00:35:12.240 but overall it just clusters really low, 792 00:35:12.240 --> 00:35:14.820 like we saw in that box and whisker plot. 793 00:35:14.820 --> 00:35:16.740 And I did look at a box and whiskers 794 00:35:16.740 --> 00:35:18.540 of adjusted gross over time to see if there were 795 00:35:18.540 --> 00:35:21.030 any patterns in the trends overall. 796 00:35:21.030 --> 00:35:26.030 And clearly the Golden Age of cinema, the 1960s ish here. 797 00:35:26.490 --> 00:35:28.090 Whoops. Now what happened there? 798 00:35:29.130 --> 00:35:33.900 Interesting how huge the range is of adjusted gross. 799 00:35:33.900 --> 00:35:37.170 Much larger than it is for more recent years. 800 00:35:37.170 --> 00:35:39.930 So some of these movies, or some of these years had movies, 801 00:35:39.930 --> 00:35:43.390 that really accumulated value over time 802 00:35:44.430 --> 00:35:46.560 or accumulated gross revenues 803 00:35:46.560 --> 00:35:48.240 and maybe partially due to inflation, 804 00:35:48.240 --> 00:35:50.160 I'm not sure exactly what it means, 805 00:35:50.160 --> 00:35:52.980 but more recently, much less so. 806 00:35:52.980 --> 00:35:56.910 But again, the 2000s look a little bit smaller, lower, 807 00:35:56.910 --> 00:36:00.090 overall than for instance the 2010s 808 00:36:00.090 --> 00:36:03.120 and maybe the similar to the 1990s. 809 00:36:03.120 --> 00:36:05.520 Long story short, not sure what I saw there. 810 00:36:05.520 --> 00:36:08.850 I also looked at things like the key people. 811 00:36:08.850 --> 00:36:12.600 So Clint Eastwood appears in 20 of the top 1,000 movies, 812 00:36:12.600 --> 00:36:14.910 either as a director and/or an actor. 813 00:36:14.910 --> 00:36:16.890 Robert De Niro close behind, et cetera, et cetera. 814 00:36:16.890 --> 00:36:20.310 I started to look at those by decade again, 815 00:36:20.310 --> 00:36:22.440 and did see the 2000s. 816 00:36:22.440 --> 00:36:24.090 It's different names in the different decades. 817 00:36:24.090 --> 00:36:25.080 No big surprise there. 818 00:36:25.080 --> 00:36:27.990 There may have been one or two overlaps, I can't remember. 819 00:36:27.990 --> 00:36:30.790 But long story short, one of the things I looked at 820 00:36:32.295 --> 00:36:36.727 was it does appear that there are more repeat actors 821 00:36:38.010 --> 00:36:43.010 and higher adjusted gross revenue for later time periods. 822 00:36:43.860 --> 00:36:47.160 Sort of implying that like the importance of star power 823 00:36:47.160 --> 00:36:51.150 increased over time, which feels fair and right, 824 00:36:51.150 --> 00:36:53.430 and the data seems to bear that out. 825 00:36:53.430 --> 00:36:58.260 I also noticed that the genres shifted over the decades. 826 00:36:58.260 --> 00:37:00.840 So I sort of sorted by the number of movies 827 00:37:00.840 --> 00:37:03.180 falling in particular genres. 828 00:37:03.180 --> 00:37:06.870 And looked at the top 5 and/or the top 10 and noticed 829 00:37:06.870 --> 00:37:11.010 that like for instance, comedy moved up over time, 830 00:37:11.010 --> 00:37:12.360 you know, by one or two points. 831 00:37:12.360 --> 00:37:14.670 Crime and thrillers went down. 832 00:37:14.670 --> 00:37:17.040 And the associated revenue is also increased. 833 00:37:17.040 --> 00:37:18.360 So I have like notes that are explaining 834 00:37:18.360 --> 00:37:20.940 some of the key ideas here and what I noticed. 835 00:37:20.940 --> 00:37:25.940 Long story short, did not find great insights 836 00:37:26.340 --> 00:37:28.500 into why this stuff is happening, 837 00:37:28.500 --> 00:37:29.640 but I certainly did bubble up 838 00:37:29.640 --> 00:37:32.070 that there's something going on in that decade. 839 00:37:32.070 --> 00:37:34.920 So you can run with this data set as it is, 840 00:37:34.920 --> 00:37:37.470 there is potentially stuff to talk about here, 841 00:37:37.470 --> 00:37:40.230 and certainly stuff to dig into a little bit deeper. 842 00:37:40.230 --> 00:37:41.910 Maybe you'll find more insights 843 00:37:41.910 --> 00:37:45.120 if you spend a little bit more time with this data set. 844 00:37:45.120 --> 00:37:48.450 So, I hope this is helpful and I hope that you find 845 00:37:48.450 --> 00:37:52.110 this project and this data set interesting to work with. 846 00:37:52.110 --> 00:37:53.580 Plenty to start with here, 847 00:37:53.580 --> 00:37:56.643 but clearly more room to play with. 848 00:37:57.900 --> 00:38:00.210 Oh, and I completely forgot to mention, 849 00:38:00.210 --> 00:38:03.660 I pointed out that I had this Dupe Year column over here 850 00:38:03.660 --> 00:38:05.280 and that I was gonna explain it and I didn't. 851 00:38:05.280 --> 00:38:09.720 So let me go back and explain that, which is this. 852 00:38:09.720 --> 00:38:11.010 In the summary statistics, 853 00:38:11.010 --> 00:38:13.740 I pointed out that I did all of these pivot tables 854 00:38:13.740 --> 00:38:16.890 where I was grouping by decade, 855 00:38:16.890 --> 00:38:19.803 and then I did some where I was grouping by year. 856 00:38:20.760 --> 00:38:23.400 You will notice if you create pivot tables 857 00:38:23.400 --> 00:38:28.400 and you use Group By on a certain field, 858 00:38:29.070 --> 00:38:34.070 every pivot table you create will group by that in that way. 859 00:38:34.350 --> 00:38:37.890 So all these pivot tables where I'm grouping by decade, 860 00:38:37.890 --> 00:38:40.143 as soon as I created this pivot table, 861 00:38:41.070 --> 00:38:45.240 this one was already here and already grouping by the year. 862 00:38:45.240 --> 00:38:46.710 As soon as I created this one 863 00:38:46.710 --> 00:38:48.570 and pulled the year into the rows, 864 00:38:48.570 --> 00:38:51.420 it followed the exact same Group By. 865 00:38:51.420 --> 00:38:53.640 So the Group By is sort of a universal thing. 866 00:38:53.640 --> 00:38:56.400 And so when I wanted to do different grouping, 867 00:38:56.400 --> 00:39:01.260 one by decade and one by actual year, I couldn't do it. 868 00:39:01.260 --> 00:39:03.960 So I had to create that duplicate year column 869 00:39:03.960 --> 00:39:07.050 and group that column by year. 870 00:39:07.050 --> 00:39:10.650 And if I use that column, that field in another pivot table, 871 00:39:10.650 --> 00:39:11.700 it's gonna group by year. 872 00:39:11.700 --> 00:39:14.040 So that's why I have that duplicate there. 873 00:39:14.040 --> 00:39:15.390 I might have done one other thing with it 874 00:39:15.390 --> 00:39:16.223 that I can't remember, 875 00:39:16.223 --> 00:39:19.143 but you'll probably notice it if you poke around the data.