WEBVTT 1 00:00:00.360 --> 00:00:02.700 All right, this is the last thing we're gonna talk about 2 00:00:02.700 --> 00:00:07.260 during this prep week, week zero, all about data analytics. 3 00:00:07.260 --> 00:00:09.840 And now we're gonna talk about, as it says here, 4 00:00:09.840 --> 00:00:13.260 data analytics with a side of visualization. 5 00:00:13.260 --> 00:00:16.920 Because here's the thing, if you are doing data analytics 6 00:00:16.920 --> 00:00:19.890 you need to visualize your data. 7 00:00:19.890 --> 00:00:24.570 You cannot understand your data fully, oftentimes, 8 00:00:24.570 --> 00:00:27.000 without actually seeing it with your eyeballs. 9 00:00:27.000 --> 00:00:29.460 You will see stuff with your eyeballs 10 00:00:29.460 --> 00:00:31.920 that you would otherwise miss. 11 00:00:31.920 --> 00:00:34.380 The numbers won't tell you everything on their own, 12 00:00:34.380 --> 00:00:38.100 and here is the best demonstration of it. 13 00:00:38.100 --> 00:00:40.590 You can see four data sets here, okay? 14 00:00:40.590 --> 00:00:43.410 Numerals I, II, III, IV. 15 00:00:43.410 --> 00:00:47.370 Each data set has x and y, two variables. 16 00:00:47.370 --> 00:00:49.980 Each data set has 11 data points. 17 00:00:49.980 --> 00:00:53.220 There are 88 numbers here. This is not big data. 18 00:00:53.220 --> 00:00:56.130 This is tiny, tiny little data, right? 19 00:00:56.130 --> 00:01:00.660 And yet, I challenge you to analyze this data. 20 00:01:00.660 --> 00:01:02.133 Tell me what you see here. 21 00:01:03.330 --> 00:01:06.390 Tell me what insights you can gain. 22 00:01:06.390 --> 00:01:09.090 Now, if you spent a little bit of time with this, 23 00:01:09.090 --> 00:01:11.460 I bet you could tell me, eventually, 24 00:01:11.460 --> 00:01:14.520 that in data set IV all of the x-values 25 00:01:14.520 --> 00:01:16.530 are the same number except for one. 26 00:01:16.530 --> 00:01:18.360 Maybe you noticed that. 
27 00:01:18.360 --> 00:01:19.770 If you keep staring at this, 28 00:01:19.770 --> 00:01:22.410 maybe you'll notice that all of the other x-values 29 00:01:22.410 --> 00:01:24.960 for all three of the other data sets are the same, 30 00:01:24.960 --> 00:01:27.513 10.0, 8.0, 13.0, 9.0, et cetera. 31 00:01:28.620 --> 00:01:30.900 Maybe some of you will notice 32 00:01:30.900 --> 00:01:35.280 that all of the x-values are integers, .0, .0, .0, 33 00:01:35.280 --> 00:01:38.010 and all the y's are dot something else. 34 00:01:38.010 --> 00:01:39.813 Two decimal places, okay. 35 00:01:41.280 --> 00:01:43.920 What else can we learn about this data? 36 00:01:43.920 --> 00:01:48.780 I guess none of the values are below maybe 4-ish 37 00:01:48.780 --> 00:01:52.050 and none of them are above 18, 19. 38 00:01:52.050 --> 00:01:56.220 Like that's literally all I can tell. 39 00:01:56.220 --> 00:01:58.800 So I can make absolutely no useful sense 40 00:01:58.800 --> 00:02:01.080 of this data when looking at a table of numbers, 41 00:02:01.080 --> 00:02:03.960 we suck at that, so that's good to know. 42 00:02:03.960 --> 00:02:07.200 Now what do we do next when we're analyzing data? 43 00:02:07.200 --> 00:02:10.470 We usually apply statistics, so let's do that. 44 00:02:10.470 --> 00:02:13.680 Turns out statistics tells us all four 45 00:02:13.680 --> 00:02:16.170 of these data sets are pretty much exactly the same. 46 00:02:16.170 --> 00:02:19.890 The same mean average of x and y, 47 00:02:19.890 --> 00:02:22.410 the same variance of x and y, 48 00:02:22.410 --> 00:02:26.010 the same correlation, regression, and R squared, 49 00:02:26.010 --> 00:02:27.570 which is, you know, statistical stuff. 50 00:02:27.570 --> 00:02:29.310 Who cares if you don't know what it means? 51 00:02:29.310 --> 00:02:31.770 All I'm saying here is statistics tells us 52 00:02:31.770 --> 00:02:33.840 they're pretty much identical. 53 00:02:33.840 --> 00:02:35.850 So they're identical, I guess, right? 
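If you want to check that claim yourself instead of taking it on faith, here is a rough sketch in Python. One assumption up front: the talk never names the slide's data, but the description (four sets, 11 points each, shared x-values 10, 8, 13, 9, and one set whose x-values are all the same except one) matches the well-known Anscombe's quartet, so those published values are used here. The helper name `summarize` is made up for this example.

```python
from math import sqrt

# Assumption: the slide shows Anscombe's quartet; these are its published values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances, Pearson r, and the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Each printed row should come out nearly the same as the others, which is exactly the trap: the summary numbers give no hint of how the points are arranged.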
54 00:02:35.850 --> 00:02:37.950 Nope. They are far from identical. 55 00:02:37.950 --> 00:02:40.170 These are wildly different data sets. 56 00:02:40.170 --> 00:02:42.150 Only the visuals tell you that. 57 00:02:42.150 --> 00:02:46.290 So you must visualize your data when you're analyzing it. 58 00:02:46.290 --> 00:02:48.660 Don't just do pivot tables. 59 00:02:48.660 --> 00:02:51.060 You know, I didn't go crazy with visualizations 60 00:02:51.060 --> 00:02:54.540 on that data set that we looked at earlier, but I did a few, 61 00:02:54.540 --> 00:02:57.930 some distribution diagrams, a couple scatter plots. 62 00:02:57.930 --> 00:03:00.480 They helped me see what's going on in the data 63 00:03:00.480 --> 00:03:02.700 in a way that the statistics cannot do, 64 00:03:02.700 --> 00:03:04.900 and certainly the table of numbers will not do. 65 00:03:05.880 --> 00:03:07.680 Why? Why does that work? 66 00:03:07.680 --> 00:03:09.930 Why is it so important to do that? 67 00:03:09.930 --> 00:03:12.390 The primary argument, there are really two of them, 68 00:03:12.390 --> 00:03:17.130 is that literally up to half of your brain's job 69 00:03:17.130 --> 00:03:19.380 is processing visual information. 70 00:03:19.380 --> 00:03:22.020 Truly half of your brain is devoted to processing visuals, 71 00:03:22.020 --> 00:03:24.720 which by the way, come in through your eyeballs, 72 00:03:24.720 --> 00:03:28.710 which contain 70% of the sensory receptors in your body. 73 00:03:28.710 --> 00:03:30.930 You are a visual creature. 74 00:03:30.930 --> 00:03:35.520 On top of that, you're a visual learner. 75 00:03:35.520 --> 00:03:38.820 Very important to understand the picture superiority effect. 76 00:03:38.820 --> 00:03:41.400 If you share text with people 77 00:03:41.400 --> 00:03:43.140 and you come back and ask them what they remember, 78 00:03:43.140 --> 00:03:45.750 they remember a tiny, tiny little bit. 
79 00:03:45.750 --> 00:03:49.200 If you just add images to the same text 80 00:03:49.200 --> 00:03:51.150 that retention rate skyrockets, 81 00:03:51.150 --> 00:03:53.070 that's called the picture superiority effect, 82 00:03:53.070 --> 00:03:55.560 a well-known psychological effect. 83 00:03:55.560 --> 00:03:59.730 So you must visualize stuff to help your audience understand 84 00:03:59.730 --> 00:04:01.590 and remember your content. 85 00:04:01.590 --> 00:04:04.170 Now, what do we do about this? 86 00:04:04.170 --> 00:04:05.940 How do we figure out what to do? 87 00:04:05.940 --> 00:04:08.670 How do we identify the visuals et cetera, et cetera? 88 00:04:08.670 --> 00:04:10.590 We're gonna talk more about that 89 00:04:10.590 --> 00:04:12.060 when we talk about picking the right chart, 90 00:04:12.060 --> 00:04:13.560 later in this course. 91 00:04:13.560 --> 00:04:17.250 But I think it's important for you to identify and recognize 92 00:04:17.250 --> 00:04:22.200 the four primary categories of visualizations. 93 00:04:22.200 --> 00:04:24.210 Because if you think in these terms 94 00:04:24.210 --> 00:04:26.640 then it makes it easier to know what to do 95 00:04:26.640 --> 00:04:28.410 with your data as you're thinking 96 00:04:28.410 --> 00:04:32.760 about how to use visuals to bubble up the insights 97 00:04:32.760 --> 00:04:35.250 and also communicate them with your audiences. 98 00:04:35.250 --> 00:04:36.420 So briefly, like I said, 99 00:04:36.420 --> 00:04:37.560 we're gonna talk about this more 100 00:04:37.560 --> 00:04:40.710 in module five, six, or seven. 101 00:04:40.710 --> 00:04:42.570 I can't remember which one. 102 00:04:42.570 --> 00:04:44.940 Number one, distributions. 103 00:04:44.940 --> 00:04:48.240 It is very important during data analytics 104 00:04:48.240 --> 00:04:50.670 that you look at distributions of your dataset. 
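One concrete way to look at a distribution is to compute the handful of numbers a box-and-whisker plot draws. The sketch below is only an illustration: the values are invented (imagine revenue figures in millions), the helper name `box_plot_summary` is made up, and the 1.5 x IQR rule for flagging outliers is the common Tukey convention, which may not be what any particular charting tool uses.

```python
import statistics

# Hypothetical values for illustration only; not data from the talk.
values = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35, 180, 240]


def box_plot_summary(data):
    """Compute quartiles and flag outliers with Tukey's 1.5 * IQR fences."""
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": min(data),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(data),
        "outliers": outliers,
    }


summary = box_plot_summary(values)
print(summary)
```

With numbers like these, the median sits low and a couple of values land far above the upper fence, which is the kind of shape a box plot makes obvious at a glance.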
105 00:04:50.670 --> 00:04:52.410 I showed you in the other dataset 106 00:04:52.410 --> 00:04:53.970 some box and whisker plots. 107 00:04:53.970 --> 00:04:56.790 They help me see the range of values, 108 00:04:56.790 --> 00:04:58.980 what's typical, oh, the median's way down here. 109 00:04:58.980 --> 00:05:00.270 And there's some outliers way up here. 110 00:05:00.270 --> 00:05:01.380 How weird are those outliers? 111 00:05:01.380 --> 00:05:02.760 They look pretty darn weird 112 00:05:02.760 --> 00:05:04.650 'cause the median is way down here. 113 00:05:04.650 --> 00:05:06.930 Distributions give me a sense of that range, 114 00:05:06.930 --> 00:05:08.730 the overall population, the spread, 115 00:05:08.730 --> 00:05:11.340 the clustering of the values. 116 00:05:11.340 --> 00:05:14.880 As an analyst, it's good context 117 00:05:14.880 --> 00:05:16.560 'cause then when I look at this movie over here 118 00:05:16.560 --> 00:05:19.440 it has a number, and I know, is that a weird number 119 00:05:19.440 --> 00:05:21.360 or a normal number, right? 120 00:05:21.360 --> 00:05:23.820 I know where it lives in the distribution, 121 00:05:23.820 --> 00:05:28.440 that's the power of distributions, one of the powers. 122 00:05:28.440 --> 00:05:31.530 We also frequently think about visualizations 123 00:05:31.530 --> 00:05:35.220 in the context of magnitudes and ranks, 124 00:05:35.220 --> 00:05:36.600 which are slightly different. 125 00:05:36.600 --> 00:05:39.030 When I wanna show you the magnitude of numbers, 126 00:05:39.030 --> 00:05:41.610 it's about allowing you to see the actual value, 127 00:05:41.610 --> 00:05:43.200 but sometimes I just need you to know the rank. 128 00:05:43.200 --> 00:05:45.090 This one's in position one, two, three, four, five, 129 00:05:45.090 --> 00:05:47.400 without knowing what the actual values are. 
130 00:05:47.400 --> 00:05:50.130 Either way, both of those, which have, 131 00:05:50.130 --> 00:05:52.440 by the way, different visuals that enable those tasks, 132 00:05:52.440 --> 00:05:53.740 some overlapping ones too. 133 00:05:54.690 --> 00:05:58.773 Both of those generally enable the task of comparison. 134 00:05:59.730 --> 00:06:03.900 Data visualization, data analytics, data communications, 135 00:06:03.900 --> 00:06:06.660 comparison is not always the number one thing. 136 00:06:06.660 --> 00:06:07.830 People think of it that way. 137 00:06:07.830 --> 00:06:08.850 They use that word a lot 138 00:06:08.850 --> 00:06:10.477 and sometimes it is important to ask yourself, 139 00:06:10.477 --> 00:06:12.000 "Compared to what?" 140 00:06:12.000 --> 00:06:13.830 Almost always it's good to ask that, 141 00:06:13.830 --> 00:06:17.670 but the task for your audience isn't always comparison. 142 00:06:17.670 --> 00:06:19.650 Some charts are built for comparison. 143 00:06:19.650 --> 00:06:21.660 Some charts are terrible for comparison. 144 00:06:21.660 --> 00:06:24.570 So another major category to think about. 145 00:06:24.570 --> 00:06:27.240 Another category is composition, right? 146 00:06:27.240 --> 00:06:30.210 You want to understand the part to whole relationship, 147 00:06:30.210 --> 00:06:32.220 the proportional share of values, 148 00:06:32.220 --> 00:06:36.480 the percentage, the overall makeup of stuff in categories, 149 00:06:36.480 --> 00:06:38.910 segmentation, there's a bunch of words for it. 150 00:06:38.910 --> 00:06:40.770 There are composition charts like pie charts, 151 00:06:40.770 --> 00:06:42.540 stacked column charts, et cetera. 152 00:06:42.540 --> 00:06:45.030 There are some charts that do not enable an understanding 153 00:06:45.030 --> 00:06:47.400 of composition whatsoever. 154 00:06:47.400 --> 00:06:48.900 So you gotta think about it this way. 
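The part-to-whole idea behind composition charts comes down to simple arithmetic: each category's value divided by the total. Here is a minimal sketch, with invented category counts and a made-up helper name `shares`, that produces the percentages a pie or stacked column chart would encode.

```python
# Hypothetical category counts (say, movies per genre); not data from the talk.
counts = {"Drama": 40, "Comedy": 30, "Action": 20, "Documentary": 10}


def shares(category_counts):
    """Return each category's percentage share of the total (part-to-whole)."""
    total = sum(category_counts.values())
    return {name: 100 * count / total for name, count in category_counts.items()}


# Print the largest share first so the makeup is easy to read.
for name, pct in sorted(shares(counts).items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12} {pct:5.1f}%")
```

The shares always sum to 100, which is the property that makes a chart a composition chart in the first place.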
155 00:06:48.900 --> 00:06:50.550 And last but not least, 156 00:06:50.550 --> 00:06:53.280 for the four big categories we think of a lot of times, 157 00:06:53.280 --> 00:06:56.100 especially in analysis, is correlation. 158 00:06:56.100 --> 00:06:57.000 When you wanna understand 159 00:06:57.000 --> 00:06:59.550 are these two variables moving in tandem, 160 00:06:59.550 --> 00:07:01.740 in the same or in opposite directions? 161 00:07:01.740 --> 00:07:05.850 And if so, how strongly? Is it statistically significant? 162 00:07:05.850 --> 00:07:08.910 Does it matter that as runtime goes up 163 00:07:08.910 --> 00:07:11.400 movie quality goes up, I don't even know if that's true, 164 00:07:11.400 --> 00:07:15.300 by the way, and is it statistically significant? 165 00:07:15.300 --> 00:07:18.240 Maybe there's a very slight correlation mathematically, 166 00:07:18.240 --> 00:07:19.770 but does it really matter? 167 00:07:19.770 --> 00:07:21.960 You gotta understand this stuff, okay? 168 00:07:21.960 --> 00:07:24.780 So are you gonna use all four of those 169 00:07:24.780 --> 00:07:26.460 when you're visualizing data, 170 00:07:26.460 --> 00:07:29.220 especially during data analytics? 171 00:07:29.220 --> 00:07:31.980 Not necessarily, but you might. 172 00:07:31.980 --> 00:07:36.060 You might, especially during analytics, explore comparison, 173 00:07:36.060 --> 00:07:38.190 even though your data story may end up 174 00:07:38.190 --> 00:07:40.140 not being about comparison at all. 175 00:07:40.140 --> 00:07:43.650 So visualize your data, it's really important to understand 176 00:07:43.650 --> 00:07:45.810 what you're looking at to provide context 177 00:07:45.810 --> 00:07:48.780 so that other stuff makes sense to you. 178 00:07:48.780 --> 00:07:51.570 You will see stuff you would otherwise miss. 
179 00:07:51.570 --> 00:07:53.130 And looking at all four of these, 180 00:07:53.130 --> 00:07:54.990 or at least thinking about these four categories 181 00:07:54.990 --> 00:07:56.820 and which ones you do need to apply, 182 00:07:56.820 --> 00:07:58.563 is a really good place to start.