1 00:00:01,220 --> 00:00:03,680 - [Instructor] Okay. Now we've seen separate chaining. 2 00:00:03,680 --> 00:00:07,630 We've seen two collision resolution policies, 3 00:00:07,630 --> 00:00:10,340 linear probing, and quadratic probing. 4 00:00:10,340 --> 00:00:12,560 Now we're gonna look at one more collision 5 00:00:12,560 --> 00:00:16,010 resolution policy called double hashing. 6 00:00:16,010 --> 00:00:18,730 And double hashing is another approach 7 00:00:18,730 --> 00:00:21,070 to resolving hash collisions. 8 00:00:21,070 --> 00:00:22,530 We've seen that linear probing 9 00:00:22,530 --> 00:00:24,990 is prone to primary clustering. 10 00:00:24,990 --> 00:00:26,810 And while quadratic probing 11 00:00:26,810 --> 00:00:29,240 is designed to eliminate primary clustering, 12 00:00:29,240 --> 00:00:30,600 we've seen that it's prone 13 00:00:30,600 --> 00:00:32,440 to what's called secondary clustering, 14 00:00:32,440 --> 00:00:33,950 where the clusters are distributed 15 00:00:33,950 --> 00:00:37,210 throughout the table, but they exist nonetheless. 16 00:00:37,210 --> 00:00:38,950 Double hashing is designed to address 17 00:00:38,950 --> 00:00:40,333 both of these problems. 18 00:00:41,240 --> 00:00:43,790 So let's go back to our comparison of linear probing 19 00:00:43,790 --> 00:00:44,963 and quadratic probing. 20 00:00:46,890 --> 00:00:49,000 As you recall with linear probing, 21 00:00:49,000 --> 00:00:51,040 we have a fixed stride. 22 00:00:51,040 --> 00:00:52,670 Typically one. 23 00:00:52,670 --> 00:00:54,690 And if we have a collision, 24 00:00:54,690 --> 00:00:57,690 we step from position to position 25 00:00:57,690 --> 00:01:01,670 until we find an open place to make our insert. 26 00:01:01,670 --> 00:01:04,670 In the case of quadratic probing, we vary our stride 27 00:01:04,670 --> 00:01:08,280 with each step using the square of the number of steps. 28 00:01:08,280 --> 00:01:10,120 So the first step is one. 29 00:01:10,120 --> 00:01:11,600 The second step is four. 30 00:01:11,600 --> 00:01:12,960 The third step is nine. 31 00:01:12,960 --> 00:01:15,093 The fourth step is 16 and so on. 32 00:01:17,700 --> 00:01:21,390 Now, in the case of double hashing, we have a fixed stride 33 00:01:21,390 --> 00:01:25,400 for any given key, but the stride is calculated separately 34 00:01:25,400 --> 00:01:29,683 for each key using a second independent hash function. 35 00:01:30,950 --> 00:01:34,170 So two different keys that yield the same index 36 00:01:34,170 --> 00:01:37,080 is calculated from the primary hash function, 37 00:01:37,080 --> 00:01:39,000 should have a different stride 38 00:01:39,000 --> 00:01:41,633 as calculated from the secondary hash function. 39 00:01:42,490 --> 00:01:44,340 And that's how we prevent clustering. 40 00:01:46,520 --> 00:01:49,400 Now, what should the secondary hash function look like? 41 00:01:49,400 --> 00:01:51,630 Well, it clearly has to be different 42 00:01:51,630 --> 00:01:54,580 from the hash function used to get the index. 43 00:01:54,580 --> 00:01:57,310 And ideally, the output of the primary hash function 44 00:01:57,310 --> 00:01:58,920 and the secondary hash function 45 00:01:58,920 --> 00:02:01,440 should be what's called pairwise independent, 46 00:02:01,440 --> 00:02:03,393 meaning that they're uncorrelated. 47 00:02:04,810 --> 00:02:06,430 The secondary hash function, 48 00:02:06,430 --> 00:02:09,290 unlike the primary hash function should return values 49 00:02:09,290 --> 00:02:13,050 in the range of one to table size minus one. 50 00:02:13,050 --> 00:02:15,170 Those would be valid step sizes. 51 00:02:15,170 --> 00:02:17,570 Obviously, we don't want it to return zero 52 00:02:17,570 --> 00:02:20,010 because then it wouldn't probe. 53 00:02:20,010 --> 00:02:23,560 So that's why we have this slightly different range. 54 00:02:23,560 --> 00:02:26,520 And again, as with our primary hash function, 55 00:02:26,520 --> 00:02:29,510 the secondary hash function should distribute values 56 00:02:29,510 --> 00:02:32,273 as uniformly as possible within that range. 57 00:02:33,260 --> 00:02:36,190 So for example, we could calculate some number, h, 58 00:02:36,190 --> 00:02:39,470 and we could use Horner's method or some other method. 59 00:02:39,470 --> 00:02:43,240 And we would return p minus h mod p, 60 00:02:43,240 --> 00:02:47,240 where p is some prime number less than the table size. 61 00:02:47,240 --> 00:02:49,900 And that's a minimal example 62 00:02:49,900 --> 00:02:51,483 of a secondary hash function. 63 00:02:52,530 --> 00:02:55,580 The primary hash function gives us the starting point 64 00:02:55,580 --> 00:02:57,150 for our probe sequence. 65 00:02:57,150 --> 00:03:00,480 And the secondary hash function gives us the stride 66 00:03:00,480 --> 00:03:01,713 if we need to probe. 67 00:03:04,340 --> 00:03:08,530 So here we have a table showing our minimal 68 00:03:08,530 --> 00:03:10,210 primary hash function. 69 00:03:10,210 --> 00:03:13,990 This is f of x equals x mod 17. 70 00:03:13,990 --> 00:03:17,270 And we have a secondary hash function, g of x 71 00:03:17,270 --> 00:03:22,270 which is a smaller prime number, 13 minus x mod 13. 72 00:03:24,920 --> 00:03:29,310 And we see here that for each different key, 73 00:03:29,310 --> 00:03:33,130 I have shown here in the table, the primary hash function 74 00:03:33,130 --> 00:03:37,130 and the secondary hash function yield different values. 75 00:03:37,130 --> 00:03:38,780 They're are pairwise independent. 76 00:03:39,860 --> 00:03:42,820 Okay. And that's an ideal. 77 00:03:42,820 --> 00:03:46,470 Now let's look at what happens when we have a collision. 78 00:03:46,470 --> 00:03:51,140 Now here, I have all these keys, zero, 17, 34, 51, 68. 79 00:03:51,140 --> 00:03:53,530 They're all multiples of 17. 80 00:03:53,530 --> 00:03:57,570 So when we apply that to the primary hash function, 81 00:03:57,570 --> 00:03:59,050 we're always gonna get zero. 82 00:03:59,050 --> 00:04:03,780 So each one of these keys is going to have a hash collision 83 00:04:03,780 --> 00:04:05,433 in its primary hash function. 84 00:04:06,800 --> 00:04:08,010 But look at what's happening 85 00:04:08,010 --> 00:04:10,680 in the case of the secondary hash function. 86 00:04:10,680 --> 00:04:14,010 We get different values from our secondary hash function. 87 00:04:14,010 --> 00:04:16,030 Zero mod 13 is zero. 88 00:04:16,030 --> 00:04:19,180 And so our secondary hash gives us a stride of 13 89 00:04:19,180 --> 00:04:20,960 when the key is zero. 90 00:04:20,960 --> 00:04:23,920 When the key is 17, we get a stride of nine. 91 00:04:23,920 --> 00:04:25,957 17 mod 13 is four. 92 00:04:25,957 --> 00:04:28,700 And 13 minus four is nine. 93 00:04:28,700 --> 00:04:32,400 When the key is 34, we got a stride of five 94 00:04:32,400 --> 00:04:35,420 because 34 mod 13 is eight, 95 00:04:35,420 --> 00:04:38,760 and 13 minus eight is five, and so on. 96 00:04:38,760 --> 00:04:41,210 So you see, even though we have collisions 97 00:04:41,210 --> 00:04:42,750 in the primary key, 98 00:04:42,750 --> 00:04:45,403 we're getting different values for the secondary key. 99 00:04:46,270 --> 00:04:47,103 So you see, 100 00:04:47,103 --> 00:04:49,810 even when we have a collision on the insert index 101 00:04:49,810 --> 00:04:52,300 calculated from the primary hash function, 102 00:04:52,300 --> 00:04:55,530 we get a different stride from the secondary hash function. 103 00:04:55,530 --> 00:04:58,933 And so these probe sequences will all be different. 104 00:04:59,970 --> 00:05:01,673 Let's take a look at an example. 105 00:05:04,100 --> 00:05:06,540 Here again, we have our primary hash function, 106 00:05:06,540 --> 00:05:09,330 f of x equals x mod 17, 107 00:05:09,330 --> 00:05:10,960 and our secondary hash function, 108 00:05:10,960 --> 00:05:15,960 g of x equals 13 minus x mod 13. 109 00:05:16,370 --> 00:05:19,570 And it's important that this prime number 110 00:05:19,570 --> 00:05:21,393 be less than the table size. 111 00:05:22,550 --> 00:05:23,890 Okay? 112 00:05:23,890 --> 00:05:26,420 So let's make our first insert. 113 00:05:26,420 --> 00:05:28,410 Let's say we want to insert five. 114 00:05:28,410 --> 00:05:32,470 F of x which is this function here is gonna give us five, 115 00:05:32,470 --> 00:05:35,090 but g of x is going to give us eight. 116 00:05:35,090 --> 00:05:36,590 Now we don't have anything in our table, 117 00:05:36,590 --> 00:05:37,910 so we're not gonna have a collision. 118 00:05:37,910 --> 00:05:41,560 We're just gonna insert five at the specified index. 119 00:05:41,560 --> 00:05:42,793 So there we have five. 120 00:05:44,350 --> 00:05:48,170 Now on the next insert, let's say we want to insert 56. 121 00:05:48,170 --> 00:05:52,900 56 is gonna give us the same primary hash result. 122 00:05:52,900 --> 00:05:54,513 F of x equals five. 123 00:05:55,490 --> 00:05:57,430 Now g of x, 124 00:05:57,430 --> 00:06:01,120 the secondary hash function gives us a stride of nine. 125 00:06:01,120 --> 00:06:05,060 So when we insert, we're going to take a stride of nine, 126 00:06:05,060 --> 00:06:07,920 one, two, three, four, 127 00:06:07,920 --> 00:06:12,280 five, six, seven, eight, nine. 128 00:06:12,280 --> 00:06:14,530 And 56 is gonna go here. 129 00:06:14,530 --> 00:06:16,520 Now let's insert 39. 130 00:06:16,520 --> 00:06:19,760 Again, 39, I've set it up this way, 131 00:06:19,760 --> 00:06:21,130 is gonna cause a collision 132 00:06:21,130 --> 00:06:25,870 because f of x 39 mod 17 is five. 133 00:06:25,870 --> 00:06:28,160 But g of x gives us a different value 134 00:06:28,160 --> 00:06:30,030 than we got before, before it was nine. 135 00:06:30,030 --> 00:06:31,620 Now it's 13. 136 00:06:31,620 --> 00:06:35,360 So our probe is gonna be one, two, 137 00:06:35,360 --> 00:06:38,410 three, four, five, six, seven, 138 00:06:38,410 --> 00:06:41,457 eight, nine, 10, 11, 12. 139 00:06:41,457 --> 00:06:44,250 'Cause we wrap around 13. 140 00:06:44,250 --> 00:06:47,740 And so 39 is gonna go here. 141 00:06:47,740 --> 00:06:49,410 Let's do one more. 142 00:06:49,410 --> 00:06:50,670 Here's 22. 143 00:06:50,670 --> 00:06:52,710 Again, f of x gives us five. 144 00:06:52,710 --> 00:06:54,430 That's our primary hash function. 145 00:06:54,430 --> 00:06:57,180 22 mod 17 is five. 146 00:06:57,180 --> 00:06:59,690 But g of x gives us four. 147 00:06:59,690 --> 00:07:04,103 Because 13 minus 22 mod 13 is four. 148 00:07:05,080 --> 00:07:07,450 And so our stride is going to be four, 149 00:07:07,450 --> 00:07:10,670 one, two, three, four. 150 00:07:10,670 --> 00:07:12,393 And 22 goes there. 151 00:07:13,270 --> 00:07:15,223 So what just happened? 152 00:07:16,090 --> 00:07:18,163 Well, we had these keys. 153 00:07:19,000 --> 00:07:23,830 This was the result of our primary hash function, all five. 154 00:07:23,830 --> 00:07:26,260 And then our secondary hash function gave us 155 00:07:26,260 --> 00:07:29,630 these values eight, nine, 13, four. 156 00:07:29,630 --> 00:07:31,520 And so each of these inserts 157 00:07:31,520 --> 00:07:34,501 follows a different probe sequence. 158 00:07:34,501 --> 00:07:37,533 That's the idea behind double hashing. 159 00:07:39,180 --> 00:07:42,100 Now what's the probability of hash collisions 160 00:07:42,100 --> 00:07:43,563 having the same stride? 161 00:07:44,810 --> 00:07:47,690 In order for hash collisions to have the same stride 162 00:07:47,690 --> 00:07:50,860 for their probe sequence, both the primary hash function 163 00:07:50,860 --> 00:07:52,520 and the secondary hash function 164 00:07:52,520 --> 00:07:56,290 would have to return the same value for two different keys. 165 00:07:56,290 --> 00:07:59,630 Now in an ideal world with perfect hash functions, 166 00:07:59,630 --> 00:08:02,170 the outputs would be distributed uniformly, 167 00:08:02,170 --> 00:08:04,653 just as if the hash functions were random. 168 00:08:05,500 --> 00:08:09,300 And we'd have something like a one over n 169 00:08:09,300 --> 00:08:11,867 for the likelihood of any given value 170 00:08:11,867 --> 00:08:14,380 from our primary hash function, 171 00:08:14,380 --> 00:08:16,210 times one over n, 172 00:08:16,210 --> 00:08:17,940 the likelihood of any value 173 00:08:17,940 --> 00:08:20,180 from our secondary hash function. 174 00:08:20,180 --> 00:08:22,380 And then the probability of both 175 00:08:22,380 --> 00:08:26,140 of those yielding the same values would be one 176 00:08:26,140 --> 00:08:27,063 over n squared. 177 00:08:28,010 --> 00:08:30,920 Now, of course, this is a lower limit on the probability 178 00:08:30,920 --> 00:08:34,330 that hash collisions have the same stride 179 00:08:34,330 --> 00:08:35,590 because in the real world, 180 00:08:35,590 --> 00:08:37,600 our hash functions won't be perfect. 181 00:08:37,600 --> 00:08:39,700 That is, they won't distribute the values 182 00:08:39,700 --> 00:08:43,653 with perfect uniformity, but we want them to be close. 183 00:08:44,700 --> 00:08:47,860 So in the real world, this probability will be greater 184 00:08:47,860 --> 00:08:51,730 than one over n squared, but hopefully not too much greater. 185 00:08:51,730 --> 00:08:53,200 So if we have a hash table 186 00:08:53,200 --> 00:08:55,950 of around a thousand elements, the probability 187 00:08:55,950 --> 00:08:59,040 of hash collisions having the same probe sequence 188 00:08:59,040 --> 00:09:01,402 would be around one in a million. 189 00:09:01,402 --> 00:09:02,352 And that's not bad. 190 00:09:04,150 --> 00:09:07,050 So again, to compare our CRPs, 191 00:09:07,050 --> 00:09:10,350 our collision resolution policies, 192 00:09:10,350 --> 00:09:15,350 we see that all of these probing approaches probe 193 00:09:16,330 --> 00:09:18,260 on a collision. 194 00:09:18,260 --> 00:09:19,890 Linear probing, quadratic, probing, 195 00:09:19,890 --> 00:09:22,090 and double hashing, whereas separate chaining, 196 00:09:22,090 --> 00:09:23,223 we extend the chain. 197 00:09:24,620 --> 00:09:27,940 Again, all three of these probing approaches 198 00:09:27,940 --> 00:09:29,910 have a fixed upper limit on the number 199 00:09:29,910 --> 00:09:32,730 of objects that we can insert into our hash table. 200 00:09:32,730 --> 00:09:35,520 Whereas again, primary chaining is limited only 201 00:09:35,520 --> 00:09:37,583 by the memory and system constraints. 202 00:09:38,810 --> 00:09:41,700 Linear probing has a fixed stride, typically one. 203 00:09:41,700 --> 00:09:43,930 And that's always the same stride 204 00:09:43,930 --> 00:09:46,063 whenever we have a hash collision. 205 00:09:47,190 --> 00:09:50,390 Quadratic probing varies the stride on each step. 206 00:09:50,390 --> 00:09:55,390 We take the square of the step to get our stride 207 00:09:55,490 --> 00:09:56,743 for any given step. 208 00:09:57,640 --> 00:10:00,190 And double hashing has a fixed stride, 209 00:10:00,190 --> 00:10:02,693 but it's calculated by a second hash. 210 00:10:03,660 --> 00:10:06,710 Linear probing is prone to primary clustering. 211 00:10:06,710 --> 00:10:10,040 Quadratic probing is prone to secondary clustering. 212 00:10:10,040 --> 00:10:13,500 Double hashing reduces clustering substantially. 213 00:10:13,500 --> 00:10:15,560 And of course, in the case of separate chaining, 214 00:10:15,560 --> 00:10:16,993 clustering does not occur. 215 00:10:19,550 --> 00:10:22,480 So in summary, with double hashing, 216 00:10:22,480 --> 00:10:26,600 like other open addressing techniques we've seen, 217 00:10:26,600 --> 00:10:29,493 we only allow a single object at a given index. 218 00:10:30,470 --> 00:10:31,790 When we have a hash collision, 219 00:10:31,790 --> 00:10:34,490 we probe our hash table one step at a time, 220 00:10:34,490 --> 00:10:36,470 but with a stride that's been calculated 221 00:10:36,470 --> 00:10:38,973 by a second independent hash function. 222 00:10:40,660 --> 00:10:42,900 And because we use a second hash function, 223 00:10:42,900 --> 00:10:45,110 the stride depends on the data 224 00:10:45,110 --> 00:10:46,890 and this makes it very unlikely 225 00:10:46,890 --> 00:10:49,210 that two insertions with the same hash value 226 00:10:49,210 --> 00:10:52,610 for the first index would follow the same probe sequence. 227 00:10:52,610 --> 00:10:54,530 Again, they'd have to have in effect, 228 00:10:54,530 --> 00:10:56,883 two concurrent hash collisions. 229 00:10:58,420 --> 00:11:01,700 And finally, double hashing, as you might expect, 230 00:11:01,700 --> 00:11:03,360 has a fixed limit on the number 231 00:11:03,360 --> 00:11:06,260 of objects that we can insert into our hash table. 232 00:11:06,260 --> 00:11:08,680 So as with linear probing, 233 00:11:08,680 --> 00:11:11,620 and quadratic probing, rehashing is gonna be a thing 234 00:11:11,620 --> 00:11:13,590 when we're doing double hashing as well. 235 00:11:13,590 --> 00:11:16,493 So now I'll leave you with some questions. 236 00:11:18,400 --> 00:11:20,090 We have two basic strategies 237 00:11:20,090 --> 00:11:23,200 for hash collision, chaining and probing. 238 00:11:23,200 --> 00:11:26,100 Where probing methods are linear probing, 239 00:11:26,100 --> 00:11:28,790 quadratic probing, and double hashing. 240 00:11:28,790 --> 00:11:31,093 Which do you think uses more memory? 241 00:11:32,570 --> 00:11:34,933 Which of these do you think is faster? 242 00:11:36,760 --> 00:11:39,363 And how would you calculate their complexities? 243 00:11:40,440 --> 00:11:42,670 Give these questions some thought 244 00:11:42,670 --> 00:11:44,760 and we'll see you in a later video. 245 00:11:44,760 --> 00:11:45,760 That's all for now. 246 00:11:45,760 --> 00:11:46,593 Thanks.