1 00:00:00,540 --> 00:00:02,870 - [Instructor] We continue our study of hash tables 2 00:00:02,870 --> 00:00:04,853 with the topic of rehashing. 3 00:00:05,740 --> 00:00:07,650 Now I left you with a couple of questions 4 00:00:07,650 --> 00:00:09,610 from our previous video. 5 00:00:09,610 --> 00:00:11,970 One was what can we do when we run out of space 6 00:00:11,970 --> 00:00:13,490 in our hash table? 7 00:00:13,490 --> 00:00:15,530 And the other if we set our stride 8 00:00:15,530 --> 00:00:17,180 to some value greater than one, 9 00:00:17,180 --> 00:00:19,910 why is it a good idea to have a hash table size 10 00:00:19,910 --> 00:00:21,013 that's a prime number? 11 00:00:22,490 --> 00:00:26,570 Now let's pursue the second question and take a brief detour 12 00:00:26,570 --> 00:00:28,303 before we get into rehashing. 13 00:00:30,360 --> 00:00:33,840 Here we have two hash tables, one of size 11, 14 00:00:33,840 --> 00:00:34,690 which is prime. 15 00:00:34,690 --> 00:00:36,570 That's the one on the top. 16 00:00:36,570 --> 00:00:39,540 And the one below is of size 12. 17 00:00:39,540 --> 00:00:42,050 12 of course, is not prime. 18 00:00:42,050 --> 00:00:44,170 And let's say we're doing linear probing 19 00:00:44,170 --> 00:00:45,750 with a stride of two. 20 00:00:45,750 --> 00:00:50,750 And we have some key value that returns a index of zero 21 00:00:51,410 --> 00:00:53,200 in both cases. 22 00:00:53,200 --> 00:00:55,380 And our first step in linear probing, 23 00:00:55,380 --> 00:00:57,030 we'll take a stride of length two, 24 00:00:57,030 --> 00:00:59,333 and wind up at index two. 25 00:01:00,520 --> 00:01:03,830 At the next step, we'll wind up at index four, 26 00:01:03,830 --> 00:01:08,403 index six, index eight, index 10. 27 00:01:09,530 --> 00:01:12,730 Now I'll bet you can see what's gonna happen next here. 28 00:01:12,730 --> 00:01:14,410 What's going to happen, 29 00:01:14,410 --> 00:01:19,410 is that here in the table that's size 12, 30 00:01:20,640 --> 00:01:23,770 we're gonna take a stride of two, one, two. 31 00:01:23,770 --> 00:01:26,490 It's going to land us back at zero, 32 00:01:26,490 --> 00:01:27,850 and we're just gonna continue 33 00:01:27,850 --> 00:01:32,610 to cycle through these six indices. 34 00:01:32,610 --> 00:01:37,610 We will never ever hit one three, five, seven, nine, or 11. 35 00:01:39,180 --> 00:01:43,630 So here we go in the table that's prime. 36 00:01:43,630 --> 00:01:45,470 We take a stride of two. 37 00:01:45,470 --> 00:01:46,703 We wind up at one, 38 00:01:48,360 --> 00:01:53,000 three, five, seven, nine. 39 00:01:53,000 --> 00:01:55,393 And you see when we have a prime number, 40 00:01:57,380 --> 00:02:01,340 we hit every single value with a stride of two. 41 00:02:01,340 --> 00:02:05,100 Here, because two is a factor of 12, 42 00:02:05,100 --> 00:02:08,020 we kept returning to the same values again 43 00:02:08,020 --> 00:02:09,363 and again and again. 44 00:02:10,490 --> 00:02:13,160 This is one of the reasons why prime numbers 45 00:02:13,160 --> 00:02:15,463 are good sizes for hash tables. 46 00:02:16,670 --> 00:02:19,100 The idea is that if the stride 47 00:02:19,100 --> 00:02:22,020 and the table size are not coprime, 48 00:02:22,020 --> 00:02:24,360 coprime means that two numbers 49 00:02:24,360 --> 00:02:27,890 don't share any common factors other than one, 50 00:02:27,890 --> 00:02:30,680 then you can't probe the entire table 51 00:02:30,680 --> 00:02:33,140 from any given starting point. 52 00:02:33,140 --> 00:02:34,113 That's one reason. 53 00:02:35,630 --> 00:02:37,080 That's it for the detour. 54 00:02:37,080 --> 00:02:38,823 Now let's go on to rehashing. 55 00:02:41,610 --> 00:02:43,680 Now, earlier we had asked the question 56 00:02:43,680 --> 00:02:46,630 what can we do when we run out of space in our hash table? 57 00:02:46,630 --> 00:02:49,560 But it turns out that we get ourselves into trouble 58 00:02:49,560 --> 00:02:52,803 before we exhaust all of the free space in our hash table. 59 00:02:53,910 --> 00:02:55,830 And that's because the more entries we have 60 00:02:55,830 --> 00:02:57,890 in our hash table, the more likely it is 61 00:02:57,890 --> 00:03:00,213 that we have a collision on our next insert. 62 00:03:01,459 --> 00:03:03,420 And the other problem is the more entries we have 63 00:03:03,420 --> 00:03:06,810 in our table, the longer a typical probe sequence 64 00:03:06,810 --> 00:03:10,133 becomes when we're inserting, finding, and removing entries. 65 00:03:11,270 --> 00:03:15,190 And as a result, performance degrades. 66 00:03:15,190 --> 00:03:18,200 So rehashing comes to the rescue. 67 00:03:18,200 --> 00:03:19,690 What do we do? 68 00:03:19,690 --> 00:03:23,520 We set a limit, some maximum value for our load factor 69 00:03:23,520 --> 00:03:26,210 which is also called the fill percentage. 70 00:03:26,210 --> 00:03:29,400 And then on inserts, we check the load factor. 71 00:03:29,400 --> 00:03:31,380 And if it exceeds the limit, 72 00:03:31,380 --> 00:03:34,510 then we create a new, bigger table. 73 00:03:34,510 --> 00:03:36,190 And we insert all the objects 74 00:03:36,190 --> 00:03:39,350 from the old table into the new table. 75 00:03:39,350 --> 00:03:41,163 And then we delete the old table. 76 00:03:42,940 --> 00:03:46,900 And since the table size changes, the index calculated 77 00:03:46,900 --> 00:03:49,390 from our hash function will change for each item, 78 00:03:49,390 --> 00:03:51,420 hence the term rehashing. 79 00:03:51,420 --> 00:03:53,690 All objects will get a new hash value 80 00:03:53,690 --> 00:03:55,773 when we insert them into the new table. 81 00:03:57,020 --> 00:03:58,980 But then the question becomes 82 00:03:58,980 --> 00:04:01,570 how big should we make our new table? 83 00:04:01,570 --> 00:04:02,890 And as a rule of thumb, 84 00:04:02,890 --> 00:04:05,970 we take the old table size, we double it. 85 00:04:05,970 --> 00:04:08,350 And then we find the next prime number 86 00:04:08,350 --> 00:04:11,033 and use this as a size for our new table. 87 00:04:12,630 --> 00:04:17,260 So for example, if our old size was 11, 88 00:04:17,260 --> 00:04:19,130 we'd take two to get 22. 89 00:04:19,130 --> 00:04:20,893 And the next prime is 23. 90 00:04:22,030 --> 00:04:24,750 If we had a table of size 17, 91 00:04:24,750 --> 00:04:28,490 and we hit our threshold and we needed to rehash, 92 00:04:28,490 --> 00:04:31,600 we would multiply that by two to get 34. 93 00:04:31,600 --> 00:04:33,170 The next prime is 37. 94 00:04:33,170 --> 00:04:35,950 So we'd create a new table of size 37 95 00:04:35,950 --> 00:04:38,783 and rehash all of our elements into that new table. 96 00:04:39,630 --> 00:04:42,113 If our old table were size 21, 97 00:04:43,094 --> 00:04:44,800 we double it to get 42. 98 00:04:44,800 --> 00:04:47,300 43 is the next prime. 99 00:04:47,300 --> 00:04:49,770 If we had a table of 103, 100 00:04:49,770 --> 00:04:53,640 we'd double it to get 206, and the next prime would be 211. 101 00:04:53,640 --> 00:04:56,883 So our resulting table would be of size 211. 102 00:04:58,100 --> 00:04:59,463 Let's look at an example. 103 00:05:00,690 --> 00:05:02,960 This should look pretty familiar to you by now. 104 00:05:02,960 --> 00:05:07,120 We have a hash table of size seven. 105 00:05:07,120 --> 00:05:10,150 We're using, again for demonstration purposes, 106 00:05:10,150 --> 00:05:12,740 the simplest possible hash function 107 00:05:12,740 --> 00:05:15,280 which is F of X equals X mod seven. 108 00:05:15,280 --> 00:05:17,450 That's the size of our table. 109 00:05:17,450 --> 00:05:21,300 And we've set a maximum load factor 110 00:05:21,300 --> 00:05:24,320 or fill percentage of 50%. 111 00:05:24,320 --> 00:05:27,493 Okay? And now let's perform some inserts. 112 00:05:28,920 --> 00:05:30,530 We insert 19. 113 00:05:30,530 --> 00:05:32,910 That goes to position five. 114 00:05:32,910 --> 00:05:35,863 The load factor becomes 14.2, that's 1/7. 115 00:05:37,380 --> 00:05:40,180 Okay. We insert eight. 116 00:05:40,180 --> 00:05:41,960 That goes to position one. 117 00:05:41,960 --> 00:05:45,280 Now our load factor becomes 28.6. 118 00:05:47,530 --> 00:05:49,120 We insert a 11. 119 00:05:49,120 --> 00:05:51,650 That goes to index four. 120 00:05:51,650 --> 00:05:54,270 And our load factor becomes 42.9. 121 00:05:54,270 --> 00:05:56,150 And you can see that on the next insert, 122 00:05:56,150 --> 00:05:57,963 we're going to exceed threshold. 123 00:05:58,900 --> 00:05:59,970 So there we go. 124 00:05:59,970 --> 00:06:01,620 We're inserting 17. 125 00:06:01,620 --> 00:06:03,030 That goes at index three. 126 00:06:03,030 --> 00:06:07,040 And it's time to rehash because our load factor 127 00:06:07,040 --> 00:06:08,920 now exceeds the threshold. 128 00:06:08,920 --> 00:06:11,520 57.1 is greater than 50. 129 00:06:11,520 --> 00:06:14,660 Okay. So what do we do? 130 00:06:14,660 --> 00:06:17,340 Well, we have to calculate the size of that table. 131 00:06:17,340 --> 00:06:20,070 So our old table size is seven. 132 00:06:20,070 --> 00:06:22,420 We multiply that by two, we get 14. 133 00:06:22,420 --> 00:06:25,200 And the next prime is 17. 134 00:06:25,200 --> 00:06:29,913 Okay. So now we create a table of size 17, 135 00:06:31,240 --> 00:06:32,210 our hash function, 136 00:06:32,210 --> 00:06:35,310 our minimal hash function changes now. 137 00:06:35,310 --> 00:06:38,470 It's now F of X equals X mod 17 138 00:06:38,470 --> 00:06:43,020 because we have 17 elements in our hash table. 139 00:06:43,020 --> 00:06:45,220 And right, now our load is zero. 140 00:06:45,220 --> 00:06:47,890 But what we're going to do is we're gonna take these items 141 00:06:47,890 --> 00:06:51,320 one at a time and rehash them and insert them 142 00:06:51,320 --> 00:06:52,993 into our new hash table. 143 00:06:53,870 --> 00:06:54,943 So let's do that. 144 00:06:55,890 --> 00:06:57,260 We take eight. 145 00:06:57,260 --> 00:06:58,720 That goes here. 146 00:06:58,720 --> 00:07:02,420 Load factor becomes 5.8%. 147 00:07:02,420 --> 00:07:04,100 Next one's gonna be 17. 148 00:07:04,100 --> 00:07:05,220 You takes 17. 149 00:07:05,220 --> 00:07:07,640 That's gonna land at index zero. 150 00:07:07,640 --> 00:07:11,290 Load factor becomes 11.7%. 151 00:07:12,720 --> 00:07:14,300 We insert 11. 152 00:07:14,300 --> 00:07:15,530 That goes here. 153 00:07:15,530 --> 00:07:19,110 Load factor becomes 17.6%. 154 00:07:20,400 --> 00:07:23,461 And we add 19. 155 00:07:23,461 --> 00:07:24,300 And that goes here. 156 00:07:24,300 --> 00:07:27,410 And our load factor becomes 23.5. 157 00:07:27,410 --> 00:07:30,930 And then when we're done, we just delete that old table. 158 00:07:30,930 --> 00:07:35,300 And now we have a new table that's been rehashed 159 00:07:35,300 --> 00:07:39,170 that now has a load factor well below our threshold. 160 00:07:39,170 --> 00:07:41,123 And we have plenty of room to grow. 161 00:07:43,160 --> 00:07:45,540 So rehashing, the way it works 162 00:07:45,540 --> 00:07:48,350 is that we rehash as needed on inserts 163 00:07:48,350 --> 00:07:51,223 to keep the load factor below a certain threshold. 164 00:07:52,280 --> 00:07:54,360 Typically, we use load factors 165 00:07:54,360 --> 00:07:58,193 of between 50% and 75% as thresholds. 166 00:07:59,200 --> 00:08:02,580 And if we use 50% as our threshold say, 167 00:08:02,580 --> 00:08:05,050 this means that at least half of our hash table 168 00:08:05,050 --> 00:08:07,360 will be empty at any given time. 169 00:08:07,360 --> 00:08:08,490 Now, why is this good? 170 00:08:08,490 --> 00:08:10,610 Well, it solves the problems that we stated 171 00:08:10,610 --> 00:08:12,310 at the very beginning. 172 00:08:12,310 --> 00:08:14,940 It'll reduce the frequency of collisions, 173 00:08:14,940 --> 00:08:17,770 and it'll result in shorter probing sequences. 174 00:08:17,770 --> 00:08:21,053 And so the performance of our hash table will be faster. 175 00:08:22,700 --> 00:08:25,580 Now, what do we need to implement rehashing? 176 00:08:25,580 --> 00:08:28,010 We need a function which when given an integer, 177 00:08:28,010 --> 00:08:30,223 calculates the next prime number. 178 00:08:31,470 --> 00:08:34,163 We need a method to perform rehashing. 179 00:08:35,140 --> 00:08:37,750 And we need to modify the insert function 180 00:08:37,750 --> 00:08:41,650 to check the load factor and to call our rehashing method 181 00:08:41,650 --> 00:08:44,660 as needed whenever we exceed that threshold. 182 00:08:44,660 --> 00:08:47,570 And that's what we'll see in the next video 183 00:08:47,570 --> 00:08:49,940 when we implement a hash table 184 00:08:49,940 --> 00:08:53,380 with linear probing and rehashing. 185 00:08:53,380 --> 00:08:54,330 That's all for now.