1 00:00:01,260 --> 00:00:02,370 - [Instructor] Now we're gonna continue 2 00:00:02,370 --> 00:00:03,930 our study of hash tables 3 00:00:03,930 --> 00:00:07,400 and look at another collision resolution policy 4 00:00:07,400 --> 00:00:08,933 called linear probing. 5 00:00:10,250 --> 00:00:13,660 Earlier, we saw our first collision resolution policy, 6 00:00:13,660 --> 00:00:15,490 separate chaining. 7 00:00:15,490 --> 00:00:18,030 Now linear probing is another approach 8 00:00:18,030 --> 00:00:19,913 to resolving hash collisions. 9 00:00:20,780 --> 00:00:22,630 Unlike separate chaining, 10 00:00:22,630 --> 00:00:25,800 however, we only allow a single object 11 00:00:25,800 --> 00:00:28,173 at a given index in our hash table. 12 00:00:29,080 --> 00:00:30,363 So how does it work? 13 00:00:31,570 --> 00:00:34,980 Well, the idea behind linear probing is simple. 14 00:00:34,980 --> 00:00:36,830 If a collision occurs, 15 00:00:36,830 --> 00:00:40,450 we probe our hash table taking one step at a time 16 00:00:40,450 --> 00:00:42,470 until we find an empty spot 17 00:00:42,470 --> 00:00:44,453 for the object we wish to insert. 18 00:00:45,320 --> 00:00:47,600 We call the step size the stride 19 00:00:47,600 --> 00:00:49,750 and it's typically set to one, 20 00:00:49,750 --> 00:00:52,930 but basically linear probing is just what you might think 21 00:00:52,930 --> 00:00:54,050 given its name. 22 00:00:54,050 --> 00:00:55,160 On collisions, 23 00:00:55,160 --> 00:00:58,520 we probe in a linear step-wise fashion 24 00:00:58,520 --> 00:01:01,210 until we find a home for our new object. 25 00:01:01,210 --> 00:01:02,923 Let's look at a simple example. 26 00:01:03,890 --> 00:01:07,070 Here's an example similar to the one we saw earlier. 27 00:01:07,070 --> 00:01:10,020 Again, we have a hash table of size seven. 28 00:01:10,020 --> 00:01:13,090 But in this case, unlike separate chaining, 29 00:01:13,090 --> 00:01:16,753 each element in the hash table can hold only one object. 30 00:01:18,220 --> 00:01:19,130 As before, 31 00:01:19,130 --> 00:01:21,980 we're gonna use the simplest possible hash function 32 00:01:21,980 --> 00:01:24,080 for demonstration purposes, 33 00:01:24,080 --> 00:01:26,343 f(x) equals x mod seven. 34 00:01:27,270 --> 00:01:29,850 So let's insert 12, 35 00:01:29,850 --> 00:01:32,950 12 mod seven equals five. 36 00:01:32,950 --> 00:01:34,950 So 12 goes there. 37 00:01:34,950 --> 00:01:35,973 So far so good. 38 00:01:37,430 --> 00:01:39,420 Now we'll insert nine, 39 00:01:39,420 --> 00:01:41,223 nine mod seven is two. 40 00:01:42,060 --> 00:01:44,023 So nine is gonna go here. 41 00:01:45,330 --> 00:01:48,320 But what happens when we want to insert 44? 42 00:01:48,320 --> 00:01:51,030 44 mod seven is also two. 43 00:01:51,030 --> 00:01:52,363 So we have a collision. 44 00:01:53,980 --> 00:01:55,550 Where does 44 go? 45 00:01:55,550 --> 00:01:57,800 Well, we start probing at two. 46 00:01:57,800 --> 00:01:59,870 We found that there's a collision. 47 00:01:59,870 --> 00:02:03,470 So we take one step to the right, to three, 48 00:02:03,470 --> 00:02:07,323 and we insert 44 in the first open position we find. 49 00:02:08,200 --> 00:02:09,700 That's linear probing. 50 00:02:09,700 --> 00:02:10,653 Let's continue. 51 00:02:11,860 --> 00:02:13,600 If we want to insert 30, 52 00:02:13,600 --> 00:02:15,750 30 mod seven is also two. 53 00:02:15,750 --> 00:02:17,950 So we have a collision again. 54 00:02:17,950 --> 00:02:19,070 What do we do? 55 00:02:19,070 --> 00:02:22,000 Well, we start our probe sequence at two, 56 00:02:22,000 --> 00:02:24,600 and we continue until we find an empty position 57 00:02:24,600 --> 00:02:26,590 in which to insert our object. 58 00:02:26,590 --> 00:02:28,663 So 30 goes at index four. 59 00:02:30,670 --> 00:02:32,200 Now we want to insert two. 60 00:02:32,200 --> 00:02:34,470 Again, two mod seven is two, 61 00:02:34,470 --> 00:02:36,240 another collision. 62 00:02:36,240 --> 00:02:37,690 So what do we do? 63 00:02:37,690 --> 00:02:40,530 We probe until we find an empty spot. 64 00:02:40,530 --> 00:02:42,353 Now two goes at index six. 65 00:02:44,360 --> 00:02:46,600 Now we want to insert 16. 66 00:02:46,600 --> 00:02:48,100 Again, we have a collision 67 00:02:48,100 --> 00:02:50,160 because 16 mod seven is two. 68 00:02:50,160 --> 00:02:50,993 Now what do we do? 69 00:02:50,993 --> 00:02:54,110 We've run out of positions in our index. 70 00:02:54,110 --> 00:02:55,440 Do you think? 71 00:02:55,440 --> 00:02:57,240 No, I think we wrap around. 72 00:02:57,240 --> 00:02:58,623 So watch what happens. 73 00:02:59,540 --> 00:03:01,100 We get to the end of the vector 74 00:03:01,100 --> 00:03:02,850 and we wrap around to the beginning. 75 00:03:02,850 --> 00:03:06,623 So 16 is inserted at position zero. 76 00:03:08,430 --> 00:03:10,530 Now this sequence of steps that we take 77 00:03:10,530 --> 00:03:12,303 is called a probe sequence. 78 00:03:13,260 --> 00:03:14,460 So in the case of nine, 79 00:03:14,460 --> 00:03:16,223 our probe sequence was just two. 80 00:03:17,060 --> 00:03:20,610 In the case of 44, we had to probe two and three. 81 00:03:20,610 --> 00:03:23,720 For 30, we had to probe two, three, and four. 82 00:03:23,720 --> 00:03:27,860 For two, we had to probe two, three, four, five, and six. 83 00:03:27,860 --> 00:03:32,160 And for 16, we probed all the way back around to zero again. 84 00:03:32,160 --> 00:03:36,140 So now what happens when we want to find an element? 85 00:03:36,140 --> 00:03:40,060 Well, we just follow the same probe sequence 86 00:03:40,060 --> 00:03:41,680 until we find our target 87 00:03:41,680 --> 00:03:44,930 or we encounter an empty position in our hash table, 88 00:03:44,930 --> 00:03:46,560 at which point we'll know that the target 89 00:03:46,560 --> 00:03:48,253 doesn't exist in the hash table. 90 00:03:50,060 --> 00:03:52,600 So for example, to find 30, 91 00:03:52,600 --> 00:03:54,670 we calculate the hash in two. 92 00:03:54,670 --> 00:03:56,820 We start at index two, 93 00:03:56,820 --> 00:03:58,310 we find that it's not there, 94 00:03:58,310 --> 00:04:00,563 and so we probe until we find 30. 95 00:04:01,520 --> 00:04:05,860 Now 72 is not in this hash table. 96 00:04:05,860 --> 00:04:07,650 So what happens then? 97 00:04:07,650 --> 00:04:09,090 Well, we calculate the hash, 98 00:04:09,090 --> 00:04:10,100 again, we get two, 99 00:04:10,100 --> 00:04:11,570 we start at two, 100 00:04:11,570 --> 00:04:14,040 and we continue our linear probe 101 00:04:14,040 --> 00:04:17,100 until we arrive at an empty position. 102 00:04:17,100 --> 00:04:18,770 Once we hit that empty position, 103 00:04:18,770 --> 00:04:22,070 then we know that 72 is not in the hash table. 104 00:04:22,070 --> 00:04:23,783 So that's how find works. 105 00:04:25,480 --> 00:04:26,990 Now what about removing? 106 00:04:26,990 --> 00:04:30,200 Well, let's say we want to remove 16. 107 00:04:30,200 --> 00:04:31,200 Calculate that hash, 108 00:04:31,200 --> 00:04:32,100 we start at two, 109 00:04:32,100 --> 00:04:35,940 we probe in a linear fashion until we get to 16, 110 00:04:35,940 --> 00:04:38,833 we match that value and then we delete it. 111 00:04:39,980 --> 00:04:41,053 So far so good. 112 00:04:42,290 --> 00:04:45,400 Now what happens if we delete 30? 113 00:04:45,400 --> 00:04:47,830 Well, we start at two, 114 00:04:47,830 --> 00:04:48,970 we take two steps, 115 00:04:48,970 --> 00:04:51,703 we find 30 and we delete it. 116 00:04:52,980 --> 00:04:54,950 But now we got a problem. 117 00:04:54,950 --> 00:04:56,433 Can you see what it is? 118 00:04:57,530 --> 00:05:00,760 What happens if we try to find two now? 119 00:05:00,760 --> 00:05:02,420 If we try to find two, 120 00:05:02,420 --> 00:05:04,140 we're gonna start at index two 121 00:05:04,140 --> 00:05:07,960 because that's the value returned by our hash function. 122 00:05:07,960 --> 00:05:12,960 And then we probe twice and we arrive at an empty position. 123 00:05:12,970 --> 00:05:16,770 And so now we think two's not in our table. 124 00:05:16,770 --> 00:05:21,130 But it is in fact in our table at index six here. 125 00:05:21,130 --> 00:05:24,080 And by deleting 30, 126 00:05:24,080 --> 00:05:26,970 we broke our probe sequence, 127 00:05:26,970 --> 00:05:27,970 and that's not good. 128 00:05:29,090 --> 00:05:33,290 So what we do in cases of deleting 129 00:05:33,290 --> 00:05:36,230 is rather than just remove the item 130 00:05:36,230 --> 00:05:39,943 and return the position to an empty state, 131 00:05:41,070 --> 00:05:43,240 we have another state in our table 132 00:05:43,240 --> 00:05:47,433 that indicates an item has been deleted or removed. 133 00:05:48,400 --> 00:05:52,233 And in this case, it preserves our probe sequence. 134 00:05:53,210 --> 00:05:55,860 So now when we want to find two, 135 00:05:55,860 --> 00:05:58,370 we don't stop at four thinking we haven't found it, 136 00:05:58,370 --> 00:06:02,300 we can continue on our way until we find two. 137 00:06:02,300 --> 00:06:05,270 So you'll see more about how this works 138 00:06:05,270 --> 00:06:08,010 when we implement this in C++. 139 00:06:08,010 --> 00:06:11,170 But be aware that we need to ensure 140 00:06:11,170 --> 00:06:13,433 that our probe sequence isn't broken. 141 00:06:16,860 --> 00:06:19,690 Linear probing is prone to a phenomenon 142 00:06:19,690 --> 00:06:22,080 called primary clustering. 143 00:06:22,080 --> 00:06:25,610 And that's because every time there's a hash collision, 144 00:06:25,610 --> 00:06:27,370 we just look in the next position, 145 00:06:27,370 --> 00:06:29,570 and then the next, and then the next. 146 00:06:29,570 --> 00:06:33,060 And we tend to fill up blocks of contiguous positions 147 00:06:33,060 --> 00:06:34,610 in our hash table. 148 00:06:34,610 --> 00:06:36,053 That's primary clustering. 149 00:06:37,810 --> 00:06:41,180 The downside, when we have primary clustering, 150 00:06:41,180 --> 00:06:43,760 is we have to keep probing again and again 151 00:06:43,760 --> 00:06:47,430 and again and again until we find an empty spot. 152 00:06:47,430 --> 00:06:50,900 So that's the problem that we'll address a little later. 153 00:06:50,900 --> 00:06:54,103 But you should be aware that it occurs with linear probing. 154 00:06:55,260 --> 00:06:58,460 So let's compare linear probing with separate chaining. 155 00:06:58,460 --> 00:07:02,630 In the case of linear probing, on collisions, we probe. 156 00:07:02,630 --> 00:07:03,760 With separate chaining, 157 00:07:03,760 --> 00:07:05,920 on collisions, we just extend the chain. 158 00:07:05,920 --> 00:07:07,573 We add one more to the bucket. 159 00:07:08,840 --> 00:07:10,520 In the case of linear probing, 160 00:07:10,520 --> 00:07:12,080 there's a fixed upper limit 161 00:07:12,080 --> 00:07:14,620 on the number of objects we can insert. 162 00:07:14,620 --> 00:07:17,240 That's the size of our hash table. 163 00:07:17,240 --> 00:07:18,770 We don't have that restriction 164 00:07:18,770 --> 00:07:20,970 in the case of separate chaining, 165 00:07:20,970 --> 00:07:23,880 because we can extend the chains. 166 00:07:23,880 --> 00:07:27,400 Linear probing again is prone to primary clustering 167 00:07:27,400 --> 00:07:28,880 but with separate chaining, 168 00:07:28,880 --> 00:07:30,340 clustering does not occur 169 00:07:30,340 --> 00:07:32,090 since we're not doing that probing. 170 00:07:34,090 --> 00:07:34,923 In summary, 171 00:07:34,923 --> 00:07:37,000 with a linear probing scheme, 172 00:07:37,000 --> 00:07:40,233 we only allow a single object at a given index. 173 00:07:41,090 --> 00:07:42,210 Upon hash collisions, 174 00:07:42,210 --> 00:07:45,170 we probe our hash table one step at a time 175 00:07:45,170 --> 00:07:46,940 until we find an empty position 176 00:07:46,940 --> 00:07:48,340 where we can put our object. 177 00:07:49,730 --> 00:07:51,050 And unlike separate chaining 178 00:07:51,050 --> 00:07:53,180 where we can extend our chains, 179 00:07:53,180 --> 00:07:55,360 linear probing has a fixed upper limit 180 00:07:55,360 --> 00:07:56,410 on the number of objects 181 00:07:56,410 --> 00:07:58,373 we can insert into our hash table. 182 00:07:59,610 --> 00:08:00,530 So before we finish, 183 00:08:00,530 --> 00:08:02,880 I'm gonna leave you with a couple of questions. 184 00:08:03,750 --> 00:08:05,950 Because we have this upper limit, 185 00:08:05,950 --> 00:08:07,380 the size of our hash table, 186 00:08:07,380 --> 00:08:11,280 what can we do when we run out of space in our hash table? 187 00:08:11,280 --> 00:08:12,883 We'll see about that shortly. 188 00:08:14,190 --> 00:08:16,350 And then another question. 189 00:08:16,350 --> 00:08:19,480 If we set our stride to some value greater than one, 190 00:08:19,480 --> 00:08:21,390 say two or three, 191 00:08:21,390 --> 00:08:24,710 why is it a good idea to have a hash table size 192 00:08:24,710 --> 00:08:26,123 that's a prime number? 193 00:08:27,260 --> 00:08:29,320 Give that some thought, too. 194 00:08:29,320 --> 00:08:30,870 I'll see you in the next video.