1 00:00:00,680 --> 00:00:02,249 - [Instructor] Okay, now we're gonna look 2 00:00:02,249 --> 00:00:06,010 at our first collision resolution policy, 3 00:00:06,010 --> 00:00:07,830 which is called separate chaining, 4 00:00:07,830 --> 00:00:09,473 or in the book, just chaining. 5 00:00:10,990 --> 00:00:13,620 And you recall from an earlier video, 6 00:00:13,620 --> 00:00:16,350 our introduction to hash tables left something out, 7 00:00:16,350 --> 00:00:18,370 and we made a note of that fact. 8 00:00:18,370 --> 00:00:21,490 And that is what do we do when two different keys 9 00:00:21,490 --> 00:00:23,783 yield the same index into our hash table? 10 00:00:24,810 --> 00:00:26,250 This is called hash collision, 11 00:00:26,250 --> 00:00:28,800 and it's something that we need to handle. 12 00:00:28,800 --> 00:00:32,840 We saw in our coding demonstration of a naive hash table, 13 00:00:32,840 --> 00:00:34,610 that when a collision occurred, 14 00:00:34,610 --> 00:00:37,970 our hash table overwrote an existing entry. 15 00:00:37,970 --> 00:00:39,300 Remember our dictionary 16 00:00:39,300 --> 00:00:41,563 and the objects for dentist and fish? 17 00:00:42,430 --> 00:00:44,930 Both keys yielded the same index, 18 00:00:44,930 --> 00:00:47,700 and the entry for dentist was overwritten 19 00:00:47,700 --> 00:00:49,330 with the entry of fish. 20 00:00:49,330 --> 00:00:51,630 Obviously, we don't want that to happen. 21 00:00:51,630 --> 00:00:53,250 So, what do we do? 22 00:00:53,250 --> 00:00:56,483 We have to implement some collision resolution policy. 23 00:00:58,140 --> 00:01:00,550 Now, we've shown that collisions are inevitable. 24 00:01:00,550 --> 00:01:03,760 So, we're going to take the example of pigeonholes to heart 25 00:01:03,760 --> 00:01:08,100 in our first collision resolution policy, separate chaining. 26 00:01:08,100 --> 00:01:10,150 The idea behind separate chaining 27 00:01:10,150 --> 00:01:12,910 is the same as having two or more pigeons 28 00:01:12,910 --> 00:01:14,544 in the same pigeonhole. 29 00:01:14,544 --> 00:01:18,840 That is, instead of holding just one object, 30 00:01:18,840 --> 00:01:21,300 we allow our elements in our hash table 31 00:01:21,300 --> 00:01:23,103 to hold more than one object. 32 00:01:24,720 --> 00:01:26,400 We saw how to do this, in fact, 33 00:01:26,400 --> 00:01:29,040 when we implemented bucket sort and and rated sort. 34 00:01:29,040 --> 00:01:31,070 And you recall both of those algorithms 35 00:01:31,070 --> 00:01:34,130 used a data structure that was a vector of buckets, 36 00:01:34,130 --> 00:01:36,290 that is a vector of vectors. 37 00:01:36,290 --> 00:01:38,670 And we'll use exactly the same structure 38 00:01:38,670 --> 00:01:40,360 for separate chaining. 39 00:01:40,360 --> 00:01:43,490 For an example, let's choose a small problem instance. 40 00:01:43,490 --> 00:01:45,950 Here, we have a hash table of size seven 41 00:01:45,950 --> 00:01:47,933 with indices from zero to six. 42 00:01:49,770 --> 00:01:52,320 And again, for purposes of this demonstration, 43 00:01:52,320 --> 00:01:55,330 let's choose the simplest possible hash function, 44 00:01:55,330 --> 00:01:59,987 f(x) equals X mod 7, where our inputs are integers. 45 00:01:59,987 --> 00:02:03,840 So, we're gonna take integers and we're going to insert them 46 00:02:03,840 --> 00:02:06,400 into this hash table using separate chain, 47 00:02:06,400 --> 00:02:08,330 and we'll see how that works. 48 00:02:08,330 --> 00:02:10,450 So, I used a random number generator 49 00:02:10,450 --> 00:02:14,120 to pick random numbers between zero and 200. 50 00:02:14,120 --> 00:02:15,540 We start with 13. 51 00:02:15,540 --> 00:02:18,350 13 mod 7 is 6. 52 00:02:18,350 --> 00:02:21,460 So, 13 goes in bucket six. 53 00:02:21,460 --> 00:02:22,293 Now, 179. 54 00:02:23,223 --> 00:02:25,627 179 mod 7 is 4. 55 00:02:25,627 --> 00:02:28,693 And so, 179 goes in bucket four. 56 00:02:30,620 --> 00:02:35,080 114, 114 mod 7 is 2. 57 00:02:35,080 --> 00:02:37,293 So, 114 goes in bucket two. 58 00:02:38,160 --> 00:02:40,240 5 mod 7 is 5. 59 00:02:40,240 --> 00:02:42,303 So, that goes a bucket five. 60 00:02:43,470 --> 00:02:45,000 20 mod 7 is 6. 61 00:02:45,000 --> 00:02:47,360 Now, this is our first collision. 62 00:02:47,360 --> 00:02:48,650 Now, what happens? 63 00:02:48,650 --> 00:02:50,720 We're just going to add it to the bucket. 64 00:02:50,720 --> 00:02:51,920 13's already there. 65 00:02:51,920 --> 00:02:52,753 That's okay. 66 00:02:52,753 --> 00:02:54,390 We're not gonna overwrite 13, 67 00:02:54,390 --> 00:02:57,653 like we did in our naive implementation of hash table. 68 00:02:58,660 --> 00:03:00,607 We're just gonna add it to the bucket. 69 00:03:02,010 --> 00:03:04,340 73 mod 7 is 3. 70 00:03:04,340 --> 00:03:06,003 So, that goes in bucket three. 71 00:03:07,080 --> 00:03:09,840 180 mod 7 equals 5. 72 00:03:09,840 --> 00:03:11,450 So, now we have another collision. 73 00:03:11,450 --> 00:03:13,390 We're just gonna add it to the bucket. 74 00:03:13,390 --> 00:03:16,183 So, now we have 5 and 180 in bucket five. 75 00:03:17,790 --> 00:03:19,740 48 mod 7 is 6, 76 00:03:19,740 --> 00:03:22,890 so we're gonna add another value to that bucket six. 77 00:03:22,890 --> 00:03:25,270 And there's 48. 78 00:03:25,270 --> 00:03:29,010 46 mod 7 is 4, so we're gonna add that to the bucket 79 00:03:29,010 --> 00:03:30,890 that already has 179. 80 00:03:30,890 --> 00:03:32,880 There's 46 now. 81 00:03:32,880 --> 00:03:35,850 88 mod 7 is 4, so we're gonna add that again 82 00:03:35,850 --> 00:03:37,703 to the same bucket, 88. 83 00:03:39,010 --> 00:03:41,897 196 mod 7 is 0. 84 00:03:41,897 --> 00:03:43,553 There's 196. 85 00:03:44,810 --> 00:03:47,130 And that's how separate chaining works. 86 00:03:47,130 --> 00:03:50,510 So, insertion takes constant time, 87 00:03:50,510 --> 00:03:52,640 or it could take constant time 88 00:03:52,640 --> 00:03:55,660 because calculating the hash takes constant time, 89 00:03:55,660 --> 00:03:58,253 and inserting into a vector takes constant time. 90 00:03:59,200 --> 00:04:01,880 But, what about duplicate values? 91 00:04:01,880 --> 00:04:04,700 Our hash table becomes a little less useful 92 00:04:04,700 --> 00:04:06,880 if we allow duplicate values. 93 00:04:06,880 --> 00:04:08,990 And it makes matters messy, 94 00:04:08,990 --> 00:04:11,440 especially if we want to remove values. 95 00:04:11,440 --> 00:04:13,520 Now, we have to remove duplicate values. 96 00:04:13,520 --> 00:04:16,090 So, we don't want duplicate values in our table. 97 00:04:16,090 --> 00:04:19,226 So, what happens when we have duplicate values? 98 00:04:19,226 --> 00:04:21,970 Well, we have to search through the bucket, 99 00:04:21,970 --> 00:04:24,460 and that kills the constant time. 100 00:04:24,460 --> 00:04:28,570 It no longer takes constant time to insert an object 101 00:04:28,570 --> 00:04:31,750 into a hash table with separate chaining. 102 00:04:31,750 --> 00:04:33,500 We have to search through the bucket first 103 00:04:33,500 --> 00:04:37,260 in order to confirm that that value doesn't already exist 104 00:04:37,260 --> 00:04:38,870 in the bucket. 105 00:04:38,870 --> 00:04:40,725 What about find and remove? 106 00:04:40,725 --> 00:04:42,190 For find and remove, 107 00:04:42,190 --> 00:04:45,010 we have to search through the bucket regardless, okay? 108 00:04:45,010 --> 00:04:46,493 So, that's a given. 109 00:04:48,430 --> 00:04:51,460 Now, if we have a table of size b, 110 00:04:51,460 --> 00:04:53,740 where b is the number of buckets, 111 00:04:53,740 --> 00:04:56,790 and we have n objects we wish to store, 112 00:04:56,790 --> 00:05:01,220 then on average, a bucket will store n divided by b objects. 113 00:05:01,220 --> 00:05:02,683 Simple arithmetic. 114 00:05:04,040 --> 00:05:06,370 And if we use linear search to check to see 115 00:05:06,370 --> 00:05:08,680 if an object is already in our bucket, 116 00:05:08,680 --> 00:05:10,520 because our buckets aren't sorted, 117 00:05:10,520 --> 00:05:14,730 then that takes big O(n/b) time. 118 00:05:14,730 --> 00:05:16,370 And we also have to search through a bucket 119 00:05:16,370 --> 00:05:18,510 when finding or removing. 120 00:05:18,510 --> 00:05:21,300 Now, note that the book uses a linked list for buckets 121 00:05:21,300 --> 00:05:22,490 and here we're using vectors, 122 00:05:22,490 --> 00:05:24,950 but this doesn't change the fact that in either case 123 00:05:24,950 --> 00:05:26,513 we still need to search. 124 00:05:28,560 --> 00:05:32,500 So, in summary, separate chaining uses a vector of vectors, 125 00:05:32,500 --> 00:05:35,534 or a vector of linked lists, if you will, 126 00:05:35,534 --> 00:05:37,740 to handle collisions. 127 00:05:37,740 --> 00:05:39,330 Objects with the same index 128 00:05:39,330 --> 00:05:40,880 calculated from the hash function 129 00:05:40,880 --> 00:05:42,390 wind up in the same bucket. 130 00:05:42,390 --> 00:05:44,773 Again, whether it's a vector or a linked list. 131 00:05:46,010 --> 00:05:50,160 And this requires us to search on each insertion, 132 00:05:50,160 --> 00:05:52,750 find, or remove operation, okay? 133 00:05:52,750 --> 00:05:55,090 We have to search on insertion 134 00:05:55,090 --> 00:05:57,820 to verify that an element doesn't already exist 135 00:05:57,820 --> 00:05:59,200 in our bucket. 136 00:05:59,200 --> 00:06:01,010 We always have to search on a find, 137 00:06:01,010 --> 00:06:03,113 and we always have to search on a remove. 138 00:06:04,230 --> 00:06:07,410 But, separate chaining has the benefit 139 00:06:07,410 --> 00:06:09,520 that it's very easy to implement. 140 00:06:09,520 --> 00:06:11,570 And we'll see that in a subsequent video. 141 00:06:13,070 --> 00:06:13,903 Now, before we go, 142 00:06:13,903 --> 00:06:16,200 I'm gonna leave you with a couple of questions. 143 00:06:17,120 --> 00:06:18,980 First, if we sorted our buckets, 144 00:06:18,980 --> 00:06:21,580 we could improve the search time to big O(log (n/b)) 145 00:06:24,280 --> 00:06:26,280 using binary search. 146 00:06:26,280 --> 00:06:30,317 Or if our data were suitable, big O(log log(n/b)) 147 00:06:32,460 --> 00:06:34,830 using interpolation search. 148 00:06:34,830 --> 00:06:37,250 Does it make sense to do this? 149 00:06:37,250 --> 00:06:38,620 Why or why not? 150 00:06:38,620 --> 00:06:40,270 Give that some thought. 151 00:06:40,270 --> 00:06:42,300 Also, we're gonna look 152 00:06:42,300 --> 00:06:45,543 at a number of other collision resolution policies. 153 00:06:45,543 --> 00:06:47,440 So, can you think of other ways 154 00:06:47,440 --> 00:06:49,020 that we might handle collisions 155 00:06:49,020 --> 00:06:51,223 that don't require the use of buckets? 156 00:06:53,330 --> 00:06:55,730 I'll leave you with those two questions for now. 157 00:06:56,840 --> 00:06:57,673 More later.