1
00:00:00,680 --> 00:00:02,249
- [Instructor] Okay, now we're gonna look

2
00:00:02,249 --> 00:00:06,010
at our first collision resolution policy,

3
00:00:06,010 --> 00:00:07,830
which is called separate chaining,

4
00:00:07,830 --> 00:00:09,473
or in the book, just chaining.

5
00:00:10,990 --> 00:00:13,620
And you recall from an earlier video,

6
00:00:13,620 --> 00:00:16,350
our introduction to hash
tables left something out,

7
00:00:16,350 --> 00:00:18,370
and we made a note of that fact.

8
00:00:18,370 --> 00:00:21,490
And that is what do we do
when two different keys

9
00:00:21,490 --> 00:00:23,783
yield the same index into our hash table?

10
00:00:24,810 --> 00:00:26,250
This is called hash collision,

11
00:00:26,250 --> 00:00:28,800
and it's something that we need to handle.

12
00:00:28,800 --> 00:00:32,840
We saw in our coding demonstration
of a naive hash table,

13
00:00:32,840 --> 00:00:34,610
that when a collision occurred,

14
00:00:34,610 --> 00:00:37,970
our hash table overwrote
an existing entry.

15
00:00:37,970 --> 00:00:39,300
Remember our dictionary

16
00:00:39,300 --> 00:00:41,563
and the objects for dentist and fish?

17
00:00:42,430 --> 00:00:44,930
Both keys yielded the same index,

18
00:00:44,930 --> 00:00:47,700
and the entry for dentist was overwritten

19
00:00:47,700 --> 00:00:49,330
with the entry of fish.

20
00:00:49,330 --> 00:00:51,630
Obviously, we don't want that to happen.

21
00:00:51,630 --> 00:00:53,250
So, what do we do?

22
00:00:53,250 --> 00:00:56,483
We have to implement some
collision resolution policy.

23
00:00:58,140 --> 00:01:00,550
Now, we've shown that
collisions are inevitable.

24
00:01:00,550 --> 00:01:03,760
So, we're going to take the
example of pigeonholes to heart

25
00:01:03,760 --> 00:01:08,100
in our first collision resolution
policy, separate chaining.

26
00:01:08,100 --> 00:01:10,150
The idea behind separate chaining

27
00:01:10,150 --> 00:01:12,910
is the same as having two or more pigeons

28
00:01:12,910 --> 00:01:14,544
in the same pigeonhole.

29
00:01:14,544 --> 00:01:18,840
That is, instead of
holding just one object,

30
00:01:18,840 --> 00:01:21,300
we allow our elements in our hash table

31
00:01:21,300 --> 00:01:23,103
to hold more than one object.

32
00:01:24,720 --> 00:01:26,400
We saw how to do this, in fact,

33
00:01:26,400 --> 00:01:29,040
when we implemented bucket
sort and and rated sort.

34
00:01:29,040 --> 00:01:31,070
And you recall both of those algorithms

35
00:01:31,070 --> 00:01:34,130
used a data structure that
was a vector of buckets,

36
00:01:34,130 --> 00:01:36,290
that is a vector of vectors.

37
00:01:36,290 --> 00:01:38,670
And we'll use exactly the same structure

38
00:01:38,670 --> 00:01:40,360
for separate chaining.

39
00:01:40,360 --> 00:01:43,490
For an example, let's choose
a small problem instance.

40
00:01:43,490 --> 00:01:45,950
Here, we have a hash table of size seven

41
00:01:45,950 --> 00:01:47,933
with indices from zero to six.

42
00:01:49,770 --> 00:01:52,320
And again, for purposes
of this demonstration,

43
00:01:52,320 --> 00:01:55,330
let's choose the simplest
possible hash function,

44
00:01:55,330 --> 00:01:59,987
f(x) equals X mod 7, where
our inputs are integers.

45
00:01:59,987 --> 00:02:03,840
So, we're gonna take integers
and we're going to insert them

46
00:02:03,840 --> 00:02:06,400
into this hash table using separate chain,

47
00:02:06,400 --> 00:02:08,330
and we'll see how that works.

48
00:02:08,330 --> 00:02:10,450
So, I used a random number generator

49
00:02:10,450 --> 00:02:14,120
to pick random numbers
between zero and 200.

50
00:02:14,120 --> 00:02:15,540
We start with 13.

51
00:02:15,540 --> 00:02:18,350
13 mod 7 is 6.

52
00:02:18,350 --> 00:02:21,460
So, 13 goes in bucket six.

53
00:02:21,460 --> 00:02:22,293
Now, 179.

54
00:02:23,223 --> 00:02:25,627
179 mod 7 is 4.

55
00:02:25,627 --> 00:02:28,693
And so, 179 goes in bucket four.

56
00:02:30,620 --> 00:02:35,080
114, 114 mod 7 is 2.

57
00:02:35,080 --> 00:02:37,293
So, 114 goes in bucket two.

58
00:02:38,160 --> 00:02:40,240
5 mod 7 is 5.

59
00:02:40,240 --> 00:02:42,303
So, that goes a bucket five.

60
00:02:43,470 --> 00:02:45,000
20 mod 7 is 6.

61
00:02:45,000 --> 00:02:47,360
Now, this is our first collision.

62
00:02:47,360 --> 00:02:48,650
Now, what happens?

63
00:02:48,650 --> 00:02:50,720
We're just going to add it to the bucket.

64
00:02:50,720 --> 00:02:51,920
13's already there.

65
00:02:51,920 --> 00:02:52,753
That's okay.

66
00:02:52,753 --> 00:02:54,390
We're not gonna overwrite 13,

67
00:02:54,390 --> 00:02:57,653
like we did in our naive
implementation of hash table.

68
00:02:58,660 --> 00:03:00,607
We're just gonna add it to the bucket.

69
00:03:02,010 --> 00:03:04,340
73 mod 7 is 3.

70
00:03:04,340 --> 00:03:06,003
So, that goes in bucket three.

71
00:03:07,080 --> 00:03:09,840
180 mod 7 equals 5.

72
00:03:09,840 --> 00:03:11,450
So, now we have another collision.

73
00:03:11,450 --> 00:03:13,390
We're just gonna add it to the bucket.

74
00:03:13,390 --> 00:03:16,183
So, now we have 5 and 180 in bucket five.

75
00:03:17,790 --> 00:03:19,740
48 mod 7 is 6,

76
00:03:19,740 --> 00:03:22,890
so we're gonna add another
value to that bucket six.

77
00:03:22,890 --> 00:03:25,270
And there's 48.

78
00:03:25,270 --> 00:03:29,010
46 mod 7 is 4, so we're
gonna add that to the bucket

79
00:03:29,010 --> 00:03:30,890
that already has 179.

80
00:03:30,890 --> 00:03:32,880
There's 46 now.

81
00:03:32,880 --> 00:03:35,850
88 mod 7 is 4, so we're
gonna add that again

82
00:03:35,850 --> 00:03:37,703
to the same bucket, 88.

83
00:03:39,010 --> 00:03:41,897
196 mod 7 is 0.

84
00:03:41,897 --> 00:03:43,553
There's 196.

85
00:03:44,810 --> 00:03:47,130
And that's how separate chaining works.

86
00:03:47,130 --> 00:03:50,510
So, insertion takes constant time,

87
00:03:50,510 --> 00:03:52,640
or it could take constant time

88
00:03:52,640 --> 00:03:55,660
because calculating the
hash takes constant time,

89
00:03:55,660 --> 00:03:58,253
and inserting into a
vector takes constant time.

90
00:03:59,200 --> 00:04:01,880
But, what about duplicate values?

91
00:04:01,880 --> 00:04:04,700
Our hash table becomes
a little less useful

92
00:04:04,700 --> 00:04:06,880
if we allow duplicate values.

93
00:04:06,880 --> 00:04:08,990
And it makes matters messy,

94
00:04:08,990 --> 00:04:11,440
especially if we want to remove values.

95
00:04:11,440 --> 00:04:13,520
Now, we have to remove duplicate values.

96
00:04:13,520 --> 00:04:16,090
So, we don't want duplicate
values in our table.

97
00:04:16,090 --> 00:04:19,226
So, what happens when we
have duplicate values?

98
00:04:19,226 --> 00:04:21,970
Well, we have to search
through the bucket,

99
00:04:21,970 --> 00:04:24,460
and that kills the constant time.

100
00:04:24,460 --> 00:04:28,570
It no longer takes constant
time to insert an object

101
00:04:28,570 --> 00:04:31,750
into a hash table with separate chaining.

102
00:04:31,750 --> 00:04:33,500
We have to search through the bucket first

103
00:04:33,500 --> 00:04:37,260
in order to confirm that that
value doesn't already exist

104
00:04:37,260 --> 00:04:38,870
in the bucket.

105
00:04:38,870 --> 00:04:40,725
What about find and remove?

106
00:04:40,725 --> 00:04:42,190
For find and remove,

107
00:04:42,190 --> 00:04:45,010
we have to search through
the bucket regardless, okay?

108
00:04:45,010 --> 00:04:46,493
So, that's a given.

109
00:04:48,430 --> 00:04:51,460
Now, if we have a table of size b,

110
00:04:51,460 --> 00:04:53,740
where b is the number of buckets,

111
00:04:53,740 --> 00:04:56,790
and we have n objects we wish to store,

112
00:04:56,790 --> 00:05:01,220
then on average, a bucket will
store n divided by b objects.

113
00:05:01,220 --> 00:05:02,683
Simple arithmetic.

114
00:05:04,040 --> 00:05:06,370
And if we use linear
search to check to see

115
00:05:06,370 --> 00:05:08,680
if an object is already in our bucket,

116
00:05:08,680 --> 00:05:10,520
because our buckets aren't sorted,

117
00:05:10,520 --> 00:05:14,730
then that takes big O(n/b) time.

118
00:05:14,730 --> 00:05:16,370
And we also have to
search through a bucket

119
00:05:16,370 --> 00:05:18,510
when finding or removing.

120
00:05:18,510 --> 00:05:21,300
Now, note that the book uses
a linked list for buckets

121
00:05:21,300 --> 00:05:22,490
and here we're using vectors,

122
00:05:22,490 --> 00:05:24,950
but this doesn't change the
fact that in either case

123
00:05:24,950 --> 00:05:26,513
we still need to search.

124
00:05:28,560 --> 00:05:32,500
So, in summary, separate chaining
uses a vector of vectors,

125
00:05:32,500 --> 00:05:35,534
or a vector of linked lists, if you will,

126
00:05:35,534 --> 00:05:37,740
to handle collisions.

127
00:05:37,740 --> 00:05:39,330
Objects with the same index

128
00:05:39,330 --> 00:05:40,880
calculated from the hash function

129
00:05:40,880 --> 00:05:42,390
wind up in the same bucket.

130
00:05:42,390 --> 00:05:44,773
Again, whether it's a
vector or a linked list.

131
00:05:46,010 --> 00:05:50,160
And this requires us to
search on each insertion,

132
00:05:50,160 --> 00:05:52,750
find, or remove operation, okay?

133
00:05:52,750 --> 00:05:55,090
We have to search on insertion

134
00:05:55,090 --> 00:05:57,820
to verify that an element
doesn't already exist

135
00:05:57,820 --> 00:05:59,200
in our bucket.

136
00:05:59,200 --> 00:06:01,010
We always have to search on a find,

137
00:06:01,010 --> 00:06:03,113
and we always have to search on a remove.

138
00:06:04,230 --> 00:06:07,410
But, separate chaining has the benefit

139
00:06:07,410 --> 00:06:09,520
that it's very easy to implement.

140
00:06:09,520 --> 00:06:11,570
And we'll see that in a subsequent video.

141
00:06:13,070 --> 00:06:13,903
Now, before we go,

142
00:06:13,903 --> 00:06:16,200
I'm gonna leave you with
a couple of questions.

143
00:06:17,120 --> 00:06:18,980
First, if we sorted our buckets,

144
00:06:18,980 --> 00:06:21,580
we could improve the search
time to big O(log (n/b))

145
00:06:24,280 --> 00:06:26,280
using binary search.

146
00:06:26,280 --> 00:06:30,317
Or if our data were
suitable, big O(log log(n/b))

147
00:06:32,460 --> 00:06:34,830
using interpolation search.

148
00:06:34,830 --> 00:06:37,250
Does it make sense to do this?

149
00:06:37,250 --> 00:06:38,620
Why or why not?

150
00:06:38,620 --> 00:06:40,270
Give that some thought.

151
00:06:40,270 --> 00:06:42,300
Also, we're gonna look

152
00:06:42,300 --> 00:06:45,543
at a number of other
collision resolution policies.

153
00:06:45,543 --> 00:06:47,440
So, can you think of other ways

154
00:06:47,440 --> 00:06:49,020
that we might handle collisions

155
00:06:49,020 --> 00:06:51,223
that don't require the use of buckets?

156
00:06:53,330 --> 00:06:55,730
I'll leave you with those
two questions for now.

157
00:06:56,840 --> 00:06:57,673
More later.