1
00:00:01,220 --> 00:00:03,680
- [Instructor] Okay. Now
we've seen separate chaining.

2
00:00:03,680 --> 00:00:07,630
We've seen two collision
resolution policies,

3
00:00:07,630 --> 00:00:10,340
linear probing, and quadratic probing.

4
00:00:10,340 --> 00:00:12,560
Now we're gonna look at one more collision

5
00:00:12,560 --> 00:00:16,010
resolution policy called double hashing.

6
00:00:16,010 --> 00:00:18,730
And double hashing is another approach

7
00:00:18,730 --> 00:00:21,070
to resolving hash collisions.

8
00:00:21,070 --> 00:00:22,530
We've seen that linear probing

9
00:00:22,530 --> 00:00:24,990
is prone to primary clustering.

10
00:00:24,990 --> 00:00:26,810
And while quadratic probing

11
00:00:26,810 --> 00:00:29,240
is designed to eliminate
primary clustering,

12
00:00:29,240 --> 00:00:30,600
we've seen that it's prone

13
00:00:30,600 --> 00:00:32,440
to what's called secondary clustering,

14
00:00:32,440 --> 00:00:33,950
where the clusters are distributed

15
00:00:33,950 --> 00:00:37,210
throughout the table, but
they exist nonetheless.

16
00:00:37,210 --> 00:00:38,950
Double hashing is designed to address

17
00:00:38,950 --> 00:00:40,333
both of these problems.

18
00:00:41,240 --> 00:00:43,790
So let's go back to our
comparison of linear probing

19
00:00:43,790 --> 00:00:44,963
and quadratic probing.

20
00:00:46,890 --> 00:00:49,000
As you recall with linear probing,

21
00:00:49,000 --> 00:00:51,040
we have a fixed stride.

22
00:00:51,040 --> 00:00:52,670
Typically one.

23
00:00:52,670 --> 00:00:54,690
And if we have a collision,

24
00:00:54,690 --> 00:00:57,690
we step from position to position

25
00:00:57,690 --> 00:01:01,670
until we find an open
place to make our insert.

26
00:01:01,670 --> 00:01:04,670
In the case of quadratic
probing, we vary our stride

27
00:01:04,670 --> 00:01:08,280
with each step using the
square of the number of steps.

28
00:01:08,280 --> 00:01:10,120
So the first step is one.

29
00:01:10,120 --> 00:01:11,600
The second step is four.

30
00:01:11,600 --> 00:01:12,960
The third step is nine.

31
00:01:12,960 --> 00:01:15,093
The fourth step is 16 and so on.

32
00:01:17,700 --> 00:01:21,390
Now, in the case of double
hashing, we have a fixed stride

33
00:01:21,390 --> 00:01:25,400
for any given key, but the
stride is calculated separately

34
00:01:25,400 --> 00:01:29,683
for each key using a second
independent hash function.

35
00:01:30,950 --> 00:01:34,170
So two different keys
that yield the same index

36
00:01:34,170 --> 00:01:37,080
is calculated from the
primary hash function,

37
00:01:37,080 --> 00:01:39,000
should have a different stride

38
00:01:39,000 --> 00:01:41,633
as calculated from the
secondary hash function.

39
00:01:42,490 --> 00:01:44,340
And that's how we prevent clustering.

40
00:01:46,520 --> 00:01:49,400
Now, what should the secondary
hash function look like?

41
00:01:49,400 --> 00:01:51,630
Well, it clearly has to be different

42
00:01:51,630 --> 00:01:54,580
from the hash function
used to get the index.

43
00:01:54,580 --> 00:01:57,310
And ideally, the output of
the primary hash function

44
00:01:57,310 --> 00:01:58,920
and the secondary hash function

45
00:01:58,920 --> 00:02:01,440
should be what's called
pairwise independent,

46
00:02:01,440 --> 00:02:03,393
meaning that they're uncorrelated.

47
00:02:04,810 --> 00:02:06,430
The secondary hash function,

48
00:02:06,430 --> 00:02:09,290
unlike the primary hash
function should return values

49
00:02:09,290 --> 00:02:13,050
in the range of one to
table size minus one.

50
00:02:13,050 --> 00:02:15,170
Those would be valid step sizes.

51
00:02:15,170 --> 00:02:17,570
Obviously, we don't want it to return zero

52
00:02:17,570 --> 00:02:20,010
because then it wouldn't probe.

53
00:02:20,010 --> 00:02:23,560
So that's why we have this
slightly different range.

54
00:02:23,560 --> 00:02:26,520
And again, as with our
primary hash function,

55
00:02:26,520 --> 00:02:29,510
the secondary hash function
should distribute values

56
00:02:29,510 --> 00:02:32,273
as uniformly as possible
within that range.

57
00:02:33,260 --> 00:02:36,190
So for example, we could
calculate some number, h,

58
00:02:36,190 --> 00:02:39,470
and we could use Horner's
method or some other method.

59
00:02:39,470 --> 00:02:43,240
And we would return p minus h mod p,

60
00:02:43,240 --> 00:02:47,240
where p is some prime number
less than the table size.

61
00:02:47,240 --> 00:02:49,900
And that's a minimal example

62
00:02:49,900 --> 00:02:51,483
of a secondary hash function.

63
00:02:52,530 --> 00:02:55,580
The primary hash function
gives us the starting point

64
00:02:55,580 --> 00:02:57,150
for our probe sequence.

65
00:02:57,150 --> 00:03:00,480
And the secondary hash
function gives us the stride

66
00:03:00,480 --> 00:03:01,713
if we need to probe.

67
00:03:04,340 --> 00:03:08,530
So here we have a table
showing our minimal

68
00:03:08,530 --> 00:03:10,210
primary hash function.

69
00:03:10,210 --> 00:03:13,990
This is f of x equals x mod 17.

70
00:03:13,990 --> 00:03:17,270
And we have a secondary
hash function, g of x

71
00:03:17,270 --> 00:03:22,270
which is a smaller prime
number, 13 minus x mod 13.

72
00:03:24,920 --> 00:03:29,310
And we see here that
for each different key,

73
00:03:29,310 --> 00:03:33,130
I have shown here in the table,
the primary hash function

74
00:03:33,130 --> 00:03:37,130
and the secondary hash function
yield different values.

75
00:03:37,130 --> 00:03:38,780
They're are pairwise independent.

76
00:03:39,860 --> 00:03:42,820
Okay. And that's an ideal.

77
00:03:42,820 --> 00:03:46,470
Now let's look at what happens
when we have a collision.

78
00:03:46,470 --> 00:03:51,140
Now here, I have all these
keys, zero, 17, 34, 51, 68.

79
00:03:51,140 --> 00:03:53,530
They're all multiples of 17.

80
00:03:53,530 --> 00:03:57,570
So when we apply that to
the primary hash function,

81
00:03:57,570 --> 00:03:59,050
we're always gonna get zero.

82
00:03:59,050 --> 00:04:03,780
So each one of these keys is
going to have a hash collision

83
00:04:03,780 --> 00:04:05,433
in its primary hash function.

84
00:04:06,800 --> 00:04:08,010
But look at what's happening

85
00:04:08,010 --> 00:04:10,680
in the case of the
secondary hash function.

86
00:04:10,680 --> 00:04:14,010
We get different values from
our secondary hash function.

87
00:04:14,010 --> 00:04:16,030
Zero mod 13 is zero.

88
00:04:16,030 --> 00:04:19,180
And so our secondary hash
gives us a stride of 13

89
00:04:19,180 --> 00:04:20,960
when the key is zero.

90
00:04:20,960 --> 00:04:23,920
When the key is 17, we
get a stride of nine.

91
00:04:23,920 --> 00:04:25,957
17 mod 13 is four.

92
00:04:25,957 --> 00:04:28,700
And 13 minus four is nine.

93
00:04:28,700 --> 00:04:32,400
When the key is 34, we
got a stride of five

94
00:04:32,400 --> 00:04:35,420
because 34 mod 13 is eight,

95
00:04:35,420 --> 00:04:38,760
and 13 minus eight is five, and so on.

96
00:04:38,760 --> 00:04:41,210
So you see, even though we have collisions

97
00:04:41,210 --> 00:04:42,750
in the primary key,

98
00:04:42,750 --> 00:04:45,403
we're getting different
values for the secondary key.

99
00:04:46,270 --> 00:04:47,103
So you see,

100
00:04:47,103 --> 00:04:49,810
even when we have a
collision on the insert index

101
00:04:49,810 --> 00:04:52,300
calculated from the primary hash function,

102
00:04:52,300 --> 00:04:55,530
we get a different stride from
the secondary hash function.

103
00:04:55,530 --> 00:04:58,933
And so these probe sequences
will all be different.

104
00:04:59,970 --> 00:05:01,673
Let's take a look at an example.

105
00:05:04,100 --> 00:05:06,540
Here again, we have our
primary hash function,

106
00:05:06,540 --> 00:05:09,330
f of x equals x mod 17,

107
00:05:09,330 --> 00:05:10,960
and our secondary hash function,

108
00:05:10,960 --> 00:05:15,960
g of x equals 13 minus x mod 13.

109
00:05:16,370 --> 00:05:19,570
And it's important that this prime number

110
00:05:19,570 --> 00:05:21,393
be less than the table size.

111
00:05:22,550 --> 00:05:23,890
Okay?

112
00:05:23,890 --> 00:05:26,420
So let's make our first insert.

113
00:05:26,420 --> 00:05:28,410
Let's say we want to insert five.

114
00:05:28,410 --> 00:05:32,470
F of x which is this function
here is gonna give us five,

115
00:05:32,470 --> 00:05:35,090
but g of x is going to give us eight.

116
00:05:35,090 --> 00:05:36,590
Now we don't have anything in our table,

117
00:05:36,590 --> 00:05:37,910
so we're not gonna have a collision.

118
00:05:37,910 --> 00:05:41,560
We're just gonna insert
five at the specified index.

119
00:05:41,560 --> 00:05:42,793
So there we have five.

120
00:05:44,350 --> 00:05:48,170
Now on the next insert, let's
say we want to insert 56.

121
00:05:48,170 --> 00:05:52,900
56 is gonna give us the
same primary hash result.

122
00:05:52,900 --> 00:05:54,513
F of x equals five.

123
00:05:55,490 --> 00:05:57,430
Now g of x,

124
00:05:57,430 --> 00:06:01,120
the secondary hash function
gives us a stride of nine.

125
00:06:01,120 --> 00:06:05,060
So when we insert, we're going
to take a stride of nine,

126
00:06:05,060 --> 00:06:07,920
one, two, three, four,

127
00:06:07,920 --> 00:06:12,280
five, six, seven, eight, nine.

128
00:06:12,280 --> 00:06:14,530
And 56 is gonna go here.

129
00:06:14,530 --> 00:06:16,520
Now let's insert 39.

130
00:06:16,520 --> 00:06:19,760
Again, 39, I've set it up this way,

131
00:06:19,760 --> 00:06:21,130
is gonna cause a collision

132
00:06:21,130 --> 00:06:25,870
because f of x 39 mod 17 is five.

133
00:06:25,870 --> 00:06:28,160
But g of x gives us a different value

134
00:06:28,160 --> 00:06:30,030
than we got before, before it was nine.

135
00:06:30,030 --> 00:06:31,620
Now it's 13.

136
00:06:31,620 --> 00:06:35,360
So our probe is gonna be one, two,

137
00:06:35,360 --> 00:06:38,410
three, four, five, six, seven,

138
00:06:38,410 --> 00:06:41,457
eight, nine, 10, 11, 12.

139
00:06:41,457 --> 00:06:44,250
'Cause we wrap around 13.

140
00:06:44,250 --> 00:06:47,740
And so 39 is gonna go here.

141
00:06:47,740 --> 00:06:49,410
Let's do one more.

142
00:06:49,410 --> 00:06:50,670
Here's 22.

143
00:06:50,670 --> 00:06:52,710
Again, f of x gives us five.

144
00:06:52,710 --> 00:06:54,430
That's our primary hash function.

145
00:06:54,430 --> 00:06:57,180
22 mod 17 is five.

146
00:06:57,180 --> 00:06:59,690
But g of x gives us four.

147
00:06:59,690 --> 00:07:04,103
Because 13 minus 22 mod 13 is four.

148
00:07:05,080 --> 00:07:07,450
And so our stride is going to be four,

149
00:07:07,450 --> 00:07:10,670
one, two, three, four.

150
00:07:10,670 --> 00:07:12,393
And 22 goes there.

151
00:07:13,270 --> 00:07:15,223
So what just happened?

152
00:07:16,090 --> 00:07:18,163
Well, we had these keys.

153
00:07:19,000 --> 00:07:23,830
This was the result of our
primary hash function, all five.

154
00:07:23,830 --> 00:07:26,260
And then our secondary
hash function gave us

155
00:07:26,260 --> 00:07:29,630
these values eight, nine, 13, four.

156
00:07:29,630 --> 00:07:31,520
And so each of these inserts

157
00:07:31,520 --> 00:07:34,501
follows a different probe sequence.

158
00:07:34,501 --> 00:07:37,533
That's the idea behind double hashing.

159
00:07:39,180 --> 00:07:42,100
Now what's the probability
of hash collisions

160
00:07:42,100 --> 00:07:43,563
having the same stride?

161
00:07:44,810 --> 00:07:47,690
In order for hash collisions
to have the same stride

162
00:07:47,690 --> 00:07:50,860
for their probe sequence,
both the primary hash function

163
00:07:50,860 --> 00:07:52,520
and the secondary hash function

164
00:07:52,520 --> 00:07:56,290
would have to return the same
value for two different keys.

165
00:07:56,290 --> 00:07:59,630
Now in an ideal world with
perfect hash functions,

166
00:07:59,630 --> 00:08:02,170
the outputs would be
distributed uniformly,

167
00:08:02,170 --> 00:08:04,653
just as if the hash functions were random.

168
00:08:05,500 --> 00:08:09,300
And we'd have something like a one over n

169
00:08:09,300 --> 00:08:11,867
for the likelihood of any given value

170
00:08:11,867 --> 00:08:14,380
from our primary hash function,

171
00:08:14,380 --> 00:08:16,210
times one over n,

172
00:08:16,210 --> 00:08:17,940
the likelihood of any value

173
00:08:17,940 --> 00:08:20,180
from our secondary hash function.

174
00:08:20,180 --> 00:08:22,380
And then the probability of both

175
00:08:22,380 --> 00:08:26,140
of those yielding the
same values would be one

176
00:08:26,140 --> 00:08:27,063
over n squared.

177
00:08:28,010 --> 00:08:30,920
Now, of course, this is a
lower limit on the probability

178
00:08:30,920 --> 00:08:34,330
that hash collisions have the same stride

179
00:08:34,330 --> 00:08:35,590
because in the real world,

180
00:08:35,590 --> 00:08:37,600
our hash functions won't be perfect.

181
00:08:37,600 --> 00:08:39,700
That is, they won't distribute the values

182
00:08:39,700 --> 00:08:43,653
with perfect uniformity, but
we want them to be close.

183
00:08:44,700 --> 00:08:47,860
So in the real world, this
probability will be greater

184
00:08:47,860 --> 00:08:51,730
than one over n squared, but
hopefully not too much greater.

185
00:08:51,730 --> 00:08:53,200
So if we have a hash table

186
00:08:53,200 --> 00:08:55,950
of around a thousand
elements, the probability

187
00:08:55,950 --> 00:08:59,040
of hash collisions having
the same probe sequence

188
00:08:59,040 --> 00:09:01,402
would be around one in a million.

189
00:09:01,402 --> 00:09:02,352
And that's not bad.

190
00:09:04,150 --> 00:09:07,050
So again, to compare our CRPs,

191
00:09:07,050 --> 00:09:10,350
our collision resolution policies,

192
00:09:10,350 --> 00:09:15,350
we see that all of these
probing approaches probe

193
00:09:16,330 --> 00:09:18,260
on a collision.

194
00:09:18,260 --> 00:09:19,890
Linear probing, quadratic, probing,

195
00:09:19,890 --> 00:09:22,090
and double hashing,
whereas separate chaining,

196
00:09:22,090 --> 00:09:23,223
we extend the chain.

197
00:09:24,620 --> 00:09:27,940
Again, all three of
these probing approaches

198
00:09:27,940 --> 00:09:29,910
have a fixed upper limit on the number

199
00:09:29,910 --> 00:09:32,730
of objects that we can
insert into our hash table.

200
00:09:32,730 --> 00:09:35,520
Whereas again, primary
chaining is limited only

201
00:09:35,520 --> 00:09:37,583
by the memory and system constraints.

202
00:09:38,810 --> 00:09:41,700
Linear probing has a fixed
stride, typically one.

203
00:09:41,700 --> 00:09:43,930
And that's always the same stride

204
00:09:43,930 --> 00:09:46,063
whenever we have a hash collision.

205
00:09:47,190 --> 00:09:50,390
Quadratic probing varies
the stride on each step.

206
00:09:50,390 --> 00:09:55,390
We take the square of the
step to get our stride

207
00:09:55,490 --> 00:09:56,743
for any given step.

208
00:09:57,640 --> 00:10:00,190
And double hashing has a fixed stride,

209
00:10:00,190 --> 00:10:02,693
but it's calculated by a second hash.

210
00:10:03,660 --> 00:10:06,710
Linear probing is prone
to primary clustering.

211
00:10:06,710 --> 00:10:10,040
Quadratic probing is prone
to secondary clustering.

212
00:10:10,040 --> 00:10:13,500
Double hashing reduces
clustering substantially.

213
00:10:13,500 --> 00:10:15,560
And of course, in the
case of separate chaining,

214
00:10:15,560 --> 00:10:16,993
clustering does not occur.

215
00:10:19,550 --> 00:10:22,480
So in summary, with double hashing,

216
00:10:22,480 --> 00:10:26,600
like other open addressing
techniques we've seen,

217
00:10:26,600 --> 00:10:29,493
we only allow a single
object at a given index.

218
00:10:30,470 --> 00:10:31,790
When we have a hash collision,

219
00:10:31,790 --> 00:10:34,490
we probe our hash table
one step at a time,

220
00:10:34,490 --> 00:10:36,470
but with a stride that's been calculated

221
00:10:36,470 --> 00:10:38,973
by a second independent hash function.

222
00:10:40,660 --> 00:10:42,900
And because we use a second hash function,

223
00:10:42,900 --> 00:10:45,110
the stride depends on the data

224
00:10:45,110 --> 00:10:46,890
and this makes it very unlikely

225
00:10:46,890 --> 00:10:49,210
that two insertions
with the same hash value

226
00:10:49,210 --> 00:10:52,610
for the first index would
follow the same probe sequence.

227
00:10:52,610 --> 00:10:54,530
Again, they'd have to have in effect,

228
00:10:54,530 --> 00:10:56,883
two concurrent hash collisions.

229
00:10:58,420 --> 00:11:01,700
And finally, double hashing,
as you might expect,

230
00:11:01,700 --> 00:11:03,360
has a fixed limit on the number

231
00:11:03,360 --> 00:11:06,260
of objects that we can
insert into our hash table.

232
00:11:06,260 --> 00:11:08,680
So as with linear probing,

233
00:11:08,680 --> 00:11:11,620
and quadratic probing,
rehashing is gonna be a thing

234
00:11:11,620 --> 00:11:13,590
when we're doing double hashing as well.

235
00:11:13,590 --> 00:11:16,493
So now I'll leave you with some questions.

236
00:11:18,400 --> 00:11:20,090
We have two basic strategies

237
00:11:20,090 --> 00:11:23,200
for hash collision, chaining and probing.

238
00:11:23,200 --> 00:11:26,100
Where probing methods are linear probing,

239
00:11:26,100 --> 00:11:28,790
quadratic probing, and double hashing.

240
00:11:28,790 --> 00:11:31,093
Which do you think uses more memory?

241
00:11:32,570 --> 00:11:34,933
Which of these do you think is faster?

242
00:11:36,760 --> 00:11:39,363
And how would you calculate
their complexities?

243
00:11:40,440 --> 00:11:42,670
Give these questions some thought

244
00:11:42,670 --> 00:11:44,760
and we'll see you in a later video.

245
00:11:44,760 --> 00:11:45,760
That's all for now.

246
00:11:45,760 --> 00:11:46,593
Thanks.