1 00:00:00,240 --> 00:00:01,490 - [Instructor] Searching. 2 00:00:02,660 --> 00:00:05,070 We've seen binary search trees earlier, 3 00:00:05,070 --> 00:00:09,970 and related trees, AVL trees, and splay trees. 4 00:00:09,970 --> 00:00:13,210 And these are designed to make search easy. 5 00:00:13,210 --> 00:00:16,340 And these have big O of log n average case 6 00:00:16,340 --> 00:00:18,300 complexity for search. 7 00:00:18,300 --> 00:00:21,680 Here's a quick table that shows the average case 8 00:00:21,680 --> 00:00:23,670 and the worst case. 9 00:00:23,670 --> 00:00:24,970 If you'll recall, 10 00:00:24,970 --> 00:00:28,000 binary search tree has the worst case 11 00:00:28,000 --> 00:00:32,210 of big O of n when we have that pathological structure 12 00:00:32,210 --> 00:00:35,400 that all branches go in one direction. 13 00:00:35,400 --> 00:00:39,440 So we have essentially a linear chain of nodes. 14 00:00:39,440 --> 00:00:43,986 And AVL tree and splay tree are designed to mitigate 15 00:00:43,986 --> 00:00:47,910 that condition to keep the tree balanced 16 00:00:47,910 --> 00:00:50,330 and keep the performance in the range 17 00:00:50,330 --> 00:00:52,170 of O log n. 18 00:00:52,170 --> 00:00:54,300 Splay tree, of course, is amortized 19 00:00:54,300 --> 00:00:56,940 over a number of operations. 20 00:00:56,940 --> 00:00:59,300 And search can be faster 21 00:00:59,300 --> 00:01:02,993 if we repeatedly access the element at the root. 22 00:01:03,980 --> 00:01:06,393 But in the average case, it's big O of log n. 23 00:01:07,530 --> 00:01:09,460 But now the question arises. 24 00:01:09,460 --> 00:01:12,050 What do we do about searching a linear structure? 25 00:01:12,050 --> 00:01:15,250 Often, we're faced with arrays or vectors, 26 00:01:15,250 --> 00:01:18,070 and we want to find whether a value exists 27 00:01:18,070 --> 00:01:22,120 in an array or where, if a particular value exists 28 00:01:22,120 --> 00:01:22,953 in an array. 29 00:01:22,953 --> 00:01:24,690 So we want to be able to return the index 30 00:01:24,690 --> 00:01:25,963 of a particular value. 31 00:01:27,020 --> 00:01:30,180 And so what do we do in cases like this? 32 00:01:30,180 --> 00:01:32,140 If the array is unsorted, 33 00:01:32,140 --> 00:01:34,730 then brute force is the only approach. 34 00:01:34,730 --> 00:01:36,620 We have to check every element, 35 00:01:36,620 --> 00:01:38,320 and that's big O of n. 36 00:01:38,320 --> 00:01:41,480 So if we wanted to find the value 35 here, 37 00:01:41,480 --> 00:01:46,480 we'd have to search each individual value until we found 35. 38 00:01:48,280 --> 00:01:50,113 What if we wanted to find 42? 39 00:01:51,910 --> 00:01:54,560 We'd have to go all the way to the end. 40 00:01:54,560 --> 00:01:57,930 What if we searched for the value 56 41 00:01:57,930 --> 00:02:00,600 which is not in this array? 42 00:02:00,600 --> 00:02:03,040 We'd have to go all the way to the end 43 00:02:03,040 --> 00:02:06,540 to verify that it is not, in fact, in the list. 44 00:02:06,540 --> 00:02:11,100 So that's why we have a complexity of O of n. 45 00:02:11,100 --> 00:02:12,700 That's if the array is unsorted. 46 00:02:14,170 --> 00:02:19,120 Now, you'll recall when we were justifying sorting, 47 00:02:19,120 --> 00:02:22,520 we were saying, it's great because if you sort once, 48 00:02:22,520 --> 00:02:27,080 then you can find values more quickly ever thereafter. 49 00:02:27,080 --> 00:02:29,310 So here, we have the same values 50 00:02:29,310 --> 00:02:32,420 we had in the previous array, but now they're sorted. 51 00:02:32,420 --> 00:02:35,550 So now how can we make search faster? 52 00:02:35,550 --> 00:02:36,900 If the array's sorted, 53 00:02:36,900 --> 00:02:39,200 then we know if we've gone too far. 54 00:02:39,200 --> 00:02:42,183 Okay. So if we're looking for a 64, 55 00:02:43,720 --> 00:02:46,957 and we do 12, 35, 42, 44, 50, 63, 72. 56 00:02:49,390 --> 00:02:54,170 Oops, we've gotten to a value that's bigger than 64. 57 00:02:54,170 --> 00:02:55,650 And we haven't found 64 58 00:02:55,650 --> 00:02:57,653 so we know we can terminate early. 59 00:02:58,740 --> 00:03:01,010 But that's only a small help. 60 00:03:01,010 --> 00:03:04,580 Doing a linear search is still an order of O of n. 61 00:03:04,580 --> 00:03:06,360 And we can do better. 62 00:03:06,360 --> 00:03:07,303 Can you think how? 63 00:03:09,320 --> 00:03:11,590 If you don't already know the answer, 64 00:03:11,590 --> 00:03:12,810 pause the video here 65 00:03:12,810 --> 00:03:15,040 and puzzle over this for a bit. 66 00:03:15,040 --> 00:03:17,810 Let's say we wanted to find 42. 67 00:03:17,810 --> 00:03:20,730 We'd start by finding the middle of our array. 68 00:03:20,730 --> 00:03:24,920 We could take the maximum index 69 00:03:24,920 --> 00:03:28,770 plus the minimum index and integer divide by two. 70 00:03:28,770 --> 00:03:30,600 In this case here, 71 00:03:30,600 --> 00:03:33,610 eight minus zero divided by two gives us four. 72 00:03:33,610 --> 00:03:37,660 So we look for 42 in the middle here. 73 00:03:37,660 --> 00:03:39,180 We find the value 50. 74 00:03:39,180 --> 00:03:43,040 And we know that if 42 is in this array, 75 00:03:43,040 --> 00:03:45,873 it must be to the left. 76 00:03:47,150 --> 00:03:49,120 So we know it's not an index four 77 00:03:49,120 --> 00:03:51,640 since 42 is less than 50. 78 00:03:51,640 --> 00:03:52,760 And the array is sorted, 79 00:03:52,760 --> 00:03:55,070 so we know it must be to the left of 50 80 00:03:55,070 --> 00:03:57,260 if it's in the array at all. 81 00:03:57,260 --> 00:03:59,420 Now we calculate the middle index 82 00:03:59,420 --> 00:04:02,050 of the subarray on the left. 83 00:04:02,050 --> 00:04:06,977 That's three plus zero divided, integer divided by two. 84 00:04:06,977 --> 00:04:08,900 And that gives us one. 85 00:04:08,900 --> 00:04:10,360 And so we go to index one. 86 00:04:10,360 --> 00:04:11,930 Now the value is 35. 87 00:04:11,930 --> 00:04:14,190 But 42 is greater than 35 88 00:04:14,190 --> 00:04:17,400 so we know that it must lie to the right of 35, 89 00:04:17,400 --> 00:04:18,650 excluding the portion 90 00:04:18,650 --> 00:04:21,300 of the array that we've already ruled out. 91 00:04:21,300 --> 00:04:26,300 So again, we calculate, now, indices three plus two. 92 00:04:26,780 --> 00:04:29,313 Integer divided by two gives us two. 93 00:04:30,260 --> 00:04:31,593 And now we found 42. 94 00:04:32,610 --> 00:04:34,210 Notice that at each step, 95 00:04:34,210 --> 00:04:37,173 we cut the size of our search space in half. 96 00:04:38,080 --> 00:04:39,090 Now this should be setting 97 00:04:39,090 --> 00:04:41,740 off bells and whistles in your heads. 98 00:04:41,740 --> 00:04:45,210 What's the maximum number of steps that this kind of search 99 00:04:45,210 --> 00:04:48,480 will need in order to find a value in a sorted array 100 00:04:48,480 --> 00:04:50,493 or determine that it's not in the array? 101 00:04:53,130 --> 00:04:55,200 Well, it's O of log n. 102 00:04:55,200 --> 00:04:57,223 And this is called binary search. 103 00:04:59,330 --> 00:05:00,760 Here's some pseudo code. 104 00:05:00,760 --> 00:05:03,840 This just outlines what I've already walked you through. 105 00:05:03,840 --> 00:05:06,210 We start with the minimum value at zero. 106 00:05:06,210 --> 00:05:09,700 That's the minimum index in our vector. 107 00:05:09,700 --> 00:05:13,570 We set the maximum to be the length of the vector minus one. 108 00:05:13,570 --> 00:05:17,150 That's the largest index in our vector. 109 00:05:17,150 --> 00:05:18,910 And then we have a while loop. 110 00:05:18,910 --> 00:05:21,793 And we simply keep calculating the middle, 111 00:05:22,670 --> 00:05:27,070 and resetting the maximum or the minimum, 112 00:05:27,070 --> 00:05:28,760 depending on whether our target 113 00:05:28,760 --> 00:05:30,780 is less than the value 114 00:05:30,780 --> 00:05:33,420 that we find at the middle or greater 115 00:05:33,420 --> 00:05:36,570 than the value that we find in the middle. 116 00:05:36,570 --> 00:05:38,580 If neither of these cases holds, 117 00:05:38,580 --> 00:05:40,820 then we must've found the value. 118 00:05:40,820 --> 00:05:43,530 So we return the index. 119 00:05:43,530 --> 00:05:46,510 And then, if we get through this entire while loop 120 00:05:47,449 --> 00:05:48,810 and we don't find the value, 121 00:05:48,810 --> 00:05:51,300 then we can return null or some sentinel value. 122 00:05:51,300 --> 00:05:53,810 In the book, they used negative one. 123 00:05:53,810 --> 00:05:55,650 It's a sentinel value to indicate 124 00:05:55,650 --> 00:05:57,290 that your search has failed. 125 00:05:57,290 --> 00:05:59,063 That's a common approach. 126 00:06:00,520 --> 00:06:04,300 So binary search gives us a way to search 127 00:06:04,300 --> 00:06:07,983 through a sorted array in big O of log n. 128 00:06:10,270 --> 00:06:14,680 Here's another approach it's called interpolation search. 129 00:06:14,680 --> 00:06:17,660 Let's say we wanted to find 38. 130 00:06:17,660 --> 00:06:21,160 We know the minimum value in our array is zero. 131 00:06:21,160 --> 00:06:23,020 And the maximum value is 90. 132 00:06:23,020 --> 00:06:25,560 That's a range of 90. 133 00:06:25,560 --> 00:06:29,483 So where would we expect to find 38 in a range like that? 134 00:06:30,420 --> 00:06:32,810 Well, we take 38 divided by 90. 135 00:06:32,810 --> 00:06:34,740 That will give us .42. 136 00:06:34,740 --> 00:06:38,230 And we multiply that by the maximum index 137 00:06:38,230 --> 00:06:41,400 minus the minimum index, which gives us eight. 138 00:06:41,400 --> 00:06:45,500 That gives us a value of 3.37, and we truncate. 139 00:06:45,500 --> 00:06:50,500 And so we would expect to find 38 at index three. 140 00:06:51,010 --> 00:06:54,130 And in fact, that's exactly where it is. 141 00:06:54,130 --> 00:06:56,900 Now this is a nice example with values more 142 00:06:56,900 --> 00:06:59,490 or less uniformly spread out between the minimum 143 00:06:59,490 --> 00:07:00,760 and the maximum. 144 00:07:00,760 --> 00:07:04,100 Notice that they're spread out more or less evenly. 145 00:07:04,100 --> 00:07:06,990 Not exactly, but pretty close. 146 00:07:06,990 --> 00:07:09,510 So we have values more or less uniformly spread out 147 00:07:09,510 --> 00:07:11,650 between the minimum and the maximum. 148 00:07:11,650 --> 00:07:13,320 And if we have an array like this, 149 00:07:13,320 --> 00:07:16,660 then interpolation search works quite nicely. 150 00:07:16,660 --> 00:07:18,393 Let's look at another example. 151 00:07:19,720 --> 00:07:23,143 Here, the values aren't quite so uniformly distributed. 152 00:07:24,200 --> 00:07:26,610 So we start with the same calculation. 153 00:07:26,610 --> 00:07:29,610 The range of values is still from zero to 90. 154 00:07:29,610 --> 00:07:34,270 And we get that .42 when we take this division. 155 00:07:34,270 --> 00:07:39,270 And we multiply by the range of indices here, 156 00:07:39,400 --> 00:07:41,460 and we get 3.37. 157 00:07:41,460 --> 00:07:43,910 And we truncate to three and we've look 158 00:07:43,910 --> 00:07:46,400 in that position in the array. 159 00:07:46,400 --> 00:07:48,623 And we find that 38 is not there. 160 00:07:49,540 --> 00:07:51,610 So now what do we do? 161 00:07:51,610 --> 00:07:53,660 Well, we can exclude all the values 162 00:07:53,660 --> 00:07:56,630 to the left of that position, including that position. 163 00:07:56,630 --> 00:08:00,550 Since we know 38 is greater than 18, 164 00:08:00,550 --> 00:08:02,500 and we know that we have a sorted list. 165 00:08:03,640 --> 00:08:06,930 Well, the first calculation is the same as before. 166 00:08:06,930 --> 00:08:08,990 Now our calculation doesn't simplify. 167 00:08:08,990 --> 00:08:10,800 We were able to simplify before 168 00:08:10,800 --> 00:08:12,780 because our starting point was zero, 169 00:08:12,780 --> 00:08:15,230 and that simplified our calculation a little bit. 170 00:08:16,460 --> 00:08:18,810 But this isn't too much more complicated. 171 00:08:18,810 --> 00:08:21,840 We calculate the expected position of 38 172 00:08:21,840 --> 00:08:25,343 within the range of values from 21 to 90. 173 00:08:26,780 --> 00:08:29,860 And then we multiply that by our range of indices. 174 00:08:29,860 --> 00:08:32,550 That's from four to eight. 175 00:08:32,550 --> 00:08:35,250 And we add the offset of our starting index. 176 00:08:35,250 --> 00:08:36,920 That's four. 177 00:08:36,920 --> 00:08:39,660 So we have 38 minus 21. 178 00:08:39,660 --> 00:08:41,690 That's the 21 is the first value 179 00:08:41,690 --> 00:08:46,450 in our new subarray divided by 90 minus 21. 180 00:08:46,450 --> 00:08:49,480 That's the complete range of values we have here. 181 00:08:49,480 --> 00:08:51,543 And we get, you see, .28. 182 00:08:52,510 --> 00:08:56,600 And then we take .28 times eight minus four. 183 00:08:56,600 --> 00:08:57,433 That's eight. 184 00:08:57,433 --> 00:09:00,690 Here's the largest index, four is the smallest index. 185 00:09:00,690 --> 00:09:03,850 And we add the offset of the smallest index. 186 00:09:03,850 --> 00:09:07,660 And that gives us a value of 5.11. 187 00:09:07,660 --> 00:09:12,380 And so we truncate and we're gonna look at position five. 188 00:09:12,380 --> 00:09:14,370 And there, we find 38. 189 00:09:14,370 --> 00:09:16,920 That's exactly where we expected to find it. 190 00:09:16,920 --> 00:09:18,150 And that's where it was. 191 00:09:18,150 --> 00:09:20,670 And notice that this took two steps. 192 00:09:20,670 --> 00:09:23,363 Whereas binary search would have taken three. 193 00:09:24,260 --> 00:09:25,530 In the case of binary search, 194 00:09:25,530 --> 00:09:28,350 we'd have started at the middle, 21. 195 00:09:28,350 --> 00:09:30,180 And we would have determined 196 00:09:30,180 --> 00:09:31,880 that it was to the right of 21. 197 00:09:31,880 --> 00:09:33,280 We would have found the middle there 198 00:09:33,280 --> 00:09:35,370 which would have been 66, 199 00:09:35,370 --> 00:09:37,590 and then determined that it would have had to have been 200 00:09:37,590 --> 00:09:39,500 to the left of 66. 201 00:09:39,500 --> 00:09:41,900 And so we would have found it on the third step. 202 00:09:43,190 --> 00:09:45,450 So here again, interpolation search 203 00:09:45,450 --> 00:09:47,403 outperforms binary search. 204 00:09:49,300 --> 00:09:52,090 So what's the downside of interpolation search? 205 00:09:52,090 --> 00:09:53,660 Well, in order to work well, 206 00:09:53,660 --> 00:09:57,420 values must be more or less uniformly distributed. 207 00:09:57,420 --> 00:09:59,010 And as the values deviate 208 00:09:59,010 --> 00:10:00,680 from a uniform distribution, 209 00:10:00,680 --> 00:10:03,713 performance of interpolation search deteriorates. 210 00:10:04,810 --> 00:10:07,020 Here's a kind of a pathological example. 211 00:10:07,020 --> 00:10:10,800 This is kind of like our worst case binary search tree. 212 00:10:10,800 --> 00:10:14,023 It's a pathological example, but it illustrates the point. 213 00:10:14,930 --> 00:10:17,990 So interpolation search won't work well 214 00:10:17,990 --> 00:10:19,460 for an array like this. 215 00:10:19,460 --> 00:10:21,223 Say we want to find seven. 216 00:10:22,130 --> 00:10:27,130 Well, the range of values here is zero, 9,999 minus zero. 217 00:10:30,120 --> 00:10:33,730 And so if we calculate the position 218 00:10:33,730 --> 00:10:37,317 of seven divided by 9,999, 219 00:10:39,050 --> 00:10:43,083 that would lead us to index zero and our search would fail. 220 00:10:45,870 --> 00:10:47,430 So we'll try again. 221 00:10:47,430 --> 00:10:49,560 And we calculate the expected position 222 00:10:49,560 --> 00:10:53,640 of seven in the range from one to 9,999. 223 00:10:53,640 --> 00:10:55,533 And we end up at index one, 224 00:10:57,090 --> 00:10:58,330 and that would fail. 225 00:10:58,330 --> 00:11:02,070 And so we do it again, we'd calculate the expected position 226 00:11:02,070 --> 00:11:05,997 of seven and the range from two to 9,999. 227 00:11:05,997 --> 00:11:09,763 And we'd wind up at index two and so on. 228 00:11:14,070 --> 00:11:17,370 So here, our search would be of order O of n. 229 00:11:17,370 --> 00:11:22,010 Notice we had to scan almost the entire array to find seven. 230 00:11:22,010 --> 00:11:24,733 That's a worst case for interpolation search. 231 00:11:28,080 --> 00:11:29,380 Here's some pseudo code. 232 00:11:29,380 --> 00:11:32,800 And all this does is it outlines the procedure 233 00:11:32,800 --> 00:11:35,130 that I've used before. 234 00:11:35,130 --> 00:11:37,220 Notice that there are some edge cases 235 00:11:37,220 --> 00:11:38,800 that are not considered here. 236 00:11:38,800 --> 00:11:42,463 And you'll find a note to that effect in the book as well. 237 00:11:43,340 --> 00:11:47,160 But this is a simplified version of interpolation search. 238 00:11:47,160 --> 00:11:49,520 We initialize the minimum and the maximum. 239 00:11:49,520 --> 00:11:51,370 We have the while loop. 240 00:11:51,370 --> 00:11:56,200 We calculate the expected position 241 00:11:56,200 --> 00:11:59,070 within the range of values. 242 00:11:59,070 --> 00:12:04,070 We apply that to the difference between the maximum 243 00:12:04,650 --> 00:12:07,570 and the minimum indices that we're searching on, 244 00:12:07,570 --> 00:12:10,313 and add the offset of the minimum. 245 00:12:11,340 --> 00:12:14,560 And then again, just as with a binary search, 246 00:12:14,560 --> 00:12:16,920 we update max or min, depending on 247 00:12:16,920 --> 00:12:20,190 whether the target is less than the value 248 00:12:20,190 --> 00:12:22,120 at the position we calculated, 249 00:12:22,120 --> 00:12:23,940 or whether it's greater than 250 00:12:23,940 --> 00:12:26,113 the value at the position we calculated. 251 00:12:27,220 --> 00:12:28,480 And again, if those are equal, 252 00:12:28,480 --> 00:12:31,200 we found it and we returned the index. 253 00:12:31,200 --> 00:12:34,010 Otherwise, we returned null or some sentinel value. 254 00:12:34,010 --> 00:12:36,250 Again, in your book, they use negative one 255 00:12:36,250 --> 00:12:37,503 as the sentinel value. 256 00:12:39,590 --> 00:12:44,590 So in summary, we have a linear search, 257 00:12:45,060 --> 00:12:47,330 which has an average case of O of n, 258 00:12:47,330 --> 00:12:48,880 and a worst case of O of n. 259 00:12:48,880 --> 00:12:50,390 We don't like that. 260 00:12:50,390 --> 00:12:54,630 So if we have a sorted list, we can do binary search 261 00:12:54,630 --> 00:12:56,120 or interpolation search. 262 00:12:56,120 --> 00:13:00,093 Binary search has average and worst case of O of log n. 263 00:13:01,560 --> 00:13:05,090 Interpolation search has an average case 264 00:13:05,090 --> 00:13:07,510 of O of log log n. 265 00:13:07,510 --> 00:13:12,510 And the calculation of that complexity is outside 266 00:13:12,960 --> 00:13:15,713 the scope of the book and this course. 267 00:13:16,610 --> 00:13:19,200 It's a non-trivial calculation. 268 00:13:19,200 --> 00:13:21,110 But notice that the elements 269 00:13:21,110 --> 00:13:22,950 have to be uniformly distributed 270 00:13:22,950 --> 00:13:25,520 in order to achieve that performance. 271 00:13:25,520 --> 00:13:27,720 And the worst case of interpolation search, 272 00:13:27,720 --> 00:13:30,823 as we've just seen is, O of n. 273 00:13:32,220 --> 00:13:35,100 Now this is gonna motivate our next topic 274 00:13:35,100 --> 00:13:37,030 which is called hashing. 275 00:13:37,030 --> 00:13:40,990 And hashing is a way of finding values in constant time. 276 00:13:40,990 --> 00:13:43,010 We'll see more about that later. 277 00:13:43,010 --> 00:13:43,960 That's all for now.