Tuesday, January 19, 2010

The method

[Figure: flowchart of the binary search method]

In order to discuss the method in detail, a more formal description is necessary. The basic idea is that there is a data structure represented by array A in which individual elements are identified as A(1), A(2),…,A(N) and may be accessed in any order. The data structure contains a sub-element or data field called here Key, and the array is ordered so that the successive values A(1).Key ≤ A(2).Key and so on. The requirement is that given some value x, find an index p (not necessarily the one and only) such that A(p).Key = x.

To begin with, the span to be searched is the full supplied list of elements, as marked by variables L and R, and their values are changed with each iteration of the search process, as depicted by the flowchart. Note that the division by two is integer division, with any remainder lost, so that 3/2 comes out as 1, not 1½. The search finishes either because the value has been found, or else, the specified value is not in the list.
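As a sketch of the method just described (in Python, which is not from the original article; the 1-based array A(1)…A(N) becomes a 0-based Python list), the exclusive bounds L and R and the integer halving can be written directly:

```python
def binary_search(A, x, key=lambda e: e):
    """Search sorted list A for an element whose key equals x.

    Follows the exclusive-bound scheme described above: L and R start at
    0 and N + 1, and the candidates are always elements (L+1)..(R-1) in
    1-based terms, i.e. A[L..R-2] in Python's 0-based indexing.
    Returns a 0-based index p with key(A[p]) == x, or None if absent.
    """
    L, R = 0, len(A) + 1          # exclusive bounds around the full span
    while True:
        p = (R - L) // 2          # integer division: any remainder lost
        if p == 0:                # span (L+1)..(R-1) is empty
            return None           # "Not found"
        p = L + p                 # probe position, 1-based
        k = key(A[p - 1])         # convert to Python's 0-based indexing
        if x == k:
            return p - 1          # found: report the 0-based index
        elif x < k:
            R = p                 # x, if present, lies before position p
        else:
            L = p                 # x, if present, lies after position p
```

The `key` parameter stands in for the Key data field of the description; for a plain list of numbers the default identity function suffices.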

That it works

The method relies on and upholds the notion: if x is to be found, it will be amongst elements (L + 1) to (R − 1) of the array.

The initialisation of L and R to 0 and N + 1 makes this merely a restatement of the supplied problem, that elements 1 to N are to be searched, so the notion is established to begin with. The first step of each iteration is to check that there is something to search, which is to say whether there are any elements in the search span (L + 1) to (R − 1). The number of such elements is (R − L − 1), so computing (R − L) gives (number of elements + 1); halving that number (with integer division) means that if there was one element (or more) then p ≥ 1, but if none then p = 0, and in that case the method terminates with the report "Not found". Otherwise, for p > 0, the search continues with p := L + p, which by construction is within the bounds (L + 1) to (R − 1). That this position is at or adjacent to the middle of the span is not important here, merely that it is a valid choice.

Now compare x to A(p).Key. If x = A(p).Key then the method terminates in success. Otherwise, suppose x < A(p).Key. If so, then because the array is in sorted order, x will also be less than all later elements of the array, all the way to element (R − 1) as well. Accordingly, the value of the right-hand bound index R can be changed to be the value p, since, by the test just made, x < A(p).Key and so, if x is to be found, it will be amongst elements earlier than p, that is (p − 1) and earlier. And contrariwise, for the case x > A(p).Key, the value of L would be changed. Thus, whichever bound is changed, the ruling notion is upheld, and further, the span remaining to be searched is reduced. If L is changed, it is changed to a higher number (at least L + 1), whereas if R is changed, it is to a lower number (at most R − 1), because those are the limits for p.

Should there have been just one value remaining in the search span (so that L + 1 = p = R − 1), and x did not match, then depending on the sign of the comparison either L or R will receive the value of p and at the start of the next iteration the span will be found to be empty.

Accordingly, with each iteration, if the search span is empty the result is "Not found", otherwise either x is found at the probe point p or the search span is reduced for the next iteration. Thus the method works, and so can be called an Algorithm.

That it is fast

With each iteration that fails to find a match at the probed position, the search is continued in one or the other of the two sub-intervals, each at most half the size. More precisely, if the number of items, N, is odd then both sub-intervals contain (N − 1)/2 elements, while if N is even then the two sub-intervals contain N/2 − 1 and N/2 elements.

If the original number of items is N then after the first iteration there will be at most N/2 items remaining, then at most N/4 items, at most N/8 items, and so on. In the worst case, when the value is not in the list, the algorithm must continue iterating until the span has been made empty; this will have taken at most ⌊log2(N) + 1⌋ iterations, where the ⌊ ⌋ notation denotes the floor function that rounds its argument down to an integer. This worst case analysis is tight: for any N there exists a query that takes exactly ⌊log2(N) + 1⌋ iterations. When compared to linear search, whose worst-case behaviour is N iterations, we see that binary search is substantially faster as N grows large. For example, to search a list of 1 million items takes as many as 1 million iterations with linear search, but never more than 20 iterations with binary search. However, binary search is only valid if the list is in sorted order.
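The worst-case bound can be checked empirically (a Python sketch, not part of the original article): count the probes for an absent value in every possible gap of a list 1…N and take the maximum.

```python
import math

def probes_until_not_found(A, x):
    """Count the probes made by the search described above before it
    reports "Not found"; x must be absent from the sorted list A."""
    L, R, probes = 0, len(A) + 1, 0
    while True:
        p = (R - L) // 2
        if p == 0:
            return probes        # span is empty: "Not found"
        p = L + p
        probes += 1              # one probe of A(p).Key
        if x < A[p - 1]:
            R = p
        else:
            L = p

# For every N, some absent value needs exactly floor(log2(N)) + 1 probes,
# and none needs more: probe each of the N + 1 gaps with a half-integer.
for N in range(1, 500):
    A = list(range(1, N + 1))
    worst = max(probes_until_not_found(A, k + 0.5) for k in range(N + 1))
    assert worst == math.floor(math.log2(N)) + 1
```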

Average performance

There are two cases: for searches that will fail because the value is not in the list, the search span must be successively halved until no more elements remain, and this process will require at most the ⌊log2(N) + 1⌋ probes just discussed, or one fewer. The latter occurs because the search span is not in fact exactly halved, and, depending on the value of N and which elements of the list the absent value x falls between, the span may be closed early.

For searches that will succeed because the value is in the list, the search may finish early because a probed value happens to match. Loosely speaking, half the time the search will finish one iteration short of the maximum, and a quarter of the time, two iterations short. Consider then a test in which a list of N elements is searched once for each of the N values in the list, and determine the total number of probes n over all N searches.

   N  =  1    2     3     4     5     6     7     8     9    10     11     12     13
   n  =  1    3     5     8    11    14    17    21    25    29     33     37     41
  n/N =  1   3/2   5/3   8/4  11/5  14/6  17/7  21/8  25/9  29/10  33/11  37/12  41/13
      =  1   1.5   1.67  2    2.2   2.33  2.43  2.63  2.78  2.9    3      3.08   3.15
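These figures can be reproduced by a short test harness (a Python sketch, not from the original article): search a list of N elements once for each of its values and total the probes.

```python
from fractions import Fraction

def probes_until_found(A, x):
    """Probes made by the search described above; x must be in A."""
    L, R = 0, len(A) + 1             # exclusive bounds, as before
    probes = 0
    while True:
        p = L + (R - L) // 2
        probes += 1
        if x == A[p - 1]:
            return probes
        elif x < A[p - 1]:
            R = p
        else:
            L = p

def average_probes(N):
    """n/N: total probes over all N successful searches, divided by N."""
    A = list(range(1, N + 1))
    return Fraction(sum(probes_until_found(A, x) for x in A), N)
```

For example, `average_probes(4)` gives 2 (the table's 8/4) and `average_probes(13)` gives 41/13.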

In short, log2(N) − 1 is approximately the expected number of probes in an average successful search, and the worst case is about log2(N), just one more probe. If the list is empty, no probes at all are made.

Suppose the list to be searched contains N even numbers (say, 2, 4, 6, 8 for N = 4) and a search is done for values 1, 2, 3, 4, 5, 6, 7, 8, and 9. The even numbers will be found, and the average number of iterations can be calculated as described. The odd numbers will not be found, and this collection of test values probes every possible position (with regard to the numbers that are in the list) at which a value might fail to be found, so an average can be calculated for them too. The maximum value is, for each N, the greatest number of iterations required amongst the various trial searches of those N elements. The first plot shows the iteration counts for N = 1 to 63 (with N = 1, all results are 1), and the second plot is for N = 1 to 32767.
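That experiment can be sketched as follows (Python, illustrative only). For the N = 4 list of even numbers, the successful searches average 8/4 = 2 probes, matching the table, and the longest trial takes 3 iterations.

```python
def probe_count(A, x):
    """Iterations (probes) made by the search described above before it
    terminates, whether x is found or not."""
    L, R, probes = 0, len(A) + 1, 0
    while (R - L) // 2 > 0:          # while the span is non-empty
        p = L + (R - L) // 2
        probes += 1
        if x == A[p - 1]:
            break                    # found
        elif x < A[p - 1]:
            R = p
        else:
            L = p
    return probes

evens = [2, 4, 6, 8]                                  # the N = 4 example
found = [probe_count(evens, x) for x in (2, 4, 6, 8)]
absent = [probe_count(evens, x) for x in (1, 3, 5, 7, 9)]
print(sum(found) / len(found))   # 2.0  (= 8/4, as in the table)
print(max(found + absent))       # 3    (the maximum over all trials)
```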


[Plot: iteration counts for 1 ≤ N ≤ 63]

The curve for the "found" searches approaches log2(N) − 1 more closely for larger N as larger numbers of iterations are involved, in the same way that the successive summations 1/2, 1/4 + 1/2, 1/8 + 1/4 + 1/2 approach 1 as the number of terms increases: these are the probabilities of early detection of equality in successive iterations of a search. The slight bend in the curves within each iteration limit group is due to the early narrowing of the search bounds into two sub-spans whose lengths are often unequal.


[Plot: iteration counts for 1 ≤ N ≤ 32767]

Thus binary search is a logarithmic algorithm and executes in O(log N) time. In most cases it is considerably faster than a linear search. It can be implemented using iteration (as shown above) or recursion. In some languages it is more elegantly expressed recursively; however, in some C-based languages tail recursion is not eliminated and the recursive version requires more stack space.
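For illustration, the same exclusive-bound search can be written recursively (a Python sketch; note that Python itself does not eliminate tail calls, so each halving costs a stack frame):

```python
def binary_search_rec(A, x, L=None, R=None):
    """Recursive form of the exclusive-bound search described above.
    Returns a 0-based index of a matching element, or None."""
    if L is None:                    # initial call: span is the whole list
        L, R = 0, len(A) + 1
    p = (R - L) // 2
    if p == 0:
        return None                  # empty span: "Not found"
    p = L + p
    if x == A[p - 1]:
        return p - 1
    elif x < A[p - 1]:
        return binary_search_rec(A, x, L, p)   # tail call on left span
    else:
        return binary_search_rec(A, x, p, R)   # tail call on right span
```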

Binary search can interact poorly with the memory hierarchy (i.e. caching), because of its random-access nature. For in-memory searching, if the span to be searched is small, a linear search may have superior performance simply because it exhibits better locality of reference. For external searching, care must be taken or each of the first several probes will lead to a disk seek. A common technique is to abandon binary searching for linear searching as soon as the size of the remaining span falls below a small value such as 8 or 16, or even more on recent computers. The exact value depends entirely on the machine running the algorithm.
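A sketch of that hybrid (Python; the threshold of 8 is an arbitrary, machine-dependent example value, as the paragraph above notes):

```python
LINEAR_THRESHOLD = 8    # machine-dependent; 8 is an arbitrary example

def hybrid_search(A, x):
    """Binary search that switches to a linear scan once the remaining
    span holds at most LINEAR_THRESHOLD elements.
    Returns a 0-based index of a matching element, or None."""
    L, R = 0, len(A) + 1               # exclusive bounds, as before
    while R - L - 1 > LINEAR_THRESHOLD:    # span size is R - L - 1
        p = L + (R - L) // 2
        if x == A[p - 1]:
            return p - 1
        elif x < A[p - 1]:
            R = p
        else:
            L = p
    for i in range(L, R - 1):          # linear scan of the small span
        if A[i] == x:
            return i
    return None
```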

Notice that for multiple searches with a fixed value of N, (with the appropriate regard for integer division) the first iteration always selects the middle element at N/2, and the second always selects either N/4 or 3N/4, and so on. Thus if the array's key values are in some sort of slow storage (in a disc file, in virtual memory, not in the CPU's on-chip cache), keeping those three keys in a local array for a special preliminary search will avoid accessing widely separated memory. Escalating to seven or fifteen such values allows further levels at not much cost in storage. On the other hand, if the searches are frequent and not separated by much other activity, the computer's various storage control features will more or less automatically promote frequently-accessed elements into faster storage.
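A sketch of that preliminary search (Python, illustrative only; the function names are assumptions): the three keys probed at the first two levels are precomputed once into a local tuple, and each subsequent search consults them before touching the main array.

```python
def probe_prefix(A):
    """1-based positions probed at the first two levels, and their keys,
    for the exclusive-bound search described above. Needs len(A) >= 3."""
    N = len(A)
    p1 = (N + 1) // 2                  # first probe: always the same
    pL = p1 // 2                       # second probe when x < A(p1).Key
    pR = p1 + (N + 1 - p1) // 2        # second probe when x > A(p1).Key
    return (pL, p1, pR), (A[pL - 1], A[p1 - 1], A[pR - 1])

def search_with_prefix(A, x, prefix):
    """Resolve the first two probes from the cached keys, then continue
    the ordinary search on whatever span remains."""
    (pL, p1, pR), (kL, k1, kR) = prefix
    if x == k1:                        # first probe, from the local tuple
        return p1 - 1
    if x < k1:
        L, R = 0, p1
        mid, key = pL, kL
    else:
        L, R = p1, len(A) + 1
        mid, key = pR, kR
    if x == key:                       # second probe, also cached
        return mid - 1
    elif x < key:
        R = mid
    else:
        L = mid
    while True:                        # ordinary search from here on
        p = (R - L) // 2
        if p == 0:
            return None
        p = L + p
        if x == A[p - 1]:
            return p - 1
        elif x < A[p - 1]:
            R = p
        else:
            L = p
```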

When multiple binary searches are to be performed for the same key in related lists, fractional cascading can be used to speed up successive searches after the first one.
