Binary search is a search algorithm that is dramatically faster than PHP’s linear search functions (such as array_search) when searching large, ordered data sets.

How does it work?

PHP’s internal function array_search is an example of a linear search: working from front to back, it checks each item in turn until it finds a match. This is fine if your data set is small or unordered, but it becomes very inefficient over large data sets, especially if the match is toward the back of the set or doesn’t exist at all, because every item before it has to be checked first.
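To make the front-to-back behaviour concrete, here is a minimal sketch of a hand-written linear search. The linearSearch function is a hypothetical helper for illustration, not PHP’s actual implementation, and it uses strict comparison, whereas array_search compares loosely by default.

```php
<?php
// Hypothetical helper: checks each item in turn, front to back.
function linearSearch($needle, array $haystack)
{
    foreach ($haystack as $key => $value) {
        if ($value === $needle) {
            // Stop at the first match.
            return $key;
        }
    }

    // Every item was checked and none matched.
    return false;
}

$items = ['delta', 'alpha', 'charlie', 'bravo'];
var_dump(linearSearch('charlie', $items)); // int(2)
var_dump(linearSearch('zulu', $items));    // bool(false)
```

The search for 'zulu' has to visit all four items before giving up, which is the worst case described above.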

A different approach – divide and conquer

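Binary search approaches the problem differently. It starts at the middle of the data set and compares that item with the value being searched for. Because the data is ordered, that single comparison shows which half the match must be in, so the other half can be discarded and the process repeated on the narrowed range (hence the requirement for your data to be ordered).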

Binary search divides the data set to find the match.
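Because the range halves on every step, the number of steps grows very slowly with the size of the data set. As a rough illustration, the sketch below counts the worst-case number of halvings for a given size. The maxBinarySearchSteps function is a hypothetical helper, and the count assumes the midpoint is rounded down.

```php
<?php
// Hypothetical helper: worst-case number of loop iterations for a binary
// search over $n sorted items, which works out to floor(log2(n)) + 1.
function maxBinarySearchSteps(int $n): int
{
    $steps = 0;
    while ($n > 0) {
        // Each comparison discards half of the remaining range.
        $n = intdiv($n, 2);
        $steps++;
    }

    return $steps;
}

foreach ([100, 10000, 1000000] as $size) {
    echo $size, ' items: at most ', maxBinarySearchSteps($size), " steps\n";
}
```

This gives at most 7 steps for 100 items, 14 for 10,000 and 20 for 1,000,000, whereas a linear search may need to check every item.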

A PHP implementation

Below is a PHP example of how to implement a binary search.

function binarySearch($needle, array $haystack, $compare, $high, $low = 0, $containsDuplicates = false)
{
    $key = false;
    // Loop whilst the range is not empty. If it becomes empty, no match was found.
    while ($high >= $low) {
        // Find the middle of the range.
        $mid = (int)floor(($high + $low) / 2);
        // Compare the needle with the middle of the range. The callback should return <0 if the needle sorts before
        // the middle item, >0 if it sorts after it, and 0 if the two match.
        $cmp = call_user_func($compare, $needle, $haystack[$mid]);
        // Adjust the range based on the above logic, so the next loop iteration will use the narrowed range
        if ($cmp < 0) {
            $high = $mid - 1;
        } elseif ($cmp > 0) {
            $low = $mid + 1;
        } else {
            // We've found a match
            if ($containsDuplicates) {
                // Find the first item, if there is a possibility our data set contains duplicates by comparing the
                // previous item with the current item ($mid).
                while ($mid > 0 && call_user_func($compare, $haystack[($mid - 1)], $haystack[$mid]) === 0) {
                    $mid--;
                }
            }
            $key = $mid;
            break;
        }
    }

    return $key;
}

We can utilise this function in the following example by searching an array of email addresses for a specific one. Any test data that appears here was generated by Faker.

$emails = [/* array of emails */];
$searchEmail = '[email protected]';
$key = binarySearch($searchEmail, $emails, 'strcmp', count($emails) - 1, 0, true);
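Two details are easy to miss when calling the function: the array has to be sorted with the same ordering the comparison callback uses, and the result should be checked with === false, because a match at index 0 is falsy. The sketch below shows both. It includes a compact copy of binarySearch so that it runs on its own, and the addresses are made up for illustration on the reserved example.com domain.

```php
<?php
// Compact copy of the binarySearch() function above, so this snippet is self-contained.
function binarySearch($needle, array $haystack, $compare, $high, $low = 0, $containsDuplicates = false)
{
    while ($high >= $low) {
        $mid = intdiv($high + $low, 2);
        $cmp = call_user_func($compare, $needle, $haystack[$mid]);
        if ($cmp < 0) {
            $high = $mid - 1;
        } elseif ($cmp > 0) {
            $low = $mid + 1;
        } else {
            // Walk back to the first item of any run of equal items.
            while ($containsDuplicates && $mid > 0
                && call_user_func($compare, $haystack[$mid - 1], $haystack[$mid]) === 0) {
                $mid--;
            }

            return $mid;
        }
    }

    return false;
}

// Made-up test data, deliberately unsorted and containing a duplicate.
$emails = ['carol@example.com', 'alice@example.com', 'bob@example.com', 'alice@example.com'];

// Sort as plain strings so the order agrees with strcmp.
sort($emails, SORT_STRING);

$key = binarySearch('alice@example.com', $emails, 'strcmp', count($emails) - 1, 0, true);

// Strict check: index 0 is a valid match but would fail a loose truthiness test.
if ($key === false) {
    echo "Not found\n";
} else {
    echo "Found at index {$key}\n";
}
```

Here the match is the first item of the sorted array, so $key is 0 and a plain if ($key) check would wrongly report it as missing.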

Benchmarks

So let’s benchmark the performance of binary search against PHP’s internal array_search function over a variety of data set sizes and match positions.
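The harness behind the figures below isn’t shown, so the following is only a sketch of how timings like these could be collected, not the code that produced them. The timeInMs and findSorted functions are hypothetical names, findSorted is a simplified stand-in for binarySearch that works on sorted integers, and the data is generated with range rather than Faker.

```php
<?php
// Hypothetical helper: run $fn once and return the elapsed wall-clock time in milliseconds.
function timeInMs(callable $fn): float
{
    $start = microtime(true);
    $fn();

    return (microtime(true) - $start) * 1000;
}

// Simplified stand-in for binarySearch(): searches a sorted list of integers.
function findSorted(int $needle, array $haystack)
{
    $low = 0;
    $high = count($haystack) - 1;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        if ($haystack[$mid] < $needle) {
            $low = $mid + 1;
        } elseif ($haystack[$mid] > $needle) {
            $high = $mid - 1;
        } else {
            return $mid;
        }
    }

    return false;
}

$data = range(0, 999999); // one million sorted integers
$missing = -1;            // worst case: the value is not in the set

$linearMs = timeInMs(function () use ($data, $missing) {
    array_search($missing, $data, true);
});
$binaryMs = timeInMs(function () use ($data, $missing) {
    findSorted($missing, $data);
});

printf("array_search: %.3fms\nbinary search: %.3fms\n", $linearMs, $binaryMs);
```

A single run like this is noisy, so averaging over many runs would give steadier numbers. Absolute timings will also vary with the machine and PHP version.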

Small data set – 100 items

Item exists as the first entry in the data set:

  • PHP’s array_search: 0.026ms
  • Binary search: 0.019ms
  • Binary search is 1.37 times faster than array_search

Item exists around the middle of the data set:

  • PHP’s array_search: 0.030ms
  • Binary search: 0.021ms
  • Binary search is 1.43 times faster than array_search

Item does not exist in the data set:

  • PHP’s array_search: 0.030ms
  • Binary search: 0.019ms
  • Binary search is 1.58 times faster than array_search

Medium data set – 10,000 items

Item exists as the first entry in the data set:

  • PHP’s array_search: 0.032ms
  • Binary search: 0.024ms
  • Binary search is 1.33 times faster than array_search

Item exists around the middle of the data set:

  • PHP’s array_search: 0.120ms
  • Binary search: 0.020ms
  • Binary search is 6 times faster than array_search

Item does not exist in the data set:

  • PHP’s array_search: 0.191ms
  • Binary search: 0.022ms
  • Binary search is 8.68 times faster than array_search

Large data set – 1,000,000 items

Item exists as the first entry in the data set:

  • PHP’s array_search: 0.037ms
  • Binary search: 0.035ms
  • Binary search is 1.06 times faster than array_search

Item exists around the middle of the data set:

  • PHP’s array_search: 8.734ms
  • Binary search: 0.026ms
  • Binary search is 335.92 times faster than array_search

Item does not exist in the data set:

  • PHP’s array_search: 15.676ms
  • Binary search: 0.032ms
  • Binary search is 489.87 times faster than array_search

In summary

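The benchmarks show that binary search is only slightly faster than array_search on small data sets, or when the match is the first item, but as the data set grows the difference becomes huge: on one million items it was several hundred times faster when the match was in the middle of the set or missing altogether. Binary search should be used when you know the data set is large and already ordered.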
