Wednesday, February 11, 2015

Binary search - is it still most optimal?

If you have a sorted collection of elements, how would you find index of specific value?
"Binary search" is likely to be your answer.
Algorithms theory is teaching us what binary search is most optimal algorithm for this task with log(N) complexity.
Well, hash table can do better, if you need to find key by exact match. In many cases, though, you have reasons to have your collection sorted, not hashed.

On my job, I'm working on sophisticated in-memory database tailored for streaming data processing. We have a lot of places where we deal with sorted collection of integers (data row references, etc).

Algorithms theory is good, but in reality there are things like cache hierarchy, branch prediction, super scalar execution which may skew performance at edge cases.

Question is - where lie borders between reality ruled by CPU quirks and lawful space of classic algorithms theory?

If you have a doubt - do an experiment.

Experiment is simple: I'm generating a large number of sorted arrays of 32 bit integers. When I search random key in random array multiple times. In each experiment average size of array is fixed. Large number of arrays used to ensure cold memory access. Average time search time is measured.

All code written in Java and measured using JMH tool.

Participants are

  • Binary search - java.util.Arrays.binarySearch()
  • Linear search - simple loop over array until key is found
  • Linear search 2 - looping over every second element in array, if greater key is found, check i - 1 index too

X axis is average array length
Y axis is average time of single search in microseconds
Measurments have been done on 3 different types CPU.

Results speak for themselves.

I was surprised a little, as I were expecting binary search to outperform linear at length of 32 or 64, but it seems that modern processors are very good at optimizing linear memory access.

Provided that 8 - 128 is a practical range for BTree like structures, I will likely to reconsider some of data structures used in our database.

No comments:

Post a Comment