What does a typical binary search over real values look like? Like this:
float l = 0.0f, r = 1e14f;
for (int i = 0; i < iteration_count && l + eps < r; ++i)
{
    float mid = 0.5f * (l + r);
    if (check(mid))
        r = mid;
    else
        l = mid;
}
Here l and r are the boundaries of the search, check(x) is a function that returns false for every argument below some threshold and true from that threshold onward, and we need to find this threshold. As a rule, l or 0.5 * (l + r) is taken as the answer.
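To make the setup concrete, here is a minimal, self-contained version of the snippet above; eps, iteration_count and the check predicate (the smallest x with x * x >= 2, i.e. roughly sqrt(2)) are illustrative choices, not anything prescribed by the method.
#include <cstdio>

// Illustrative predicate: false below sqrt(2), true from sqrt(2) onward.
bool check(float x) { return x * x >= 2.0f; }

int main()
{
    const float eps = 1e-6f;          // chosen precision
    const int iteration_count = 100;  // chosen iteration limit
    float l = 0.0f, r = 1e14f;
    for (int i = 0; i < iteration_count && l + eps < r; ++i)
    {
        float mid = 0.5f * (l + r);
        if (check(mid))
            r = mid;
        else
            l = mid;
    }
    std::printf("%.7f\n", 0.5f * (l + r)); // ~1.414213..., accurate only up to eps
}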
But there are several disadvantages:
- The number of iterations must be chosen as a trade-off between precision and execution time.
- The answer always carries an error.
What can be done about this?
First, let's take a look at how real values are represented. You can read about this in detail, for example, here and here. We use the following fact: let int(X) take a real value X and return an integer with the same bitwise representation (the same X reinterpreted as an integer), and let real(X) do the opposite (reinterpret an integer as a real value). Then Y = real(int(X) + 1) is the next representable real value after X. Roughly, this is similar to Y = X + eps if X >= 0 and Y = X - eps if X <= 0, where eps is the smallest possible step (zero is included in both cases, because among real values there are both 0 and -0). Roughly, because eps is not the same for every X. I should note that there are also NaN, inf and the largest positive and negative values, but that is not what I want to talk about here. It can be seen that for non-negative real L and R with L < R, the condition int(L) < int(L) + 1 < int(L) + 2 < ... < int(R) is satisfied, just like L < L + eps < L + 2 * eps < ... < R. So we can run a binary search on the boundaries int(L) and int(R). Now we can do this:
float l_real = 0.0f, r_real = 1e14f;
uint32_t l = reinterpret_cast<uint32_t&>(l_real), r = reinterpret_cast<uint32_t&>(r_real);
while (l < r)
{
    uint32_t mid = l + (r - l) / 2;
    if (check(reinterpret_cast<float&>(mid)))
        r = mid;
    else
        l = mid + 1;
}
// After the loop l == r, and reinterpret_cast<float&>(l) is the answer:
// the smallest value in the range for which check() returns true.
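For completeness, here is a self-contained sketch of the same loop; the check predicate (find the smallest x with x * x >= 2, i.e. roughly sqrt(2)) is only an illustration, and memcpy is used in place of reinterpret_cast purely to sidestep strict-aliasing questions, the logic is unchanged.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Bitwise copies between float and uint32_t (the same reinterpretation as
// above, expressed through memcpy).
static uint32_t to_bits(float x) { uint32_t b; std::memcpy(&b, &x, sizeof b); return b; }
static float to_real(uint32_t b) { float x; std::memcpy(&x, &b, sizeof x); return x; }

// Illustrative predicate: false below sqrt(2), true from sqrt(2) onward.
static bool check(float x) { return x * x >= 2.0f; }

int main()
{
    // Sanity check of the fact above: adding 1 to the bit pattern gives the
    // next representable value (same as std::nextafterf for x >= 0).
    float x = 1.0f;
    std::printf("%d\n", to_real(to_bits(x) + 1) == std::nextafterf(x, INFINITY)); // 1

    float l_real = 0.0f, r_real = 1e14f;
    uint32_t l = to_bits(l_real), r = to_bits(r_real);
    while (l < r)
    {
        uint32_t mid = l + (r - l) / 2;
        if (check(to_real(mid)))
            r = mid;
        else
            l = mid + 1;
    }
    std::printf("%.9g\n", to_real(l)); // the smallest float with check() true, ~1.41421354
}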
The same works for double and uint64_t. You can also use int for float and long long for double. I should note that the standard does not guarantee that float and int are 32 bits each, or that double and long long are 64 bits each, but I have never seen a platform where this was not the case.
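If you want the compiler to verify the size assumption instead of trusting it, a static_assert does the job; and in C++20, std::bit_cast expresses the reinterpretation without casts or memcpy at all (a sketch, not part of the original code):
#include <bit>      // std::bit_cast (C++20)
#include <cstdint>

static_assert(sizeof(float) == sizeof(uint32_t), "float is assumed to be 32 bits");
static_assert(sizeof(double) == sizeof(uint64_t), "double is assumed to be 64 bits");

// The double version of the reinterpretation; the search loop itself is the
// same as for float, just with uint64_t boundaries.
inline uint64_t to_bits(double x) { return std::bit_cast<uint64_t>(x); }
inline double to_real(uint64_t b) { return std::bit_cast<double>(b); }
On pre-C++20 compilers, memcpy between the value and the integer achieves the same thing without aliasing issues.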
What are the advantages?
- Maximum possible precision.
- A small number of iterations (up to 32 for float and up to 64 for double).
What is important to remember?
- l, r and mid are not meaningful values by themselves; all that matters is that the order is preserved in both interpretations.
- l + r may overflow, therefore l + (r - l) / 2 is used instead of (l + r) / 2.
- There is an inconvenience when the boundaries have different signs. For negative values the bit pattern, read as a two's complement integer, is ordered in reverse, so the value has to be transformed when converting from real to integer to keep the order (a sketch of one such order-preserving mapping is given after this list). You can also make a check at zero so that the boundaries end up with the same sign, or add a constant to the values. In my experience, problems where the boundaries have different signs are a rare case.
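For reference, one common order-preserving mapping for the mixed-sign case looks like this (a standard trick, not something from the snippets above): set the sign bit for non-negative values and flip all bits for negative ones, so that unsigned comparison of the keys matches comparison of the floats.
#include <cstdint>
#include <cstring>

// Order-preserving key: for any finite floats a < b we get key(a) < key(b),
// including pairs with different signs (-0.0 lands right next to +0.0).
uint32_t ordered_key(float x)
{
    uint32_t b;
    std::memcpy(&b, &x, sizeof b);
    return (b & 0x80000000u) ? ~b : (b | 0x80000000u);
}

// Inverse of ordered_key, used to turn mid back into a float for check().
float from_ordered_key(uint32_t k)
{
    uint32_t b = (k & 0x80000000u) ? (k & 0x7FFFFFFFu) : ~k;
    float x;
    std::memcpy(&x, &b, sizeof x);
    return x;
}
With these two helpers the loop above works unchanged for arbitrary finite boundaries: take l = ordered_key(l_real), r = ordered_key(r_real) and call check(from_ordered_key(mid)).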
It would not be superfluous to take a look at pictures of the floating-point distribution. For example, I'll note that in the interval from 0 to std::numeric_limits<float>::max() the median (which is also the first mid in the bitwise binary search) is approximately 1.5. That is, as many values are representable from 0 to 1.5 as from 1.5 to the largest representable value. For float, the usual approach needs 129 iterations just to make the right boundary less than 1.5 if the answer is, for example, 0 (don't forget that the initial right boundary is usually much smaller). That is, the usual approach discards representable values extremely unevenly.
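The 1.5 claim is easy to verify directly (a throwaway sketch, again assuming a 32-bit float):
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <limits>

int main()
{
    float max_value = std::numeric_limits<float>::max();
    uint32_t l = 0, r;
    std::memcpy(&r, &max_value, sizeof r);
    uint32_t mid = l + (r - l) / 2;   // the very first mid of the bitwise search
    float mid_real;
    std::memcpy(&mid_real, &mid, sizeof mid_real);
    std::printf("%.8f\n", mid_real);  // prints a value just below 1.5
}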
An example of solving a problem using this approach: 45343644
The same approach applies to ternary search (who wants to make a cool example? ^ _ ^).
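As a starting point, here is a rough sketch of how the same trick could look for ternary search; it assumes non-negative boundaries and a strictly unimodal (first decreasing, then increasing) function f, and all names here are illustrative.
#include <cstdint>
#include <cstring>

double f(double x); // the unimodal function to minimize, defined elsewhere

double ternary_min(double l_real, double r_real) // boundaries assumed >= 0
{
    uint64_t l, r;
    std::memcpy(&l, &l_real, sizeof l);
    std::memcpy(&r, &r_real, sizeof r);
    auto value = [](uint64_t b) {
        double x;
        std::memcpy(&x, &b, sizeof x);
        return f(x);
    };
    while (r - l > 2)
    {
        uint64_t m1 = l + (r - l) / 3;
        uint64_t m2 = r - (r - l) / 3;
        if (value(m1) < value(m2))
            r = m2; // the minimum cannot be to the right of m2
        else
            l = m1; // the minimum cannot be to the left of m1
    }
    // At most three candidates remain; take the best of them.
    uint64_t best = l;
    for (uint64_t b = l + 1; b <= r; ++b)
        if (value(b) < value(best))
            best = b;
    double answer;
    std::memcpy(&answer, &best, sizeof answer);
    return answer;
}
Like the binary version, the number of iterations is bounded by the bit pattern range (roughly log base 1.5 of 2^64, about 110 steps for double) instead of being chosen by hand.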