During automated testing we encountered a problem with the transitivity of comparisons: if a=b and a=c, we assumed that b=c. Unfortunately this is not always the case, at least not in all systems. Consider the following SQL query:
select a=b, a=c, b=c from (values(1234567890123456789.0::double precision, 1234567890123456788::bigint, 1234567890123456789::bigint)) s(a,b,c)
If you execute it in Postgres (or DuckDB, or SQL Server, or …) the answer is (true, true, false). That means the comparison is not transitive! Why does this happen? When these systems compare a bigint and a double, they promote the bigint to double and then compare. But a double has only 52 bits of mantissa, which means that larger integers lose precision when promoted to double, causing false positives in the comparison.
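To see the precision loss directly, here is a minimal C++ snippet (not from the original query, just an illustration) that applies the same promotion to the two bigint values from the query:

#include <cstdint>
#include <cstdio>

int main() {
   int64_t b = 1234567890123456788;
   int64_t c = 1234567890123456789;
   // both values round to the nearest representable double, 1234567890123456768,
   // so the promoted comparison reports them as equal
   printf("%d\n", (double)b == (double)c); // prints 1
   printf("%.1f\n", (double)c);            // prints 1234567890123456768.0
}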
This behavior is highly undesirable, first because it confuses the optimizer, and second because (at least in our system) joins behave differently: hash joins convert to the most restrictive type and discard all values that cannot be represented, as they will almost certainly never find a join partner. For double/bigint joins this creates a noticeable difference between joins and plain comparisons, which is unfortunate.
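To illustrate that join-side behavior, here is a hypothetical sketch (the function name and exact policy are assumptions, not our system's actual code) of converting a double join key to the more restrictive bigint type, dropping keys that can never match:

#include <cstdint>
#include <cmath>
#include <optional>

// Hypothetical sketch: map a double join key to int64, discarding keys
// that cannot possibly equal any bigint value.
std::optional<int64_t> doubleKeyToInt64(double d) {
   if (!(d >= -0x1p+63 && d <= 0x1.fffffffffffffp+62))
      return std::nullopt; // NaN or outside the int64 range
   if (d != std::trunc(d))
      return std::nullopt; // fractional values never match an integer
   return static_cast<int64_t>(d);
}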
So how do we compare correctly? Conceptually the situation is clear: an IEEE 754 floating-point number with sign s, mantissa m, and exponent e represents the value (-1)^s*m*2^e, and all we have to do is compare the integer with that value. But there is no easy way to do this: if we write an int/double comparison in, for example, C++, the compiler performs the same promotion to double, which breaks the comparison just like in SQL.
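To make the (-1)^s*m*2^e view concrete, here is a small sketch (assuming a normal, finite double; zeros, subnormals, infinities, and NaN are ignored) that extracts the three components:

#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
   double v = 1234567890123456789.0;
   uint64_t bits;
   std::memcpy(&bits, &v, sizeof bits);
   uint64_t s = bits >> 63;                                 // sign bit
   int64_t e = (int64_t)((bits >> 52) & 0x7ff) - 1075;      // unbiased, shifted so m is an integer
   uint64_t m = (bits & ((1ull << 52) - 1)) | (1ull << 52); // 52 stored bits plus the implicit leading 1
   // prints (-1)^0 * 4822530820794753 * 2^8, i.e. 1234567890123456768:
   // the rounding already happened when the literal was parsed
   printf("(-1)^%" PRIu64 " * %" PRIu64 " * 2^%" PRId64 "\n", s, m, e);
}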
We can fix the logic by performing two comparisons: we first convert the int64 to double and compare. If the values are not equal, the order is clear and we can use it. Otherwise, we convert the double back to int64, which tells us whether the promotion rounded up or down, and compare the integers. Add some checks to avoid undefined behavior (converting INT64_MAX to double rounds up to 2^63, and converting that back to int64 is undefined) and to handle non-finite values, and we get:
int cmpDoubleInt64(double a, int64_t b) {
   // handle intmax and nan
   if (!(a <= 0x1.fffffffffffffp+62))
      return 1;
   // fast path comparison
   double bd = b;
   if (a != bd)
      return (a < bd) ? -1 : 1;
   // handle loss of precision
   int64_t ai = a;
   if (ai != b)
      return (ai < b) ? -1 : 1;
   return 0;
}
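Applied to the values from the SQL query above (a small test harness, not part of the original code):

#include <cstdint>
#include <cstdio>

int cmpDoubleInt64(double a, int64_t b); // as defined above

int main() {
   double a = 1234567890123456789.0; // stored as 1234567890123456768
   int64_t b = 1234567890123456788;
   int64_t c = 1234567890123456789;
   // prints -1 -1: a is less than both b and c, consistent with b < c
   printf("%d %d\n", cmpDoubleInt64(a, b), cmpDoubleInt64(a, c));
}

The non-transitive (true, true, false) of the promoted comparison becomes the consistent ordering a < b < c.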
This is the logic we use now. And who else does it correctly? Perhaps somewhat surprisingly, Python and SQLite. All other database systems (and programming languages) we tested lose precision during the comparison, causing the problems described above. IMHO a proper int/double comparison should be available in every programming language, at least as a library function, but that is currently not the case in most languages (and DBMSes). If you ever face this problem you can use the code above.