#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	156
6	Qingyu	155
7	djm03178	151
7	adamant	151
9	luogu_official	150
10	awoo	146

Introduction

I'm writing this blog because of the large number of blogs asking about why they get strange floating arithmetic behaviour in C++. For example:

"WA using GNU C++17 (64) and AC using GNU C++17" https://codeforces.net/blog/entry/78094

"The curious case of the pow function" https://codeforces.net/blog/entry/21844

"Why does this happen?" https://codeforces.net/blog/entry/51884

"Why can this code work strangely?" https://codeforces.net/blog/entry/18005

and many many more.

Example

Here is a simple example of the kind of weird behaviour I'm talking about

Example showing the issue

#include <iostream>
using namespace std;
 
double f(double a, double b) {
    return a * a - b;
}

int main() {
  cout.precision(60);
  
  // Calculate 10*10 - 1e-15
  double ans;
  ans = f(atof("10"), atof("1e-15"));
  cout << (double)      ans << '\n';
  cout << (int)         ans << "\n\n";
  
  ans = f(atof("10"), atof("1e-15"));
  cout << (int)         ans << '\n';
  cout << (double)      ans << "\n\n";
  
  ans = f(atof("10"), atof("1e-15"));
  cout << (double)      ans << '\n';
  cout << (long double) ans << "\n\n";
  
  ans = f(atof("10"), atof("1e-15"));
  cout << (long double) ans << '\n';
  cout << (double)      ans << "\n\n";
  return 0;
}

Output for 32 bit g++

100
100

99
100

100
100

99.99999999999999900079927783735911361873149871826171875
100

Output for 64 bit g++

Looking at this example, the output that one would expect from $$$10 * 10 - 10^{-15}$$$ is exactly $$$100$$$ since $$$100$$$ is the closest representable value of a double. This is exactly what happens in 64 bit g++. However, in 32 bit g++ there seems to be some kind of hidden excess precision causing the output to only sometimes(???) be $$$100$$$.

Explanation

In C and C++ there are different modes (referred to as methods) of how floating point arithmetic is done, see (https://en.wikipedia.org/wiki/C99#IEEE_754_floating-point_support). You can detect which one is being used by the value of FLT_EVAL_METHOD found in cfloat. In mode 2 (which is what 32 bit g++ uses by default) all floating point arithmetic is done using long double. Note that in this mode numbers are temporarily stored as long doubles while being operated on, this can / will cause a kind of excess precision. In mode 0 (which is what 64 bit g++ uses by default) the arithmetic is done using each corresponding type, so there is no excess precision.

Detecting and turning on/off excess precision

Here is a simple example of how to detect excess precision (partly taken from https://stackoverflow.com/a/20870774)

Test for detecting excess precision

// #pragma GCC target("fpmath=sse,sse2") // Turns off excess precision
// #pragma GCC target("fpmath=387") // Turns on excess precision

#include <iostream>
#include <cstdlib>
#include <cfloat>
using namespace std;

int main() {
  cout << "This is compiled in mode "<< FLT_EVAL_METHOD << '\n';
  cout << "0 means no excess precision.\n";
  cout << "2 means there is excess precision.\n\n";
  
  cout << "The following test detects excess precision\n";
  cout << "0 if no excess precision, or 8e-17 if there is excess precision.\n";
  double a = atof("1.2345678");
  double b = a*a;
  cout << b - 1.52415765279683990130 << '\n';
  return 0;
}

If b is rounded (as one would "expect" since it is a double), then the result is zero. Otherwise it is something like 8e-17 because of excess precision. I tried running this in custom invocation. MSVC(C++17), Clang and g++17(64bit) all use mode 0 and round b to 0, while g++11, g++14 and g++17 as expected all use mode 2 and b = 8e-17.

The culprit behind all of this misery is the old x87 instruction set, which only supports (80 bit) long double arithmetic. The modern solution is to on top of this use the SSE instruction set (version 2 or later), which supports both float and double arithmetic. On GCC you can turn this on with the flags -mfpmath=sse -msse2. This will not change the value of FLT_EVAL_METHOD, but it will effectively turn off excess precision, see 81993714.

It is also possible to effectively turn on excess precision with -mfpmath=387, see 81993724.

Fun exercise

Using your newfound knowledge of excess precision, try to find a compiler + input to "hack" this

Try to hack this

#include <iostream>
#include <cmath>
using namespace std;

bool f() {
    double x;
    cin >> x;
    
    double y = x + 1.0;
    if (y >= 1.0)
        return false;
    
    int w = pow(y, 2);
    
    if (y < 1.0)
        return false;
    return y == 1.0;
}

int main() {
    if (f())
        cout << "HACKED\n";
    else
        cout << "Not hacked\n";
}

Conclusion / TLDR

32 bit g++ by default does all of its floating point arithmetic with (80 bit) long double. This causes a ton of frustrating and weird behaviours. 64 bit g++ does not have this issue.

Comments (14)

Show archived | Write comment?

faker_faker

5 years ago, # |

-6

So what is the suggested solution for CP contests?

#pragma GCC target("fpmath=387") // Turns on excess precision => This seems very brittle and platform-dependent.

Does using long double for all calculations also work?

→ Reply

algmyr

5 years ago, # ^ |

+26

Solution: Don't do stupid things with floating point. Don't check floating point for equality. Be aware how floating point behaves wrt precision.

If you need extended precision you can explicitly use it, but in most cases it won't be needed if you take some care with how you do floating point operations (or avoid floating point when it's not needed).

I explicitly use -Wfloat-equal to ensure I never end up equating floating points.

However, at least in the example codes in the blog, a very simple subtraction operation is done. While I am aware of floating point issues, I would say that extended precision is more preferable than thinking about whether this particular line may lead to WA later.

← Rev. 2 →

However, at least in the example codes in the blog, a very simple subtraction operation is done.

Is there any substantial difference between a - b == 0 and a == b? The code in the blog does a - b and realizes it can give either 0 or stupidly close to zero. This is the exact same phenomenon as trying to compare two floats. I.e. not doing equality checks resolves it.

pajenegod

Yes, using long double everywhere works, doing it like that means you won't have to worry about excess precision. In general simply using 64 bit C++ means you don't have to worry about excess precision.

I wrote this blog to inform people about how floating arithmetic is done in C++. That does not mean I think excess precision is a good idea. The way I see it, excess precision is a thing of the past and I'm happy I don't have to bother with it when I use 64 bit C++.

dantrag

+65

Well explained! But the real problem is people comparing floats using a == b :)

jef

← Rev. 3 →

Doesn't this violate IEEE 754 since it requires basic floating point operations to be correctly rounded (results in closest representable value)? Is it related to https://codeforces.net/blog/entry/21844?

Edit: Seems that IEEE 754 allows excess precision.

+11

After testing I'm pretty sure https://codeforces.net/blog/entry/21844 is related to excess precision. I found a minimum working example that I believe has the same issue.

#include <iostream>
#include <cstdlib>
using namespace std;
 
double f(double a, double b) {
    return a * a - b;
}

int main() {
  cout.precision(60);
  char* a = "10";
  char* b = "0.000000000000001";
  cout << (long double) f(atof(a), atof(b)) << '\n';
  cout << (double)      f(atof(a), atof(b)) << '\n';
  cout << (int)         f(atof(a), atof(b)) << '\n';
  return 0;
}

99.99999999999999900079927783735911361873149871826171875
100
99

100
100
100

As you can see, excess precision is able to "leak out" of f.

The thing I'm not sure of is if pow having excess precision should be seen as a bug or not. From testing on cf, only when submitting under g++11 does pow have excess precision. It seems to not have excess precision on any later version. So probably pow having excess precision should be categorized as a bug, and it has now been fixed.

Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).

L0GIC

3 years ago, # |

The Code in (Detecting and turning on/off excess precision) Section:

using g++20 in Clion Gives 0 and the FLT_EVAL_METHOD says that the mode is 2 but when I change double b = a*a to long double b = a*a it give 8e-17 as I understand from you they should work the same or did i get something wrong?

and The code in (Fun exercise) Section :

how does the int w = pow(y, 2); affect the result?

3 years ago, # ^ |

The way mode 2 works is that all consecutive floating point calculations are done using long doubles. But when the floating point numbers are stored in memory, they are stored as their respecitive type.

The pow call forces y to be stored in memory, which rounds y. For example, if y had the value $$$1-2^{-60}$$$, then the pow call rounds it to exactly 1.0 since 1.0 is the closest representable double to $$$1-2^{-60}$$$.