Are Floating Point Computations Reliable?

or again

Is a Computer a Perfect Computing Machine?

Jean-François COLONNA
[Contact me]

www.lactamme.polytechnique.fr

CMAP (Centre de Mathématiques APpliquées) UMR CNRS 7641, École polytechnique, Institut Polytechnique de Paris, CNRS, France
france telecom, France Telecom R&D

[Site Map, Help and Search [Plan du Site, Aide et Recherche]]
[The Y2K Bug [Le bug de l'an 2000]]
[Please, visit A Virtual Machine for Exploring Space-Time and Beyond, the place where you can find more than 10.000 pictures and animations between Art and Science]
(CMAP28 WWW site: this page was created on 03/22/2004 and last updated on 03/19/2025 13:32:35 -CET-)

[en français/in french]

Keywords: Floating Point Numbers, Nombres Flottants, Rounding-off Errors, Erreurs d'arrondi.

A computer is a finite machine

the data (in order to be manipulated) must be quantified and sampled

finite

                    S  = 0
                     0


                    S  = S    + 1
                     n    n-1

                    S  = n
                     n

                    #define   N         "an arbitrary integer positive value..."


                    main()
                              {
                              int       n;
                              float     sigma=0;


                              for       (n=1 ; n <= N ; n++)
                                        {
                                        sigma = sigma + 1.0;
                                        }


                              printf("\n computed sum = %f",sigma);
                              }

the impossibility to manipulate the Real numbers

floating point numbers

4095.1 et 4096.1

rounding-off errors

loss of the property of associativity

see the same experiment using 64 bit double precision

MUL

                    float     MUL(x,y)
                    float     x,y;
                              {
                              return(x*y);
                              }


                    main()
                              {
                              float     A=3.1,B=2.3,C=1.5;
                              float     X,Y;


                              X=MUL(MUL(A,B),C);
                              Y=MUL(A,MUL(B,C));


                              printf("\n (%f x %f) x %f = %f\n",A,B,C,X);
                              printf("\n %f x (%f x %f) = %f\n",A,B,C,Y);
                              }

                    (3.10 * 2.30) * 1.50 = 10.695000 = 0x412B1EB8
                                                ####            #

                    3.10 * (2.30 * 1.50) = 10.694999 = 0x412B1EB7
                                                ####            #

java

is that true

                    main()
                              {
                              float     A=3.1,B=2.3,C=1.5;
                              float     X,Y;


                              X=(A*B)*C;
                              Y=A*(B*C);


                              printf("\n (%f x %f) x %f = %f",A,B,C,X);
                              printf("\n %f x (%f x %f) = %f",A,B,C,Y);
                              }

the compiler replaces 32-bit computations with 64-bit ones,
the compiler translates in the same way the two expressions (AxB)xC and Ax(BxC).

MUL

see the same experiment us ing 64 bit double precision

ADD

                    float     ADD(x,y)
                    float     x,y;
                              {
                              return(x+y);
                              }


                    main()
                              {
                              float     A=1.1,B=3.3,C=5.5;
                              float     X,Y;


                              X=ADD(ADD(A,B),C);
                              Y=ADD(A,ADD(B,C));


                              printf("\n (%f x %f) x %f = %f\n",A,B,C,X);
                              printf("\n %f x (%f x %f) = %f\n",A,B,C,Y);
                              }

                    (1.10 + 3.30) + 5.50 = 9.900000 = 0x411E6666
                                                  #            #

                    1.10 + (3.30 + 5.50) = 9.900001 = 0x411E6667
                                                  #            #

the distributivity of the multiplication over the addition

This "anomaly" is very helpfull to recall us that a computer is not able to perform, except for some particular cases (the one of small integer numbers, for example), exact computations!

non linear phenomena

Verhulst Dynamics

                                       2
                    X  = (R+1)X    - RX
                     n         n-1     n-1

                    X
                     0


                    |
                     ---------------
                              |     |
                             \|/   \|/
                              '     '


                                     2
                    X  = (R+1)X  - RX
                     1         0     0


                    |
                     ---------------
                              |     |
                             \|/   \|/
                              '     '


                                     2
                    X  = (R+1)X  - RX
                     2         1     1


                    |
                     ---------------
                              |     |
                             \|/   \|/
                              '     '


                                     2
                    X  = (R+1)X  - RX
                     3         2     2


                    |
                     ---------------
                              |     |
                             \|/   \|/
                              '     '


                    etc...

                    main()
                              {
                              double    R=3.0;
                              double    X=0.5;
                              int       n;


                              for       (n=0 ; n <= 80 ; n++)
                                        {
                                        if        ((n%10) == 0)
                                                  {
                                                  printf("\n iteration(%04d) = %9.6f",n,X);
                                                  }
                                        else
                                                  {
                                                  }


                                        X = (R+1)*X - R*X*X;
                                        }
                              }

                    O200 Silicon Graphics (R10000, IRIX 6.5.5m, cc 7.2.1):

                              option '-O2'   option '-O3'

                    X(00)   = 0.500000       0.500000
                    X(10)   = 0.384631       0.384631
                    X(20)   = 0.418895       0.418895
                    X(30)   = 0.046399       0.046399
                    X(40)   = 0.320184       0.320192
                    X(50)   = 0.063747       0.059988
                    X(60)   = 0.271115       1.000531
                    X(70)   = 1.328462       1.329692
                    X(80)   = 0.817163       0.021952

                      16         16
                    10   + 1 = 10

Regarding the significance of the syntax used

Java

write once, run everywhere

double

                                        O200 Silicon Graphics (processeur R10000, IRIX 6.5.5m, Java):

                              (R+1)X-R(XX)         (R+1)X-(RX)X         ((R+1)-(RX))X        RX+(1-(RX))X         X+R(X-(XX))

                    X(0000) = 0.5                  0.5                  0.5                  0.5                  0.5
                    X(0500) = 1.288736212247168    0.007057813075738616 1.2767485100695732   1.246534177059494    0.03910723014701789
                    X(1000) = 1.3327294162589722   0.916560711983132    1.207710752523091    0.27770146115891703  0.26663342726567785
                    X(1500) = 1.1448646685382955   0.4481000759915065   0.3102077001456977   0.015374092695375374 0.9841637252962943
                    X(2000) = 1.0548628914440754   0.896126931497168    0.6851138190159249   0.009229885271816535 0.3860923315999224
                    X(2500) = 1.292802584458599    0.06063433547953646  1.174118726001978    0.6922411856638806   0.020878761210912034
                    X(3000) = 1.0497821908090537   0.0219606878364607   1.3287403237319588   0.11354602472378028  0.13270749449424302
                    X(3500) = 0.8115039383609847   1.3213031319440816   0.6545151597367076   0.5760786099237328   1.324039473116061
                    X(4000) = 0.04922223042798102  1.3203298564077224   0.09243804931690679  0.9496284087750142   1.316597313359563
                    X(4500) = 0.4745896653599724   0.32865616721789603  0.018965010461877246 0.25384661313701296  0.18512853535354462

                                        PC (processeur Pentium II, LINUX Mandrake 7.0, Java):

                              (R+1)X-R(XX)         (R+1)X-(RX)X         ((R+1)-(RX))X        RX+(1-(RX))X         X+R(X-(XX))

                    X(0000) = 0.5                  0.5                  0.5                  0.5                  0.5
                    X(0500) = 1.2887362122471679   0.00705781307573862  1.2767485100695732   1.2465341770675666   0.03910723014701789
                    X(1000) = 1.3327294162589722   0.91656071198313205  1.207710752523091    0.6676224369769922   0.26663342726567785
                    X(1500) = 1.1448646685382955   0.44810007599150647  0.31020770014569771  0.41049165176544455  0.98416372529629426
                    X(2000) = 1.0548628914440754   0.89612693149716804  0.68511381901592494  1.0026346845706315   0.3860923315999224
                    X(2500) = 1.3328681064703032   0.06063433547953646  1.1741187260019781   0.0154001182074282   0.02087876121091203
                    X(3000) = 1.2956769824648844   0.0219606878364607   1.3287403237319588   0.50504896336548377  0.13270749449424302
                    X(3500) = 0.19193027175727995  0.37986077053509781  0.6545151597367076   0.38299434265835819  1.324039473116061
                    X(4000) = 1.2491385720940165   0.96017143401896088  0.09243804931690679  0.6565274346305322   1.316597313359563
                    X(4500) = 0.00644889182443986  1.3185465795235045   0.01896501046187725  0.94966313327336349  0.18512853535354462

Verhulst Dynamics

To increase the precision of the computation

scale

                    scale   = 010  020  030  040  050  060  070  080  090  100
                    n       = 035  070  105  136  169  199  230  265  303  335

linear models

following program

                    main()
                              {
                              double    B=4095.1;
                              double    A=B+1;


                              double    x=1;


                              int       n;


                              printf("initialisation      x = %.16f\n",x);


                              for       (n=1 ; n <= 9 ; n++)
                                        {
                                        x = (A*x) - B;


                                        printf("iteration %01d         x = %+.16f\n",n,x);
                                        }
                              }

                    initialisation      x =                 +1.0000000000000000
                    iteration 1         x =                 +1.0000000000004547
                    iteration 2         x =                 +1.0000000018630999
                    iteration 3         x =                 +1.0000076314440776
                    iteration 4         x =                 +1.0312591580864137
                    iteration 5         x =               +129.0406374377594148
                    iteration 6         x =            +524468.2550088063580915
                    iteration 7         x =        +2148270324.2415719032287598
                    iteration 8         x =     +8799530071030.8046875000000000
                    iteration 9         x = +36043755123945184.0000000000000000

                    x = 1

                    A-B = 1 (mathematically speaking...)

double

float

Fortran 95

Python

not to confuse mathematical linearity and numerical linearity

                    
                    B       = 4095.1                 = {0x40affe33,33333333}
                    A = B+1 = 4096.1                 = {0x40b00019,9999999a}


                    A-B     = 1.0000000000004547     = {0x3ff00000,00000800}
                                                                        |
                    1.0                              = {0x3ff00000,00000000}

The consequences

the syntax used to describe a mathematical model (as exhibited with this first "experiment"),
the compiler used, including its release and the options (as exhibited with this second "experiment"),
the processor,
the heterogeneity of the systems,...

a priori

sensitive to the initial conditions

The simultaneous use of two or more precisions

excess precision problem

'A' and 'B' are two equal numbers stored in the 80-bit floating point unit ('FPU'). At that time, it is possible to add them, to multiple them,... and to compare them without ambiguity.
Suppose that due to a lack of floating point resources, the number A is moved to memory (and then in a 64-bit format) and suppose that the optimizer "ignores" (as it is the case unfortunately...) that move.
Imagine that some instructions later, 'A' (now a 64-bit number) and 'B' (still a 80-bit number) have to be compared; 'A' is moved from memory to a floating point register and 16 (null) bits are added to it. A and B are no more equal!

                    double    f(x)
                    double    x;
                              {
                              return(1/x);
                              }


                    main()
                              {
                              double              x1=f(3.0);
                              volatile  double    x2;


                              x2=x1;


                              if        (x2 != x1)
                                        {
                                        printf("DIFFERENT\n");
                                        }
                              }

                    A > B

                    fIFGT(A,B)

                    unsigned  int       fIFGT(x,y)
                    double              x;
                    double              y;
                                        {
                                        return(x > y);
                                        }

precise order

slightly different

an error computation

N-body problem

So, why do we need real numbers to do physics

Copyright © Jean-François COLONNA, 2004-2025.
Copyright © France Telecom R&D and CMAP (Centre de Mathématiques APpliquées) UMR CNRS 7641 / École polytechnique, Institut Polytechnique de Paris, 2004-2025.

Are Floating Point Computations Reliable? or again Is a Computer a Perfect Computing Machine?

Jean-François COLONNA [Contact me]

www.lactamme.polytechnique.fr

CMAP (Centre de Mathématiques APpliquées) UMR CNRS 7641, École polytechnique, Institut Polytechnique de Paris, CNRS, France france telecom, France Telecom R&D

Copyright © Jean-François COLONNA, 2004-2025. Copyright © France Telecom R&D and CMAP (Centre de Mathématiques APpliquées) UMR CNRS 7641 / École polytechnique, Institut Polytechnique de Paris, 2004-2025.

Are Floating Point Computations Reliable?

or again

Is a Computer a Perfect Computing Machine?

Jean-François COLONNA
[Contact me]

CMAP (Centre de Mathématiques APpliquées) UMR CNRS 7641, École polytechnique, Institut Polytechnique de Paris, CNRS, France
france telecom, France Telecom R&D

Copyright © Jean-François COLONNA, 2004-2025.
Copyright © France Telecom R&D and CMAP (Centre de Mathématiques APpliquées) UMR CNRS 7641 / École polytechnique, Institut Polytechnique de Paris, 2004-2025.