同样的代码,在ARM64和x86_64分别运行,发现部分计算结果从小数点后17位开始出现不同,双精度浮点运算结果有细微差异。
为了方便定位差异,我们编写一段简单浮点运算代码。
#include <math.h> #include <stdio.h> int main() { double dv[] = {-0.13942759833577333949961030157283, -0.046687081540714665817137785097657, -0.48496455527857718070805503884912, -0.60722091847450498924843032000354, 1, 1, 0, 0, -3.4221570491790771484375}; double dw[] = {0.019916933333333299710465880139054, -0.021317733333333300366208007403657, 0.13007783333333300390677322866395, 0.0013823433333333299752321288167423, 0.11358973333333299837732965897885, 0.21664233333333299258427473432675, -0.078728843333333298204479433479719, 0.31518266666666699959975517231214, 0.10316096666666700609749085515432 }; double dp = 0.0; int i = 0; for(i=0; i<9; i++){ dp += dv[i]*dw[i]; printf("\n========================"); printf("\ndv[%d]:%.32f ", i,dv[i]); printf("\ndw[%d]:%.32f ", i, dw[i]); printf("\ndp=dp+dv[%d]*dw[%d]:%.32f ", i,i,dp); printf("\n"); } return 0; }
打开-O3编译选项,代码编译后,在x86_64机器上运行结果为:
[root@localhost chen]# ./floating_precision
========================
dv[0]:-0.13942759833577333949961030157283
dw[0]:0.01991693333333329971046588013905
dp=dp+dv[0]*dw[0]:-0.00277697018088037071020801604959
========================
dv[1]:-0.04668708154071466581713778509766
dw[1]:-0.02131773333333330036620800740366
dp=dp+dv[1]*dw[1]:-0.00178170742648382784967253655850
========================
dv[2]:-0.48496455527857718070805503884912
dw[2]:0.13007783333333300390677322866395
dp=dp+dv[2]*dw[2]:-0.06486484602058455173345663524742
========================
dv[3]:-0.60722091847450498924843032000354
dw[3]:0.00138234333333332997523212881674
dp=dp+dv[3]*dw[3]:-0.06570423380909833077634374376430
========================
dv[4]:1.00000000000000000000000000000000
dw[4]:0.11358973333333299837732965897885
dp=dp+dv[4]*dw[4]:0.04788549952423466760098591521455
========================
dv[5]:1.00000000000000000000000000000000
dw[5]:0.21664233333333299258427473432675
dp=dp+dv[5]*dw[5]:0.26452783285756764630747284172685
========================
dv[6]:0.00000000000000000000000000000000
dw[6]:-0.07872884333333329820447943347972
dp=dp+dv[6]*dw[6]:0.26452783285756764630747284172685
========================
dv[7]:0.00000000000000000000000000000000
dw[7]:0.31518266666666699959975517231214
dp=dp+dv[7]*dw[7]:0.26452783285756764630747284172685
========================
dv[8]:-3.42215704917907714843750000000000
dw[8]:0.10316096666666700609749085515432
dp=dp+dv[8]*dw[8]:-0.08850519642089466065826286467200
[root@localhost chen]# gcc -o floating_precision_con -ffp-contract=on floating_precision.c
[root@localhost chen]# ./floating_precision_con
========================
dv[0]:-0.13942759833577333949961030157283
dw[0]:0.01991693333333329971046588013905
dp=dp+dv[0]*dw[0]:-0.00277697018088037071020801604959
========================
dv[1]:-0.04668708154071466581713778509766
dw[1]:-0.02131773333333330036620800740366
dp=dp+dv[1]*dw[1]:-0.00178170742648382784967253655850
========================
dv[2]:-0.48496455527857718070805503884912
dw[2]:0.13007783333333300390677322866395
dp=dp+dv[2]*dw[2]:-0.06486484602058455173345663524742
========================
dv[3]:-0.60722091847450498924843032000354
dw[3]:0.00138234333333332997523212881674
dp=dp+dv[3]*dw[3]:-0.06570423380909833077634374376430
========================
dv[4]:1.00000000000000000000000000000000
dw[4]:0.11358973333333299837732965897885
dp=dp+dv[4]*dw[4]:0.04788549952423466760098591521455
========================
dv[5]:1.00000000000000000000000000000000
dw[5]:0.21664233333333299258427473432675
dp=dp+dv[5]*dw[5]:0.26452783285756764630747284172685
========================
dv[6]:0.00000000000000000000000000000000
dw[6]:-0.07872884333333329820447943347972
dp=dp+dv[6]*dw[6]:0.26452783285756764630747284172685
========================
dv[7]:0.00000000000000000000000000000000
dw[7]:0.31518266666666699959975517231214
dp=dp+dv[7]*dw[7]:0.26452783285756764630747284172685
========================
dv[8]:-3.42215704917907714843750000000000
dw[8]:0.10316096666666700609749085515432
dp=dp+dv[8]*dw[8]:-0.08850519642089466065826286467200
[root@localhost chen]#
同样编译选项,在ARM64机器上运行结果为:
-bash-4.3# gcc -o floating_precision_on -O3 floating_precision.c
-bash-4.3# ./floating_precision_on
========================
dv[0]:-0.13942759833577333949961030157283
dw[0]:0.01991693333333329971046588013905
dp=dp+dv[0]*dw[0]:-0.00277697018088037071020801604959
========================
dv[1]:-0.04668708154071466581713778509766
dw[1]:-0.02131773333333330036620800740366
dp=dp+dv[1]*dw[1]:-0.00178170742648382784967253655850
========================
dv[2]:-0.48496455527857718070805503884912
dw[2]:0.13007783333333300390677322866395
dp=dp+dv[2]*dw[2]:-0.06486484602058455173345663524742
========================
dv[3]:-0.60722091847450498924843032000354
dw[3]:0.00138234333333332997523212881674
dp=dp+dv[3]*dw[3]:-0.06570423380909833077634374376430
========================
dv[4]:1.00000000000000000000000000000000
dw[4]:0.11358973333333299837732965897885
dp=dp+dv[4]*dw[4]:0.04788549952423466760098591521455
========================
dv[5]:1.00000000000000000000000000000000
dw[5]:0.21664233333333299258427473432675
dp=dp+dv[5]*dw[5]:0.26452783285756764630747284172685
========================
dv[6]:0.00000000000000000000000000000000
dw[6]:-0.07872884333333329820447943347972
dp=dp+dv[6]*dw[6]:0.26452783285756764630747284172685
========================
dv[7]:0.00000000000000000000000000000000
dw[7]:0.31518266666666699959975517231214
dp=dp+dv[7]*dw[7]:0.26452783285756764630747284172685
========================
dv[8]:-3.42215704917907714843750000000000
dw[8]:0.10316096666666700609749085515432
dp=dp+dv[8]*dw[8]:-0.08850519642089464678047505685754
-bash-4.3#
我们发现最后一个结果出现细微差异。x86_64计算结果为-0.08850519642089466065826286467200,而ARM64运算结果为-0.08850519642089464678047505685754。但不用-O3编译选项编译,浮点运算结果与x86_64完全一致。说明ARM64浮点运算精度丢失与-O3编译选项有关。
-bash-4.3# gcc -o floating_precision floating_precision.c
-bash-4.3# ./floating_precision
========================
dv[0]:-0.13942759833577333949961030157283
dw[0]:0.01991693333333329971046588013905
dp=dp+dv[0]*dw[0]:-0.00277697018088037071020801604959
========================
dv[1]:-0.04668708154071466581713778509766
dw[1]:-0.02131773333333330036620800740366
dp=dp+dv[1]*dw[1]:-0.00178170742648382784967253655850
========================
dv[2]:-0.48496455527857718070805503884912
dw[2]:0.13007783333333300390677322866395
dp=dp+dv[2]*dw[2]:-0.06486484602058455173345663524742
========================
dv[3]:-0.60722091847450498924843032000354
dw[3]:0.00138234333333332997523212881674
dp=dp+dv[3]*dw[3]:-0.06570423380909833077634374376430
========================
dv[4]:1.00000000000000000000000000000000
dw[4]:0.11358973333333299837732965897885
dp=dp+dv[4]*dw[4]:0.04788549952423466760098591521455
========================
dv[5]:1.00000000000000000000000000000000
dw[5]:0.21664233333333299258427473432675
dp=dp+dv[5]*dw[5]:0.26452783285756764630747284172685
========================
dv[6]:0.00000000000000000000000000000000
dw[6]:-0.07872884333333329820447943347972
dp=dp+dv[6]*dw[6]:0.26452783285756764630747284172685
========================
dv[7]:0.00000000000000000000000000000000
dw[7]:0.31518266666666699959975517231214
dp=dp+dv[7]*dw[7]:0.26452783285756764630747284172685
========================
dv[8]:-3.42215704917907714843750000000000
dw[8]:0.10316096666666700609749085515432
dp=dp+dv[8]*dw[8]:-0.08850519642089466065826286467200
-bash-4.3#
既然与-O3编译选项有关,我们就在ARM64上对同样代码-O3打开/关闭生成汇编代码,比较生成汇编代码差异。
对比发现没有-O3时,dp += dv[i]*dw[i];语句生成的汇编指令为fmul,而打开-O3时,生成汇编指令为fmadd。即fmadd指令运算导致浮点精度丢失。
-bash-4.3# cat floating_precision.s.ffp_off
.L2:
ldr d9, [x23,x19,lsl 3]
mov x0, x25
ldr d0, [x24,x19,lsl 3]
fmul d1, d0, d9
str q0, [x29,96]
fadd d8, d8, d1
bl printf
ldr q0, [x29,96]
mov w1, w19
mov x0, x22
bl printf
-bash-4.3# cat floating_precision.s.O3
.L2:
ldr d9, [x24,x19,lsl 3]
mov x0, x23
ldr d0, [x25,x19,lsl 3]
fmadd d8, d0, d9, d8
str q0, [x29,96]
bl printf
ldr q0, [x29,96]
mov w1, w19
mov x0, x22
bl printf
fmov d0, d9
mov w1, w19
mov x0, x21
bl printf
fmadd是针对fmul指令的优化。编译器打开-ffp-contract时,就使用fmadd指令,取代fmul。而我们ARM64上的编译器,恰好默认是-ffp-contract=fast,这样就生成了fmadd指令,取代fmul,导致双精度浮点运算精度丢失。
在编译时,指定-ffp-contract=off,关闭浮点运算优化,运算结果就和x86_64完全一致。
-ffp-contract=
style-ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression contraction if allowed by the language standard. This is currently not implemented and treated equal to -ffp-contract=off. The default is -ffp-contract=fast.
FABS Sd, Sn |
Calculates the absolute value. |
FNEG Sd, Sn |
Negates the value. |
FSQRT Sd, Sn |
Calculates the square root. |
FADD Sd, Sn, Sm |
Adds values. |
FSUB Sd, Sn, Sm |
Subtracts values. |
FDIV Sd, Sn, Sm |
Divides one value by another. |
FMUL Sd, Sn, Sm |
Multiplies two values. |
FNMUL Sd, Sn, Sm |
Multiplies and negates. |
FMADD Sd, Sn, Sm, Sa |
Multiplies and adds (fused). |
FMSUB Sd, Sn, Sm, Sa |
Multiplies, negates and subtracts (fused). |
FNMADD Sd, Sn, Sm, Sa |
Multiplies, negates and adds (fused). |
FNMSUB Sd, Sn, Sm, Sa |
Multiplies, negates and subtracts (fused). |
FPINTy Sd, Sn |
Rounds to an integral in floating-point format (where y is one of a number of rounding mode options) |
FCMP Sn, Sm |
Performs a floating-point compare. |
FCCMP Sn, Sm, #uimm4, cond |
Performs a floating-point conditional compare. |
FCSEL Sd, Sn, Sm, cond |
Floating-point conditional select if (cond) Sd = Sn else Sd = Sm. |
FCVTSty Rn, Sm |
Converts a floating-point value to an integer value (ty specifies type of rounding). |
SCVTF Sm, Ro |
Converts an integer value to a floating-point value. |
为什么指令 fmadd 相较于 fmul、fads 指令会导致精度丢失呢?
ARM64处理器是这么设计的。CPU手册里明确指出fmadd指令会有精度丢失问题。
fmadd 只在计算完成后进行了一次round,理论上是比分开计算乘加更精确的。
为什么你说fmadd指令会有精度丢失问题?
这是ARM64处理器的设计,芯片手册里已说明fmadd指令会导致精度丢失。
All floating-point Multiply-Add and Multiply-Subtract instructions are fused.
Fused multiply was introduced in VFPv4 and means that the result of the multiply is not rounded before being used in the addition. In earlier ARM floating-point architectures, a Multiply Accumulate operation would perform rounding of both the intermediate result and final results, which could potentially cause a small loss of precision.
之前的arm浮点架构,在使用乘加指令时,中间过程和结果都进行round,这时候会丢失精度,但是VFPv4中fmadd是fused的,中间结果不会round,应该是更精确的吧?
谢谢关注。这段代码是在基于A57架构的CPU上验证的。最近架构的结果,我没有验证过。
A57 是支持vfpv4的,我觉得这段代码在x86_64和armv8之间的精度差异应该不是fmadd指令。而是x86_64支持的80bit浮点计算