Intro
Kahan summation / compensated summation is technique that addresses compilers´ inability to respect the associative property of numbers. Truncation errors results in (a+b)+c not being exactly equal to a+(b+c) and thus accumulate an undesired relative error on longer series of sums, which is a common obstacle in scientific computing.  
Task
I desire the optimal implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.
Attempts
The code below calculates the sum of 1000 random numbers in range [0,1] with three approaches.
- Standard summation: Naive implementation which accumulates a root mean square relative error that grows as O(sqrt(N)) 
- Kahan summation [g++]: Compensated summation using the c/c++ function "csum". Explanation in comments. Note that some compilers may have default flags that invalidate this implementation (see output below). 
- Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments. 
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
extern "C" void csumasm(double&, double, double&);
__asm__(
    "csumasm:\n"
    "movsd  (%rcx), %xmm0\n" //xmm0 = a
    "subsd  (%r8), %xmm1\n"  //xmm1 - r8 (c) | y = b-c
    "movapd %xmm0, %xmm2\n"  
    "addsd  %xmm1, %xmm2\n"  //xmm2 + xmm1 (y) | b = a+y
    "movapd %xmm2, %xmm3\n" 
    "subsd  %xmm0, %xmm3\n"  //xmm3 - xmm0 (a) | b - a
    "movapd %xmm3, %xmm0\n"  
    "subsd  %xmm1, %xmm0\n"  //xmm0 - xmm1 (y) | - y
    "movsd  %xmm0, (%r8)\n"  //xmm0 to c
    "movsd  %xmm2, (%rcx)\n" //b to a
    "ret\n"
);
void csum(double &a,double b,double &c) { //this function adds a and b, and passes c as a compensation term
    double y = b-c; //y is the correction of b argument
    b = a+y; //add corrected b argument to a argument. The output of the current summation
    c = (b-a)-y; //find new error to be passed as a compensation term
    a = b;
}
double fun(double fMin, double fMax){
    double f = (double)rand()/RAND_MAX;
    return fMin + f*(fMax - fMin); //returns random value
}
int main(int argc, char** argv) {
    int N = 1000;
    srand(0); //use 0 seed for each method
    double sum1 = 0;
    for (int n = 0; n < N; ++n)
        sum1 += fun(0,1);
    srand(0);
    double sum2 = 0;
    double c = 0; //compensation term
    for (int n = 0; n < N; ++n)
        csum(sum2,fun(0,1),c);
    srand(0);
    double sum3 = 0;
    c = 0;
    for (int n = 0; n < N; ++n)
        csumasm(sum3,fun(0,1),c);
    printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
    printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
    printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
    return 0;
}
The output with -O3 is:
Standard summation:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
 5.1991955320902127e+002 (error: 0.0000000000000000e+000)
Kahan compensated summation [asm]:
 5.1991955320902127e+002
The output with -O3 -ffast-math
Standard summation:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [asm]:
 5.1991955320902127e+002
It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.
Question
- Is it possible to construct a better/faster asm x64 code for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions? 
- If no better asm codes are possible, is there a c++ way to implement Kahan summation that can be used with -ffast-math without devolving to the naive summation? Perhaps a c++ implementation is generally more flexible for the compiler to optimize. 
Ideas or suggestions are appreciated.
Further information
- The contents of "fun" cannot be inlined, but the "csum" function could be.
- The sum must be calculated as an iterative process (the corrected term must be applied on every single addition). This is because the intended summation function takes an input that depends on the previous sum.
- The intended summation function is called indefinitely and several hundred million times per second, which motives the pursuit of a high performance low-level implementation.
- Higher precision arithmetic such as long double, float128 or arbitrary precision libraries are not to be considered as higher precision solutions due to performance reasons.
Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)
        subsd   xmm0, QWORD PTR [rsp+32]
        movapd  xmm1, xmm3
        addsd   xmm3, xmm0
        movsd   QWORD PTR [rsp+16], xmm3
        subsd   xmm3, xmm1
        movapd  xmm1, xmm3
        subsd   xmm1, xmm0
        movsd   QWORD PTR [rsp+32], xmm1
 
    