Basically I am looking to implement the following in x86-64 assembly as fast as possible. (Here foo and bar may be something like glibc's hand-written asm strcpy or strcmp: we want to start out with wide vectors, but without the safety and/or performance downsides of a page-split load when one isn't needed, or of an AVX-512 masked store, where fault suppression works for correctness but is slow if it actually has to suppress a fault in the destination.)
#include <stdint.h>
#include <immintrin.h>

#define TYPE __m256i

int foo_handling_page_cross(void *ptr1, void *ptr2);
int bar_with_no_page_cross(void *ptr1, void *ptr2);

int has_page_cross(void *ptr1, void *ptr2) {
    uint64_t ptr1_u64 = (uint64_t)ptr1;
    uint64_t ptr2_u64 = (uint64_t)ptr2;
    ptr1_u64 &= 4095;
    ptr2_u64 &= 4095;
    if ((ptr1_u64 + sizeof(TYPE)) > 4096
        || (ptr2_u64 + sizeof(TYPE)) > 4096) {
        // At least one of the two loads would cross a page boundary
        return foo_handling_page_cross(ptr1, ptr2);
    }
    return bar_with_no_page_cross(ptr1, ptr2);
}
There are a lot of really efficient ways to do this for one pointer, many of which are discussed in "Is it safe to read past the end of a buffer within the same page on x86 and x64?", but there does not seem to be a particularly efficient approach for two pointers that does not sacrifice accuracy.
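For reference, the single-pointer test boils down to one compare of the page offset; here is a minimal C sketch (the name one_ptr_page_cross and the LSIZE definition are mine, assuming 4 KiB pages):

#include <stdint.h>

#define LSIZE 32  /* load width in bytes, e.g. sizeof(__m256i) */

/* A load of LSIZE bytes starting at ptr stays within one 4 KiB page
   exactly when its page offset is <= 4096 - LSIZE. */
static inline int one_ptr_page_cross(const void *ptr) {
    return ((uintptr_t)ptr & 4095) > (4096 - LSIZE);
}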
Approaches

From here on out, assume ptr1 starts in rdi and ptr2 in rsi. The load size is represented by the constant LSIZE.
Fast with False Positives
// cycles, bytes
movl %edi, %eax // 0 , 2 # assuming mov-elimination
orl %esi, %eax // 0 , 5 # which Ice Lake disabled
andl $4095, %eax // 1 , 10
cmpl $(4096 - LSIZE), %eax // 2 , 15
ja L(page_cross)
/* fewer bytes
movl %edi, %eax // 0 , 2
orl %esi, %eax // 1 , 5
sall $20, %eax // 2 , 8
cmpl $(4096 - LSIZE) << 20, %eax // 3 , 13
ja L(page_cross)
*/
- Latency : 3c
- Throughput: ~1.08c measured (both versions)
- Bytes : 15b (13b for the shift version)
This approach is nice because it is fast with 3c latency (assuming movl %edi, %eax is eliminated), has high throughput, and is pretty compact for the front end.
The obvious drawback is that it produces false positives, e.g. rdi = 4000, rsi = 95: the OR of the two offsets is 4095, so it reports a cross even though neither load actually crosses. I think, though, that its performance should serve as the goal for a fully correct solution.
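In C terms, the approximate check merges the two page offsets with OR before a single compare; a sketch under the same assumptions as above (may_page_cross is my naming):

#include <stdint.h>

#define LSIZE 32  /* load width in bytes */

/* OR-ing the offsets lets set bits from one pointer combine with set
   bits from the other, which is exactly where the false positives come
   from: (4000 | 95) & 4095 == 4095 even though neither load crosses. */
static inline int may_page_cross(uintptr_t p1, uintptr_t p2) {
    return ((p1 | p2) & 4095) > (4096 - LSIZE);
}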
Slower but Correct
This is the best I've been able to come up with:
// cycles, bytes
leal (LSIZE - 1)(%rdi), %eax // 0 , 4
leal (LSIZE - 1)(%rsi), %edx // 0 , 8
xorl %edi, %eax // 1 , 11
xorl %esi, %edx // 1 , 14
orl %edx, %eax // 2 , 17
testl $4096, %eax // 3 , 22
jnz L(page_cross)
- Latency : 4c
- Throughput: ~1.75c measured (note: measured on Ice Lake, so higher throughput than older CPUs)
- Bytes : 21b
It has 4c latency, which isn't so bad, but its throughput is worse and it has a much larger code footprint.
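For reference, the exact test this asm implements, as a C sketch (has_page_cross_exact is my naming, LSIZE as above):

#include <stdint.h>

#define LSIZE 32  /* load width in bytes */

/* Adding LSIZE - 1 flips bit 12 of the address exactly when the last
   byte of the load lands on a different 4 KiB page than the first byte
   (LSIZE - 1 < 4096, so at most one boundary can be crossed). XOR-ing
   with the original pointer exposes that flip in bit 12. */
static inline int has_page_cross_exact(uintptr_t p1, uintptr_t p2) {
    uintptr_t cross = (p1 ^ (p1 + LSIZE - 1)) | (p2 ^ (p2 + LSIZE - 1));
    return (cross & 4096) != 0;
}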
Question
- Can either of these approaches be improved in terms of latency, throughput, or bytes? Generally I am most interested in latency, then throughput, then bytes.

My general goal is to get the fully correct version as fast as the false-positive one.
Edit: Fixed bug in Correct version.
CPUs: Personally I am tuning for CPUs with AVX-512, so Skylake Server, Ice Lake, and Tiger Lake, but this question targets the entire Sandy Bridge family.