I have run into a weird issue: the CPU seems to believe that I am modifying code that is currently being executed, and it repeatedly triggers self-modifying code (SMC) machine clears.
My (simplified) program does the following:
1. Allocate an executable buffer.
2. Copy a 64-byte payload to some position X in the buffer.
3. Call the payload at position X.
4. Go back to step 2.
...for 100'000'000 iterations.
main.c:
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
extern void smc(void *bufferPtr, void *bufferEndPtr);
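int main(void)
{
    const int BUFFER_LENGTH = 4096;
    /* MAP_ANONYMOUS mappings should pass fd = -1 for portability. */
    void *bufferPtr = mmap(NULL, BUFFER_LENGTH, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (bufferPtr == MAP_FAILED) {
        /* errno is only meaningful when mmap has failed. */
        printf("mmap failed: %s\n", strerror(errno));
        return 1;
    }
    /* Cast to char * because arithmetic on void * is a GNU extension. */
    void *bufferEndPtr = (char *)bufferPtr + BUFFER_LENGTH;
    printf("Instruction block buffer: %p\n", bufferPtr);
    smc(bufferPtr, bufferEndPtr);
    return 0;
}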
smc.asm:
[section .text]
align 64
payload:
    ret
%define BUFFER_STEP 64
align 64
[global smc]
; rdi: bufferPtr
; rsi: bufferEndPtr
smc:
    push r10
    push r11
    push r12
    mov rax, 100_000_000
    mov r10, rdi ; r10 points to begin of buffer
    mov r11, rdi ; r11 points to current buffer position
    mov r12, rsi ; r12 points to end of buffer
.loop:
    ; Done?
    dec rax
    je .end
    mov rcx, 64
    mov rdi, r11
    lea rsi, [rel payload]
    ; Store
    rep movsb
    ; Call
    call r11
    ; Move buffer pointer
    lea r11, [r11 + BUFFER_STEP]
    cmp r11, r12
    jb .next
    mov r11, r10
.next:
    jmp .loop
.end:
    pop r12
    pop r11
    pop r10
    ret
Compile with:
nasm smc.asm -f elf64 -o smc.o
gcc -c main.c -O2 -o main.o
gcc main.o smc.o -o prog
I measure the program's execution time and the MACHINE_CLEARS.SMC performance counter using
sudo perf stat -e r04c3 ./prog
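The same counter can also be read from inside a program with the Linux `perf_event_open` system call, which makes it possible to measure only the loop rather than the whole process. The following is a minimal sketch, not something I used for the numbers below. The helper names `smc_event_attr` and `count_smc_clears` are made up for illustration, the raw encoding 0x04c3 is only meaningful on Intel CPUs that define this event, and unlike the default `perf stat` invocation this version counts user-space only.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: describe the raw event behind `perf stat -e r04c3`
 * (umask 0x04, event select 0xC3, i.e. MACHINE_CLEARS.SMC). */
struct perf_event_attr smc_event_attr(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof attr;
    attr.config = 0x04c3;
    attr.disabled = 1;       /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1; /* assumption: user-space counts are enough */
    attr.exclude_hv = 1;
    return attr;
}

/* Hypothetical helper: run `workload` and store the number of counted
 * events in *clears. Returns 0 on success, -1 on failure (for example
 * when perf events are not permitted for this user). */
int count_smc_clears(void (*workload)(void), uint64_t *clears)
{
    struct perf_event_attr attr = smc_event_attr();
    /* pid = 0, cpu = -1: this thread on any CPU; no group, no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    workload();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* With read_format == 0 the kernel returns a single 64-bit count. */
    int ok = read(fd, clears, sizeof *clears) == (ssize_t)sizeof *clears;
    close(fd);
    return ok ? 0 : -1;
}
```

Wrapping the call to `smc(bufferPtr, bufferEndPtr)` in a small `void(void)` function and passing it as `workload` would give a count for the loop alone.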
Results on an Intel Core i7-7567U:
| BUFFER_LENGTH (bytes) | BUFFER_STEP (bytes) | MACHINE_CLEARS.SMC | Execution time (seconds) |
|---|---|---|---|
| 1 x 4K | 0 | 199'999'982 | 14.53 |
| 1 x 4K | 64 | 199'999'740 | 14.91 |
| 256 x 4K | 2048 | 105'550'699 | 7.89 |
| 256 x 4K | 4096 | 130'573'069 | 9.83 |
Although I am shifting the store destination (writing to a different location each time), I still get more than 100 million SMC machine clears, i.e. at least one per iteration, leading to a massive performance penalty.
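To put the counter values in perspective, they can be normalized per loop iteration. The helper names below are purely illustrative, and the inputs are the values from the table above with the 100'000'000 iterations of the loop.

```c
#include <stdio.h>

/* Number of loop iterations used in the measurements. */
static const double ITERATIONS = 100000000.0;

/* Average number of SMC machine clears per loop iteration. */
double clears_per_iteration(double machine_clears)
{
    return machine_clears / ITERATIONS;
}

/* Average wall-clock time per loop iteration, in nanoseconds. */
double ns_per_iteration(double seconds)
{
    return seconds * 1e9 / ITERATIONS;
}

/* Print one normalized row of the results table. */
void print_row(const char *label, double machine_clears, double seconds)
{
    printf("%-18s %.2f clears/iter, %.1f ns/iter\n", label,
           clears_per_iteration(machine_clears), ns_per_iteration(seconds));
}
```

For the first two rows this works out to roughly 2 machine clears and about 145 to 149 ns per iteration; for the 256 x 4K rows it is roughly 1.06 and 1.31 clears per iteration.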
Adding various fences and/or serializing instructions before/after the store does not yield any considerable improvement. Note that, while the shifting somewhat reduces the number of machine clears, it also leads to a large number of branch target mispredictions at the call instruction.
When I run the same program with a 4K buffer, a 0-byte step, an mfence after the store, and call payload instead of call r11 (so the original payload in .text is executed rather than the copy in the buffer), it only takes around 1.74 seconds, which is expected given the total number of executed instructions.
What is causing this huge number of machine clears, and how can I work around that?