In a nutshell, the SO_REUSEPORT socket option allows multiple sockets to bind to the same ip:port pair. For example, program1 and program2 can both call the chain socket()->bind()->listen()->accept() on the same IP and port, and the kernel will distribute incoming connections evenly between the two programs.
I assumed that with this option you could get rid of fork() for spawning additional workers and simply run a new instance of the program instead.
I wrote a simple epoll-based socket server following this logic and tested it with weighttp:
weighttp -n 1000000 -c 1000 -t 4 http://127.0.0.1:8080/
With two running instances the result is ~44000 RPS; with one running instance it is close to ~51000 RPS. I was very surprised by the 7000 RPS difference.
After this test I added a fork() before listen() and ran a single instance of the server, so it now has the same logic as the previous implementation (two processes, each with an epoll loop on a listening socket), but socket()->bind() is called only once, before fork(), and the second process receives a copy of the FD for its listen() call.
I ran the tests again, and this version shows ~50000 RPS!
So my question is very simple: what magic does fork() perform in this case, and why is it faster than two independent processes, each with its own socket()? The kernel does the same scheduling job either way; I don't see any important difference.