On Ubuntu 14.04 LTS with 36 logical CPUs in total (2 CPUs x 9 cores x 2 hyper-threads), lscpu gives me:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 36
On-line CPU(s) list: 0-35
Thread(s) per core: 2
Core(s) per socket: 9
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Stepping: 2
CPU MHz: 1200.000
BogoMIPS: 5858.45
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-8,18-26
NUMA node1 CPU(s): 9-17,27-35
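For reference, the hyper-thread pairing can be read directly from sysfs (the paths below are the standard Linux CPU-topology files, not anything specific to this machine):

```shell
# Which logical CPUs share a physical core with cpu0?
# On the machine above this should read "0,18".
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

# One line per logical CPU: CPU, core, socket, NUMA node.
lscpu -p=CPU,CORE,SOCKET,NODE
```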
As is known, data exchange is faster between the cores of a single CPU (via the shared L3 cache) than between cores on different CPUs (via the QPI link).
CPUs 0-8 and 9-17 are the physical cores of the two NUMA nodes, while 18-26 and 27-35 are their hyper-threading siblings. Is it preferable to occupy all the physical cores first, and only in a second round to place a second logical thread on each physical core? That is, will this increase overall performance?
Or does it mean that if I launch more than 9 threads, for example 12 threads, then 9 threads will execute on the 1st CPU (NUMA node0, CPUs 0-8) and 3 threads on the 2nd CPU (NUMA node1, CPUs 9-11)? And will this increase the latency of exchange between the threads and reduce overall performance?
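If placement matters, I assume the spread could also be forced explicitly, e.g. with taskset from util-linux (my_app below is a placeholder name for the real program; the echo is just a minimal runnable demonstration of the syntax):

```shell
# Restrict a program to the physical cores of NUMA node0 only:
#   taskset -c 0-8 ./my_app    # my_app is a hypothetical program name
# Minimal runnable demonstration, pinned to CPU 0 only:
taskset -c 0 echo "pinned to CPU 0"
```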
How can I change the assignment of cores to NUMA nodes so that it looks like this?
NUMA node0 CPU(s): 0-17
NUMA node1 CPU(s): 18-35
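For comparison, the kernel's current assignment can be read from sysfs (the node*/cpulist paths below are the standard Linux NUMA files, assuming NUMA sysfs is exposed on the system):

```shell
# Print each NUMA node's CPU list as the kernel currently assigns it.
for node in /sys/devices/system/node/node[0-9]*; do
    [ -d "$node" ] || continue          # skip if no NUMA sysfs is present
    printf '%s: ' "$(basename "$node")"
    cat "$node/cpulist"
done
```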