What books and articles can you recommend for learning the basics of cache coherence problems in large SMP systems (which are really NUMA and ccNUMA machines) with 16 or more CPU sockets?
Something like an analysis of the SGI Altix architecture would be interesting.
Which protocols (MOESI or something else) scale well to systems of that size?