I am currently optimizing data processing logic for parallel execution. I have noticed that as the core count increases, data processing performance does not necessarily increase the way I expect it to.
Here is the test code:
Console.WriteLine($"{DateTime.Now}: Data processing start");
double lastElapsedMs = 0;
for (int i = 1; i <= Environment.ProcessorCount; i++)
{
var watch = System.Diagnostics.Stopwatch.StartNew();
ProccessData(i); // main processing method
watch.Stop();
double elapsedMs = watch.ElapsedMilliseconds;
Console.WriteLine($"{DateTime.Now}: Core count: {i}, Elapsed: {elapsedMs}ms");
lastElapsedMs = elapsedMs;
}
Console.WriteLine($"{DateTime.Now}: Data processing end");
and
public static void ProcessData(int coreCount)
{
    // First part is data preparation:
    // splitting one collection into smaller chunks, depending on the core count.
    ////////////////
    // combinations = collection of data
    var length = combinations.Length;
    int chunkSize = length / coreCount;
    int[][][] chunked = new int[coreCount][][];
    for (int i = 0; i < coreCount; i++)
    {
        int skip = i * chunkSize;
        int take = chunkSize;
        int diff = (length - skip) - take;
        // the last chunk absorbs any remainder
        if (diff < chunkSize)
        {
            take = take + diff;
        }
        chunked[i] = combinations.Skip(skip).Take(take).ToArray();
    }
    // Second part is iteration. One chunk of data is processed per core.
    ////////////////
    Parallel.For(0, coreCount, new ParallelOptions() { MaxDegreeOfParallelism = coreCount }, (chunkIndex, state) =>
    {
        var chunk = chunked[chunkIndex];
        int chunkLength = chunk.Length;
        // iterate over the data inside the chunk
        for (int idx = 0; idx < chunkLength; idx++)
        {
            // additional processing logic here for single data
        }
    });
}
The results are as follows:
As you can see from the result set, using 2 cores instead of 1 gives an almost ideal performance increase (given that 1 core runs at 4700MHz while 2 cores run at 4600MHz each).
After that, when the data was processed in parallel on 3 cores, I expected a 33% performance increase compared to the 2-core execution; the actual increase is 21.62%.
From there, as the core count increases, the degradation of the "parallel" execution performance keeps growing.
In the end, at 12 cores the difference between the actual and the ideal result is more than twice as big (96442ms vs 39610ms)!
I certainly did not expect the difference to be this big. I have an Intel i7-8700K processor: 6 physical cores plus 6 logical via Hyper-Threading, 12 threads in total. In turbo mode it runs 1 core at 4700MHz, 2 cores at 4600, 3 at 4500, 4 at 4400, 5 at 4400, and 6 at 4300.
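As a rough sanity check, here is a small sketch (not part of the test itself) that folds the turbo clock drop into the ideal scaling estimate; baselineMs is a placeholder for the measured 1-core time, and the frequencies are the turbo values listed above:
double baselineMs = 475000; // placeholder, substitute the measured 1-core time
double[] turboMhz = { 4700, 4600, 4500, 4400, 4400, 4300 };
for (int cores = 1; cores <= turboMhz.Length; cores++)
{
    double ideal = baselineMs / cores; // perfect linear scaling
    double adjusted = ideal * (turboMhz[0] / turboMhz[cores - 1]); // clock drop
    Console.WriteLine($"{cores}C: ideal {ideal:F0}ms, adjusted {adjusted:F0}ms");
}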
If it matters, I have made some additional observations in Core Temp:
- when 1-core processing was running, 1 out of 6 cores was at 50% load
- when 2-core processing was running, 2 out of 6 cores were at 50% load
- when 3-core processing was running, 3 out of 6 cores were at 50% load
- when 4-core processing was running, 4 out of 6 cores were at 50% load
- when 5-core processing was running, 5 out of 6 cores were at 50% load
- when 6-core processing was running, all 6 cores were at 50% load
- when 7-core processing was running, 5 out of 6 cores were at 50% load and 1 core at 100%
- when 8-core processing was running, 4 out of 6 cores were at 50% load and 2 cores at 100%
- when 9-core processing was running, 3 out of 6 cores were at 50% load and 3 cores at 100%
- when 10-core processing was running, 2 out of 6 cores were at 50% load and 4 cores at 100%
- when 11-core processing was running, 1 out of 6 cores was at 50% load and 5 cores at 100%
- when 12-core processing was running, all 6 cores were at 100% load
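Note that Environment.ProcessorCount, which my test loop uses as its upper bound, counts logical processors rather than physical cores, so it returns 12 on this CPU; presumably the 50% readings correspond to one of the two hardware threads per physical core being busy. A minimal check:
// Environment.ProcessorCount reports logical processors (hardware threads),
// not physical cores: 12 on a 6-core CPU with Hyper-Threading enabled.
Console.WriteLine($"Logical processors: {Environment.ProcessorCount}");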
I can certainly see that the end result should not be as performant as the ideal result, because the frequency per core decreases, but still: is there a good explanation for why my code performs so badly at 12 cores? Is this the general situation on every machine, or could it be a limitation of my PC?
.NET Core 2 was used for the tests.
Edit: Sorry, I forgot to mention that the data chunking can be optimized, since I wrote it as a draft solution. Nevertheless, the splitting completes within about 1 second, so it adds at most 1000-2000ms to the measured execution time. A sketch of an optimized version is shown below.
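Here is what that optimized chunking could look like (an assumption on my side, not the code I benchmarked): Array.Copy copies each range by index, so the source array is never re-enumerated from the start the way Skip/Take may do.
int chunkSize = combinations.Length / coreCount;
int[][][] chunked = new int[coreCount][][];
for (int i = 0; i < coreCount; i++)
{
    int start = i * chunkSize;
    // the last chunk absorbs any remainder
    int count = (i == coreCount - 1) ? combinations.Length - start : chunkSize;
    chunked[i] = new int[count][];
    Array.Copy(combinations, start, chunked[i], 0, count);
}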
Edit 2: I have just removed all of the chunking logic together with the MaxDegreeOfParallelism option, so the data is processed as is, in parallel. The execution time is now 94196ms, which is basically the same time as before once the chunking time is excluded. It seems .NET is smart enough to partition the data at runtime, so the extra code is unnecessary unless I want to limit the number of cores used. This did not notably increase the performance, however. I am leaning towards the "Amdahl's law" explanation, since nothing I have done has improved the performance beyond the margin of error.
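For reference, the no-chunking version boils down to something like this sketch (the processing body is elided, as in the original code):
// Let the TPL's default partitioner split the collection at runtime;
// no manual chunking and no MaxDegreeOfParallelism.
Parallel.ForEach(combinations, combination =>
{
    // additional processing logic here for single data
});
And since Amdahl's law came up: it predicts the speedup from the parallel fraction p of the workload. A quick illustration (the p value is assumed, not measured for this workload):
// Amdahl's law: S(N) = 1 / ((1 - p) + p / N), where p is the fraction
// of the work that can actually run in parallel.
double Speedup(double p, int n) => 1.0 / ((1.0 - p) + p / n);

// With p = 0.9 (an assumed value), 12 cores yield only ~5.71x, not 12x:
for (int n = 1; n <= 12; n++)
    Console.WriteLine($"{n} cores: {Speedup(0.9, n):F2}x speedup");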
