I have a very large array of ~30M objects, approximately 80 bytes apiece (~2.2 GiB for those following along), stored on disk. The actual size of each object varies slightly because each one has a QMap<quint32, QVariant> child.
Unpacking those objects from raw data is expensive, so I've implemented a multithreaded read operation: one thread pulls a few MB from disk sequentially, then hands each raw data block to a worker thread to be unpacked in parallel via QtConcurrent. My objects are created (via new) on the heap inside the worker threads and passed back to the main thread for the next step. Upon completion, the objects are deleted on the main thread.
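For reference, the IO side of the real code is shaped roughly like this. This is just a minimal sketch using the MyObject type from the sample further down; data.bin, blockSize, and unpackBlock are illustrative placeholders rather than my actual names, and the real code streams blocks rather than reading them all up front, but the shape is the same:
QFile file("data.bin"); //placeholder path
file.open(QIODevice::ReadOnly);
const qint64 blockSize = 4 * 1024 * 1024; //a few MB per sequential read
QList<QByteArray> blocks;
while(!file.atEnd())
    blocks.append(file.read(blockSize)); //sequential reads, one thread
/* each raw block is unpacked on a pool thread; the workers new the objects */
std::function<QList<MyObject*>(const QByteArray&)> unpackBlock =
    [](const QByteArray& raw) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    /* parse 'raw' and append new MyObject()s */
    return objs;
};
QFuture<QList<MyObject*>> unpacked = QtConcurrent::mapped(blocks, unpackBlock);
unpacked.waitForFinished(); //per-block results are then consumed on the main thread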
In a single-threaded run this deallocation is relatively fast (~4-5 seconds). However, when the unpacking is multithreaded across 4 threads, the same deallocation is incredibly slow (~26-36 seconds). Profiling with Very Sleepy indicates that the time is spent in MSVCR100 free, so it's the deallocation itself that is slow.
Searching around SO suggests that allocating and deallocating on different threads is safe. What is the source of the slowdown, and what can I do about it?
Edit: Here is some sample code communicating the idea of what's going on. For the sake of troubleshooting I have completely removed the disk IO from this example: it simply creates the objects and then deletes them.
#include <functional>
#include <QtConcurrent>
#include <QElapsedTimer>
#include <QList>
#include <QMap>
#include <QThreadPool>
#include <QVariant>

class MyObject
{
public:
    MyObject() { /* set defaults... irrelevant here */ }
    ~MyObject() {}

    QMap<quint32, QVariant> map;
    //...other members
};
//...
QList<MyObject*> results;
/* set up the mapped lambda functor (QtConcurrent requires std::function when returning a type) */
std::function<QList<MyObject*>(quint64 chunksize)>
    importMap = [](quint64 chunksize) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    for(quint64 i = 0; i < chunksize; ++i)
    {
        MyObject* obj = new MyObject();
        obj->map.insert(0, 1); //ran with and without the map insertions
        obj->map.insert(1, 2);
        objs.append(obj);
    }
    return objs;
}; //end import map lambda
/* set up the reduce lambda functor */
auto importReduce = [&results](bool& /*noreturn*/, const QList<MyObject*>& chunkimported)
{
    results.append(chunkimported);
}; //end import reduce lambda
/* chunk up the data for import */
quint64 totalcount = 31833986;
quint64 chunksize = 500000;
QList<quint64> chunklist;
while(totalcount >= chunksize)
{
    totalcount -= chunksize;
    chunklist.append(chunksize);
}
if(totalcount > 0)
    chunklist.append(totalcount);
/* create the objects concurrently */
QThreadPool::globalInstance()->setMaxThreadCount(1); //4 for the multithreaded run
QElapsedTimer tnew; tnew.start();
QtConcurrent::mappedReduced<bool>(chunklist, importMap, importReduce,
                                  QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce)
    .waitForFinished(); //mappedReduced is asynchronous; block until every chunk is mapped and reduced
qDebug("DONE NEW %f", double(tnew.elapsed())/1000.0);
//do stuff with the objects here
/* delete the objects */
QElapsedTimer tdelete; tdelete.start();
qDeleteAll(results);
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);
Here are the results with and without inserting data into MyObject::map, and with 1 or 4 threads available to QtConcurrent:
- 1 Thread: tnew = 2.7 seconds; tdelete = 1.1 seconds
- 4 Threads: tnew = 1.8 seconds; tdelete = 2.7 seconds
- 1 Thread + QMap: tnew = 8.6 seconds; tdelete = 4.6 seconds
- 4 Threads + QMap: tnew = 4.0 seconds; tdelete = 48.1 seconds
In both scenarios it takes significantly longer to delete the objects when they were created in parallel on 4 threads than when they were created serially on 1 thread, and inserting into the QMap in parallel exacerbates the effect dramatically.
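If it helps frame the question: one experiment I plan to try is deleting the chunks in parallel as well, to see whether the slowdown follows the free() calls or the main thread specifically. A minimal, untested sketch, assuming importReduce is changed to keep the per-chunk lists (chunkedResults is an illustrative name, not from my real code):
QList<QList<MyObject*>> chunkedResults; //filled by importReduce instead of flattening into 'results'
QElapsedTimer tpar; tpar.start();
QtConcurrent::map(chunkedResults,
                  [](QList<MyObject*>& chunk) { qDeleteAll(chunk); chunk.clear(); }) //free each chunk on a pool thread
    .waitForFinished();
qDebug("DONE PARALLEL DELETE %f", double(tpar.elapsed())/1000.0);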