Multithreaded code uses only one CPU thread while the number of threads should be more

Question

I am trying to implement multithreading to a CPU intensive simulation that normally takes around 10hrs to run. I wrote a dummy code which works fine for multithreading and uses the specified number of CPU threads. But when I implement the same thing in my main program, it uses only one thread. What could be the reason for this?

My code: photongen()

void photongen() {

    // Iterating over Muon list
    for (auto m = muonList.begin(); m != muonList.end(); ++m) {
        
        nP = 0; // Initialise the number of photons to 0

        vector<Photon> tempP;   // Temporary vector to store generated photons
        
        // Muon propagation
        bool stopLoop = false;
        while (!stopLoop) {
            
            // Photon generation
            if (nPhotons()) {
                for (int i = 0; i < nPhotons(); i++){

                    // Create photon instance
                    Photon *P = new Photon(m->x, m->y, m->z, m->run, m->event, i);
                    tempP.push_back(*P);

                    delete P;   // Free memory by photon object
                }
            }

            // Check if muon has left the detector
            if (m->x > S_LENGTH/2. - PATH_STEP*sin(m->theta)*cos(m->phi) || m->x < -S_LENGTH/2. - PATH_STEP*sin(m->theta)*cos(m->phi)) {
                stopLoop = true;
            } else if (m->y > S_WIDTH/2. - PATH_STEP*sin(m->theta)*sin(m->phi) || m->y < -S_WIDTH/2. - PATH_STEP*sin(m->theta)*sin(m->phi)) {
                stopLoop = true;
            } else if (m->z > S_HEIGHT - PATH_STEP*cos(m->theta)) {
                stopLoop = true;
            } else {
                // Update Muon path
                m->x += PATH_STEP*sin(m->theta)*cos(m->phi);
                m->y += PATH_STEP*sin(m->theta)*sin(m->phi);
                m->z += PATH_STEP*cos(m->theta);
            }
        }

        multiprocess(8, tempP, &propagation);
        tempP.clear();
    }
}

propagation()

void propagation(Photon *P) {
    if (photonProp(P)) {
        pLock.lock();
        photonList.push_back(*P);
        nP++;
        pLock.unlock();
    }
}

photonProp()

bool photonProp(Photon*& P) {

    float dist[3];
    for (int i = 0; i < MAX_REFL; i++) {
        
        // Precalculation for optimisation
        float theta = P->theta;
        float phi = P->phi;

        float sintheta = sin(theta);
        float costheta = cos(theta);
        float sinphi = sin(phi);
        float cosphi = cos(phi);

        float P_x = P->x;
        float P_y = P->y;
        float P_z = P->z;

        // Find plane of intersection of particle
        dist[0] = (sintheta*cosphi > 0) ? (S_LENGTH/2. - P_x)/(sintheta*cosphi) : (-S_LENGTH/2. - P_x)/(sintheta*cosphi);
        dist[1] = (sintheta*sinphi > 0) ? (S_WIDTH/2. - P_y)/(sintheta*sinphi) : (-S_WIDTH/2. - P_y)/(sintheta*sinphi);
        dist[2] = (costheta > 0) ? (S_HEIGHT - P_z)/costheta : -P_z/costheta;

        // Calculate the plane of intersection
        int min_index = distance(dist,min_element(dist, dist + 3));
        float d = dist[min_index]; // distance from original point to plane

        // Check for attenuation
        float lambda_sc = 100.;
        float x_surv = -lambda_sc*log(r->Uniform());
        if (d > x_surv) return false;
        
        // Update particle coords
        P->x = P_x + d*sintheta*cosphi;
        P->y = P_y + d*sintheta*sinphi;
        P->z = P_z + d*costheta;

        // check if particle hit photodetector
        if (min_index  == 0 && sintheta*cosphi > 0 && P->y <= -S_WIDTH/2. + 1) {
            return true;
        }

        // Check for reflective loss
        if (r->Uniform() < 0.05) return false;

        // reflect particle
        if (min_index  == 0) {
            P->phi = M_PI - phi;
        } else if (min_index == 1) {
            P->phi = -phi;
        } else if (min_index == 2) {
            P->theta = -theta;
        }
    }
    
    return false;
}

multiprocess()

template <typename T>
void multiprocess(int nThreads, vector<T> arg, void (*func)(T*)) {

    size_t i;

    for (i=0; i < arg.size() - nThreads; i += nThreads) {
        
        vector<thread> t(nThreads);

        for (int j = 0; j < nThreads; j++){
            t[j] = thread(func, &arg[i+j]);
        }

        for (int j = 0; j < nThreads; j++){
            t[j].join();
        }
    }

    vector<thread> t(arg.size()%nThreads);

    for (std::size_t j = 0; j < arg.size()%nThreads; j++){ 
        t[j] = thread(func, &arg[i+j]);        
    }

    for (std::size_t j = 0; j < arg.size()%nThreads; j++){
        t[j].join();
    } 
}

If I am reading this right, it looks like `photonProp` works on just one "photon". Given that it takes time to setup a thread, I doubt you'll see much improvement constantly creating/joining threads. Instead, consider breaking the data up into 8 chunks (assuming 8 cores) and create a thread to process each chunk. — 001, Dec 02 '21 at 19:50
nPhotons() returns an integer around 100 or so. The average size of tempP is around 20000. photonProp works only on one photon, yes — RISHABH MEHTA, Dec 02 '21 at 19:53
nPhotons() returns a different value everytime, as it has a random number component — RISHABH MEHTA, Dec 02 '21 at 19:57
@500-InternalServerError ok, I will look into that! But will the problem go away if I implement it that way? — RISHABH MEHTA, Dec 02 '21 at 19:59
The function `void multiprocess(int nThreads, vector arg, void (*func)(T*)) ` looks entirely broken. To be fair, so does most of this code though. — EOF, Dec 02 '21 at 20:08
I haven't looked carefully, but my impression is that the code as shown doesn't make good use of threads. The task run by each thread is small, just some calculations on a single photon, then a lock to update some shared information. Each thread should do much more work. I'd give each thread a larger bunch of photons to work on, and give each thread a separate place to store its results. Then, when the threads have finished, integrate all of their results. — Pete Becker, Dec 02 '21 at 20:08
There is no guarantee that multiple threads will be running on multiple cores or dedicated cores. Worst case, your threads are treated as tasks and queued up to run on a single core. — Thomas Matthews, Dec 02 '21 at 20:14
If you are looking at running multiple processes (same process) consider using GPU cores instead. They can run in parallel and the CPU doesn't use them much. You'll probably get more efficiency out of GPU cores than creating multiple threads. — Thomas Matthews, Dec 02 '21 at 20:17
Also consider designing your program in a pipeline execution format. This will help your processor perform operations in parallel. The process can queue up instructions either in parallel or using multiple instruction pipelines. Also review your data structures for processor data cache efficiency. You'll get more performance using these techniques in a single thread rather then using multiple threads. — Thomas Matthews, Dec 02 '21 at 20:22
Not harmful, but `Photon *P = new Photon(m->x, m->y, m->z, m->run, m->event, i); tempP.push_back(*P); delete P;` is ... silly. There is no need for a dynamic allocation here. You could write `Photon P(m->x, m->y, m->z, m->run, m->event, i); tempP.push_back(P);` or with a modern compiler `tempP.emplace_back(m->x, m->y, m->z, m->run, m->event, i);`.In general, [only use `new` when forced to](https://stackoverflow.com/questions/6500313/why-should-c-programmers-minimize-use-of-new). — user4581301, Dec 02 '21 at 20:30
@ThomasMatthews thanks a lot! I will definitely look them up and try to improve the code. — RISHABH MEHTA, Dec 02 '21 at 20:38
@user4581301 thanks for the tip! I am fairly new to cpp and am unaware of optimal practices, so this helps! — RISHABH MEHTA, Dec 02 '21 at 20:39

Multithreaded code uses only one CPU thread while the number of threads should be more

0 Answers0