I had a piece of code that looked like this:
for(i=0;i<NumberOfSteps;i++)
{
    for(k=0;k<NumOfNodes;k++)
    {
        mark[crawler[k]]++;
        r = rand() % node_info[crawler[k]].num_of_nodes;
        crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
    }
}
I changed it so that the load can be split among multiple threads. Now it looks like this:
for(i=0;i<NumberOfSteps;i++)
{
    for(k=0;k<NumOfNodes;k++)
    {
        pthread_mutex_lock( &mutex1 );
        mark[crawler[k]]++;
        pthread_mutex_unlock( &mutex1 );
        pthread_mutex_lock( &mutex1 );
        r = rand() % node_info[crawler[k]].num_of_nodes;
        pthread_mutex_unlock( &mutex1 );
        pthread_mutex_lock( &mutex1 );
        crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
        pthread_mutex_unlock( &mutex1 );
    }
}
I need the mutexes to protect the shared variables. It turns out that my parallel code is slower than the serial version. Why? Is it because of the mutexes?
Could this have something to do with the cache line size?
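For comparison, here is a minimal sketch of a variant that would not need the mutex at all. It rests on assumptions that are not in my code above: each thread owns a disjoint slice of crawler, each thread counts visits in its own private array that is merged after the join, and a small per-thread xorshift generator stands in for rand(), whose hidden state is shared. The names node_t, job_t, next_random, walk, parallel_walk and NUM_THREADS are hypothetical, and the flat neighbours array is a simplified stand-in for DataBlock and node_info.

```c
#include <pthread.h>
#include <stdlib.h>

#define NUM_THREADS 4            /* hypothetical thread count */

/* Hypothetical stand-in for node_info: the neighbours of a node are
   neighbours[first] .. neighbours[first + degree - 1]. */
typedef struct { int first; int degree; } node_t;

typedef struct {
    const node_t *nodes;         /* read-only, shared */
    const int *neighbours;       /* read-only, shared */
    int *crawler;                /* shared array, but only [begin, end) is touched */
    int begin, end;
    int steps;
    unsigned int seed;           /* private generator state, must be non-zero */
    long *local_mark;            /* private visit counters */
} job_t;

/* Small xorshift generator used in place of rand(): all state is in *s. */
static unsigned int next_random(unsigned int *s)
{
    unsigned int x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

/* Thread body: the same inner loop as above, with no locking because
   nothing written here is written by any other thread. */
static void *walk(void *arg)
{
    job_t *j = arg;
    for (int i = 0; i < j->steps; i++) {
        for (int k = j->begin; k < j->end; k++) {
            int cur = j->crawler[k];
            j->local_mark[cur]++;
            int r = (int)(next_random(&j->seed) % (unsigned int)j->nodes[cur].degree);
            j->crawler[k] = j->neighbours[j->nodes[cur].first + r];
        }
    }
    return NULL;
}

/* Splits the crawlers across NUM_THREADS threads and adds the visit
   counts into mark. Returns 0 on success, -1 if a thread or its
   counter array could not be created. */
int parallel_walk(const node_t *nodes, const int *neighbours, int num_nodes,
                  int *crawler, int num_crawlers, int steps, long *mark)
{
    pthread_t tid[NUM_THREADS];
    job_t job[NUM_THREADS];
    int started = 0, rc = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        job[t].nodes = nodes;
        job[t].neighbours = neighbours;
        job[t].crawler = crawler;
        job[t].begin = t * num_crawlers / NUM_THREADS;
        job[t].end = (t + 1) * num_crawlers / NUM_THREADS;
        job[t].steps = steps;
        job[t].seed = 1234u + (unsigned int)t;
        job[t].local_mark = calloc((size_t)num_nodes, sizeof(long));
        if (!job[t].local_mark ||
            pthread_create(&tid[t], NULL, walk, &job[t]) != 0) {
            free(job[t].local_mark);
            rc = -1;
            break;
        }
        started++;
    }
    /* Merge the private counters once each thread has finished. */
    for (int t = 0; t < started; t++) {
        pthread_join(tid[t], NULL);
        for (int n = 0; n < num_nodes; n++)
            mark[n] += job[t].local_mark[n];
        free(job[t].local_mark);
    }
    return rc;
}
```

In this sketch the only synchronization is pthread_join. In my mutex version every statement of the inner loop is inside a critical section on the same mutex1, so the threads can only execute one at a time and pay for locking on top of that. Neighbouring slices of crawler can still share a cache line where they meet, so I cannot rule out some cache line effect even here.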