I'm running a nightly CPU-intensive Java application on an EC2 server (c1.xlarge), which has eight cores and 7.5 GB RAM (running Linux / Ubuntu 9.10 (Karmic Koala), 64-bit).
The application is architected in such a way that a variable number of workers is constructed (each in its own thread); the workers fetch messages from a queue and process them.
Throughput is the main concern here; performance is measured in processed messages per second. The application is NOT RAM-bound, and as far as I can see not I/O-bound either (although I'm no Linux expert; I'm using dstat to check I/O load, which is pretty low, and CPU wait, which is almost non-existent).
I'm seeing the following when spawning different numbers of workers (worker threads):
- 1 worker: throughput 1.3 messages / sec / worker 
- 2 workers: ~ throughput 0.8 messages / sec / worker 
- 3 workers: ~ throughput 0.5 messages / sec / worker 
- 4 workers: ~ throughput 0.05 messages / sec / worker 
I was expecting a near-linear increase in throughput, but reality proves otherwise. In absolute terms, total throughput goes from ~1.3 msg/sec with one worker to ~1.6 with two and ~1.5 with three, and then collapses to roughly 4 × 0.05 ≈ 0.2 msg/sec with four workers.
Three questions:
- What might be causing the sub-linear performance going from one worker --> two workers and two workers --> three workers? 
- What might be causing the (almost) complete halt when going from three workers to four workers? It looks like some kind of deadlock situation (can this happen due to heavy context switching?). 
- How would I start measuring where the problems occur? My development box has two CPUs and runs Windows; I normally attach a GUI profiler and check for threading issues. But the problem only really starts to manifest itself with more than two threads. 
Some more background information:
- Workers are spawned using Executors.newScheduledThreadPool. 
- A worker thread does calculations based on the message (CPU-intensive). Each worker thread contains a separate persistQueue used for offloading writing to disk (and thus making use of CPU / I/O concurrency): persistQueue = new ThreadPoolExecutor(1, 1, 100, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(maxAsyncQueueSize), new ThreadPoolExecutor.AbortPolicy()); (see the construction sketch right after this list). 
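To make that concrete, here's a simplified sketch of how the pools are constructed. It's not the real code: numWorkers, maxAsyncQueueSize and the message queue type are placeholders, and WorkerLoopSketch is sketched further down in this post.

```java
import java.util.concurrent.*;

// Simplified sketch of how the pools are constructed -- not the real application code.
// numWorkers and maxAsyncQueueSize stand in for the real configuration values.
public class PoolSetupSketch {
    public static void main(String[] args) {
        int numWorkers = 4;            // varied between 1 and 4 in the measurements above
        int maxAsyncQueueSize = 1000;  // placeholder value

        // The shared queue the workers fetch their messages from.
        BlockingQueue<String> messageQueue = new LinkedBlockingQueue<String>();

        // One pool that runs the worker threads themselves.
        ScheduledExecutorService workerPool = Executors.newScheduledThreadPool(numWorkers);

        for (int i = 0; i < numWorkers; i++) {
            // Per worker: a single-threaded, bounded executor that handles the disk writes.
            ThreadPoolExecutor persistQueue = new ThreadPoolExecutor(
                    1, 1, 100, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<Runnable>(maxAsyncQueueSize),
                    new ThreadPoolExecutor.AbortPolicy());

            // Each worker gets the shared message queue plus its own persistQueue
            // (the worker loop itself is sketched further below).
            workerPool.submit(new WorkerLoopSketch(messageQueue, persistQueue));
        }
    }
}
```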
The flow (per worker) goes like this:
1. The worker thread puts the result of a message in the persistQueue and gets on with processing the next message. 
2. The ThreadPoolExecutor (of which we have one per worker thread) contains only one thread, which processes all incoming data (waiting in the persistQueue) and writes it to disk (Berkeley DB + Apache Lucene). 
3. The idea is that 1. and 2. can run concurrently for the most part, since 1. is CPU-heavy and 2. is I/O-heavy. 
4. It's possible that the persistQueue becomes full; it is bounded on purpose, because otherwise a slow I/O system might flood the queues and cause OOM errors (yes, it's a lot of data). In that case the worker thread pauses until it can write its content to the persistQueue. A full queue hasn't happened yet on this setup (which is another reason I think the application is definitely not I/O-bound). A simplified sketch of this loop follows below. 
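Roughly, the per-worker loop looks like this. It's a simplified sketch, not the exact code: the message type, process() and writeToDisk() are placeholders for the real CPU-heavy calculation and the Berkeley DB / Lucene writes, and the sleep-and-retry is just one way to express the "pause until it can write" behaviour (AbortPolicy throws RejectedExecutionException when the queue is full).

```java
import java.util.concurrent.*;

// Simplified sketch of the per-worker loop (not the exact code; process() and
// writeToDisk() stand in for the real CPU-heavy calculation and the BDB/Lucene writes).
public class WorkerLoopSketch implements Runnable {
    private final BlockingQueue<String> messageQueue;
    private final ThreadPoolExecutor persistQueue;

    public WorkerLoopSketch(BlockingQueue<String> messageQueue, ThreadPoolExecutor persistQueue) {
        this.messageQueue = messageQueue;
        this.persistQueue = persistQueue;
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String message = messageQueue.take();    // fetch the next message
                final String result = process(message);  // 1. CPU-heavy calculation
                boolean handedOff = false;
                while (!handedOff) {
                    try {
                        // 2. hand the result to the single I/O thread and move on
                        persistQueue.execute(new Runnable() {
                            public void run() { writeToDisk(result); }
                        });
                        handedOff = true;
                    } catch (RejectedExecutionException full) {
                        // persistQueue is full (AbortPolicy): pause briefly and retry,
                        // so a slow I/O system can't flood memory and cause OOM.
                        Thread.sleep(50);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private String process(String message) { return message.toUpperCase(); } // placeholder
    private void writeToDisk(String result) { /* Berkeley DB + Lucene in the real app */ }
}
```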
Some final information:
- Workers are isolated from each other concerning their data, except: 
  - They share some heavily used static final maps (used as caches; the maps are memory-intensive, so I couldn't keep them local to a worker even if I wanted to). The operations that workers perform on these caches are iterations, lookups and contains checks (no writes, deletes, etc.). 
  - These shared maps are accessed without synchronization (no need, right?). There's a small usage sketch after this list. 
- Workers populate their local data by selecting data from MySQL (based on keys in the received message), so this is a potential bottleneck. However, it's mostly reads, the queried tables are optimized with indexes, and again it doesn't look I/O-bound. 
- I have to admit that I haven't done much MySQL server optimizing yet (in terms of config params), but I just don't think that is the problem.
 
- Output is written to: 
  - Berkeley DB (using a memcached(b) client); all workers share one server. 
  - Lucene (using a home-grown low-level indexer); each worker has a separate indexer. 
 
- The problems occur even when output writing is disabled. 
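As mentioned above, the shared caches are used roughly like this (sketch only; the key/value types and the operations shown are placeholders, and the real maps are much bigger):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of how the shared caches are used; types, names and operations are placeholders.
public class SharedCacheSketch {
    // Heavily used, memory-intensive cache shared by all workers.
    // Workers only read from it (iterate, look up, contains) -- no writes or deletes.
    private static final Map<String, String> CACHE = new HashMap<String, String>();

    static boolean isKnown(String key) {
        return CACHE.containsKey(key);           // contains check
    }

    static String lookup(String key) {
        return CACHE.get(key);                   // lookup
    }

    static int countMatching(String fragment) {
        int count = 0;
        for (Map.Entry<String, String> e : CACHE.entrySet()) {  // iteration
            if (e.getValue().contains(fragment)) {
                count++;
            }
        }
        return count;
    }
}
```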
This is a huge post, I realize that, but I hope you can give me some pointers as to what this might be, or how to start monitoring / deducing where the problem lies.