I recently upgraded my server from Debian squeeze i386 to wheezy amd64 by reinstalling and reconfiguring it. I also wanted to be able to run virtual guests, so I installed XEN as well.
Since then, the OOM killer has from time to time killed multiple processes on my Dom0. I restarted and disabled several services (apache2, mysql, postgresql, ...). Now it seems that no processes are killed anymore (I am unsure, as it does not happen regularly but at random). BUT: if I put high load on the machine (access to an encrypted filesystem), the OOM killer is activated.
Unfortunately the system is no longer usable once the problem has occurred, so I cannot log in via ssh to investigate. Investigating at the physical console also hangs most of the time.
I have an atop daemon sampling every minute, so I can see the memory and swap consumption before the crash:
The RAM is 1 GB (880 MB) in total (statically allocated to Dom0, no ballooning), of which approx. 440 MB are cache. A few MB are buffers and around 20 MB are free. The swap is 25 GiB in total and completely free.
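Since atop only samples once a minute, the last snapshot may be well before the actual OOM event. A minimal sketch of a finer-grained logger, assuming the standard `/proc/meminfo` layout with values in kB; the function names `parse_meminfo`, `summarize` and `snapshot` are made up for illustration:

```python
# Fields of interest; all are standard /proc/meminfo keys reported in kB.
FIELDS = ("MemTotal", "MemFree", "Buffers", "Cached", "SwapTotal", "SwapFree")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Format the selected fields as one log line, converted from kB to MiB."""
    return " ".join(
        "%s=%dMiB" % (name, info.get(name, 0) // 1024) for name in FIELDS
    )


def snapshot(path="/proc/meminfo"):
    """Read the given meminfo file and return a one-line summary."""
    with open(path) as f:
        return summarize(parse_meminfo(f.read()))
```

Calling `snapshot()` every few seconds and appending the result to a file that is flushed after each write should leave a record much closer to the moment the machine becomes unresponsive.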
What I do not understand: why does the kernel not drop some of the cache if more RAM is needed? It is cache, so the worst that could happen is a performance problem, but the system would remain stable. Instead, the system crashes. Also, why are unneeded memory sections used by other programs not moved to swap? There should be more than enough space for that.
I sometimes saw a message on the console that a task (jbod/raid5 or something similar) was blocked (?) for more than 120 secs. I am not sure whether this is the cause or a consequence of the OOM problem.
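One way to tell cause from consequence would be to check which message appears first in the kernel log after a reboot. A rough sketch, assuming the usual kernel wording ("blocked for more than N seconds", "invoked oom-killer", "Out of memory: Kill process"); the exact wording can differ between kernel versions, and the helper names are hypothetical:

```python
import re

# Assumed message formats; adjust the patterns if the kernel words them differently.
OOM_RE = re.compile(r"invoked oom-killer|Out of memory: Kill process")
HUNG_RE = re.compile(r"task (\S+):\d+ blocked for more than (\d+) seconds")


def classify(lines):
    """Return (line_number, kind, task) for each hung-task or OOM message."""
    events = []
    for number, line in enumerate(lines, start=1):
        hung = HUNG_RE.search(line)
        if hung:
            events.append((number, "hung", hung.group(1)))
        elif OOM_RE.search(line):
            events.append((number, "oom", ""))
    return events


def first_event(lines):
    """Return "hung" or "oom" for the earliest event, or None if there is none."""
    events = classify(lines)
    return events[0][1] if events else None
```

Feeding it the lines of the kernel log from the boot that crashed (wherever syslog stores it on the system) shows whether the hung-task warning came before or after the first OOM kill, provided the messages reached the disk at all.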
Now my questions are:
- Could it be a XEN issue?
- Could it be a hardware issue? RAM or HD?
- What can I do to avoid future crashes?
Edit: I just tried to reproduce the error. It did crash, but this time (I do not know exactly whether there were other errors in other situations) the program that hung was xenwatch, so not a program accessing the HD.