We have one RHEL 7 system, out of three supposedly identical systems, which regularly (every five days to a week) gets into some weird resource-starvation scenario. The main symptom is a severe delay in launching new processes (and threads?). I've looked into a large number of things; in short, the system looks entirely idle, with memory and CPU readily available, and while many processes may show up in a "D" state, /proc/stat shows procs_blocked 0. There does not appear to be heavy I/O load either.
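For what it's worth, when a task is stuck in "D" state, its kernel stack usually names the lock, filesystem, or driver it's waiting in. A quick sketch of the check I'd run (reading /proc/PID/stack needs root; output is empty if nothing is in D state at that instant):

```shell
# List tasks currently in uninterruptible sleep ("D" state) and dump each
# one's kernel stack; the top frames usually name the filesystem, lock, or
# driver the task is blocked in. Reading /proc/<pid>/stack requires root.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "== PID $pid ($(cat /proc/$pid/comm 2>/dev/null)) =="
    cat /proc/$pid/stack 2>/dev/null
done
echo "D-state scan done"
```

Since D state is momentary for normal I/O, running this in a loop during the stall gives a better picture than a single sample.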
Here's one odd manifestation of the problem I'm focusing on: on the last two of these occasions I've run the following and reliably gotten the exact same result (within 0.01 s):
$ time sleep 0
real 2m21.006s
user 0m0.000s
sys 0m0.007s
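Note that `time` here measures wall clock for the whole fork/exec/run cycle, not just sleep's own work. A rough probe I've been using to see which phase carries the stall (assuming bash and GNU date with `%N`; the names and layout are just illustrative):

```shell
#!/bin/bash
# Rough probe: a subshell running only the ':' builtin forks but never
# execs; /bin/true forks and then execs a real binary. Comparing the two
# separates fork cost from exec/ELF-loading cost. Uses GNU date's %N
# (nanoseconds); on a healthy box both numbers should be a few ms.
t0=$(date +%s%N); ( : );      t1=$(date +%s%N)
t2=$(date +%s%N); /bin/true;  t3=$(date +%s%N)
echo "fork-only:  $(( (t1 - t0) / 1000000 )) ms"
echo "fork+exec:  $(( (t3 - t2) / 1000000 )) ms"
```

If fork-only is already slow, the problem is upstream of the binary being launched, which would fit the "any new process" symptom.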
That is, an oddly specific 141-second delay to run sleep 0. Perhaps notably, I also did this:
$ strace -c sleep 0
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
59.00 0.001606 535 3 open
19.32 0.000526 131 4 mprotect
8.89 0.000242 242 1 1 access
6.58 0.000179 59 3 fstat
2.87 0.000078 9 8 mmap
0.96 0.000026 5 5 close
0.62 0.000017 17 1 munmap
0.55 0.000015 3 4 brk
0.40 0.000011 11 1 execve
0.37 0.000010 10 1 nanosleep
0.33 0.000009 9 1 read
0.11 0.000003 3 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.002722 33 1 total
So those two don't really agree, which is puzzling. I expected to see some syscall eating up all the time, but (like many metrics in this situation) everything looks hunky-dory.
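One caveat worth keeping in mind: `strace -c` only sums time spent *inside* syscalls, so a stall that happens between syscalls (a major page fault, a kernel lock, the ptrace attach itself) is invisible to it. Capturing absolute timestamps and scanning for gaps between consecutive lines should show where the wall-clock time actually goes. The two-line trace below is fabricated sample data just to demonstrate the scan; against the real system the capture step would be `strace -ttt -T -o sleep.trace sleep 0`:

```shell
# strace -c only totals time *inside* syscalls; wall-clock stalls between
# syscalls never show up in its table. With absolute timestamps (-ttt) and
# per-call durations (-T), gaps between consecutive lines expose the stall.
# The two-line trace below is fabricated sample data for illustration.
printf '1000.000100 execve("/usr/bin/sleep", ...) = 0\n1047.000200 brk(NULL) = 0x1a2b000\n' > sleep.trace
awk '{ if (prev != "") { d = $1 - prev; if (d > 1) printf "gap of %.3f s before: %s\n", d, $0 } prev = $1 }' sleep.trace
# → gap of 47.000 s before: 1047.000200 brk(NULL) = 0x1a2b000
```

A gap like that pins the stall to a point in the trace even though no single syscall is slow.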
Obviously there are a million variables (e.g., this is a VM, and I've looked at CPU ready time, etc.), and I've tried to investigate the ones that made sense to me, but that repeatable, whole-second delay sticks out at me.
Any idea what would cause a 141-second pause in launching such a basic process? And, on top of that, why it doesn't show up as such in an strace?
UPDATE
Okay, two new pieces of information. We shut down that system and moved one of its virtual disks, with 15 TB of data, to the newest server so we can copy everything off. The old disk is ext4 and the new disk is XFS, so we're mounting and copying rather than just remounting the VMDK permanently.
NOW THE NEW SYSTEM IS DOING EXACTLY THE SAME THING.
So the disk we moved over is likely a factor. However, the configs of these systems are supposed to be almost identical, so the problem could just be manifesting later on this one. But, yeah, that disk is probably a factor.
Here's the other bit:
I tried interrupting the time sleep 0 command: typing it, hitting Enter, and then immediately hitting Ctrl+C. Guess what?...
[matthew.williamson@nwcal-cee-ti04 ~]$ time sleep 0
^C
real 0m47.003s
user 0m0.000s
sys 0m0.004s
47 seconds... reliably. And if you hadn't already guessed: 47 × 3 = 141 seconds. So time sleep 0 is doing something three times, and each of those somethings takes 47 seconds to return. What might that be??? I guess I'll look at the strace again.
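One more angle on the "three times" idea: the earlier strace attached only to sleep itself, after the shell had already forked, so anything stalling in the shell's own fork/exec path (or in the ptrace attach) was never recorded. Tracing the parent with -f and narrowing to process-lifecycle syscalls should capture every phase with a timestamp (flags as on RHEL 7's strace; the trace filename is arbitrary):

```shell
# The earlier strace attached only to sleep, after the shell's fork; any
# stall in the shell's own fork/exec path (or in ptrace attach) was never
# recorded. Tracing the parent with -f and restricting to fork/exec/exit
# syscalls (-e trace=process) timestamps every phase of the launch.
strace -f -ttt -e trace=process -o forkexec.trace sh -c 'sleep 0'
grep -c 'execve' forkexec.trace
```

Counting the fork/clone/execve lines against the three 47-second chunks might show which of them repeats three times.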
Oh, and that disk is not the root partition, BTW, so neither time nor sleep should touch it. Which makes this even weirder: if it is related, it must be gumming up something real good in the kernel.