We have HAProxy 1.3.26 hosted on a CentOS 5.9 machine with a 2.13 GHz Intel Xeon processor, acting as an HTTP and TCP load balancer for numerous services and serving a peak throughput of ~2000 requests/second. It has been running fine for 2 years, but both traffic and the number of services are gradually increasing.
Of late we've observed that even after a reload, the old HAProxy process remains. On further investigation we found that the old process has numerous connections in the TIME_WAIT state, and that netstat and lsof were taking a very long time to complete. Following http://agiletesting.blogspot.in/2013/07/the-mystery-of-stale-haproxy-processes.html we introduced option forceclose, but it interfered with various monitoring services, so we reverted it. Digging further, we realised from /proc/net/sockstat that close to 200K sockets are in the tw (TIME_WAIT) state, which is surprising because in /etc/haproxy/haproxy.cfg maxconn is set to 31000 and ulimit-n to 64000. We had timeout server and timeout client at 300s and reduced them to 30s, but that didn't help much.
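For context, this is roughly the relevant part of our /etc/haproxy/haproxy.cfg (heavily abridged; listen/backend definitions omitted). Treat it as a sketch of the settings mentioned above rather than the exact file:

    global
        maxconn  31000
        ulimit-n 64000

    defaults
        mode http
        # previously 300s, reduced to 30s with little effect
        timeout client 30s
        timeout server 30s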
Now the questions are:
- Is such a high number of TIME_WAIT sockets acceptable? If yes, at what number should we start worrying? Judging by "What is the cost of many TIME_WAIT on the server side?" and "Setting TIME_WAIT TCP", it seems there shouldn't be any issue.
- How can we decrease these TIME_WAITs?
- Are there any alternatives to netstat and lsof that will perform well even with a very high number of TIME_WAIT sockets? (See the commands sketched after this list.)
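For reference, these are roughly the commands we have been running and the cheaper alternatives we are looking at. The ss invocations are what we expect to work with the iproute package shipped on CentOS 5, but we have not validated them under full load, and <old_haproxy_pid> is just a placeholder:

    # what we run today -- very slow with ~200K sockets,
    # since both tools walk per-socket entries under /proc
    netstat -anp | grep -c TIME_WAIT
    lsof -nP -p <old_haproxy_pid>

    # kernel-level counters, cheap to read regardless of socket count;
    # the "tw" field is the TIME_WAIT total
    cat /proc/net/sockstat

    # ss queries socket data over netlink instead of walking /proc,
    # so it should stay fast even with hundreds of thousands of sockets
    ss -s                            # summary, includes a timewait count
    ss -tan state time-wait | wc -l  # TIME_WAIT sockets (plus one header line)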