High %costop values - no CPU contention - Poor performance
The problem presented itself as general performance problems in a regional office. One of the application servers in particular was performing very poorly, with frequent application timeouts; Exchange was going offline and VMs were becoming orphaned in vSphere. I logged in to a Windows server and saw that the CPU was operating at 100%, yet in the vSphere performance view the server was consuming only approximately 300 MHz. This behaviour was repeated on all other servers in the cluster.
In the DRS view I could see that the servers were receiving approximately 10% of their entitled resources, and there were no limits or reservations set on any of the VMs. On examining the CPU counters in ESXTOP I found that all of the servers had extremely high %costop values (~80% - 90%). This would normally be indicative of overcommitted CPU resources on SMP VMs, as ESX throttles individual vCPUs to prevent skew when some vCPUs make progress and others cannot because the physical CPUs are busy running other VMs. In our case this could not have been the cause, as we had more physical CPUs than vCPUs.
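When many VMs are involved, it can be easier to capture esxtop in batch mode (`esxtop -b`) and scan the resulting CSV than to watch the interactive view. The following is only a rough sketch: the column-label fragment `% CoStop`, the sample header names and the sample values are assumptions made for illustration, so check them against the header of a real capture before relying on it.

```python
import csv
import io

def max_costop(csv_text, label="% CoStop"):
    """Return {column name: highest sampled value} for every column
    whose header contains `label` (an assumed label fragment)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = [i for i, name in enumerate(header) if label in name]
    peaks = {header[i]: 0.0 for i in cols}
    for row in reader:
        for i in cols:
            try:
                peaks[header[i]] = max(peaks[header[i]], float(row[i]))
            except (ValueError, IndexError):
                # Skip blank or truncated samples.
                continue
    return peaks

def flag_high_costop(peaks, threshold):
    """Keep only the columns whose peak exceeds `threshold` (percent)."""
    return {name: peak for name, peak in peaks.items() if peak > threshold}

# Hypothetical two-sample capture; real esxtop headers are longer.
SAMPLE = "\n".join([
    '"Time","Group Cpu(1:app01)\\% CoStop","Group Cpu(1:app01)\\% Used"',
    '"t1","82.5","12.0"',
    '"t2","90.1","11.4"',
])

if __name__ == "__main__":
    for name, peak in flag_high_costop(max_costop(SAMPLE), 50.0).items():
        print(name, peak)
```

The threshold is passed in by the caller because what counts as "high" depends on the environment; values in the 80% - 90% range seen here would be flagged by almost any setting.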
During the troubleshooting I noticed that we periodically had huge latencies on the storage system, sometimes spiking to 6 seconds. The first strange thing was that the latencies were within acceptable limits until the IOPS rose above 600. The second strange point was that the combined total of CIFS and NFS IOPS was rarely sustained above 400 IOPS.
This had me stumped until I discovered that the storage array had been populated with 7.2K SATA disks instead of the 15K disks I expected. Immediately I saw why we weren't seeing at least double the number of IOPS before latencies ramped up: with 8 usable disks we should see 600 IOPS, instead of the 1200 we expected.
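The arithmetic behind those two figures can be sketched as below. The per-disk values (75 and 150 IOPS) are not vendor specifications; they are simply what the totals in this post imply (600 / 8 and 1200 / 8), and the function names are made up for the example.

```python
# Rule-of-thumb random IOPS per spindle, inferred from the totals in
# this post. Treat these as assumptions, not measured or vendor values.
IOPS_PER_DISK = {"7.2K SATA": 75, "15K": 150}

def estimated_iops(disk_type, usable_disks):
    """Rough ceiling before latency is expected to ramp up."""
    return IOPS_PER_DISK[disk_type] * usable_disks

def headroom(disk_type, usable_disks, observed_iops):
    """IOPS left over once the observed front-end load is subtracted."""
    return estimated_iops(disk_type, usable_disks) - observed_iops

if __name__ == "__main__":
    print(estimated_iops("7.2K SATA", 8))  # 600
    print(estimated_iops("15K", 8))        # 1200
    # With roughly 400 IOPS of CIFS and NFS traffic on the SATA disks:
    print(headroom("7.2K SATA", 8, 400))   # 200
```

On these assumptions the SATA array had only about 200 IOPS of headroom above the normal CIFS and NFS load, so any additional background work on the array could push it past the point where latency climbs.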
The second question was where the mysterious extra IOPS were coming from. After more investigation we found that the NetApp 2020 has hidden aggregate-level snapshots which were tipping us well over the 600 IOPS threshold. These hidden jobs were set to run every 3 hours; we rescheduled them to run outside of production hours.
The high costop value can be attributed to the fact that the vCPU has to wait for IO completion, and as IO completion was taking an extended period of time, ESX was co-stopping the vCPUs, leading to extremely poor performance.