Wednesday, July 25, 2012

High %costop values - no CPU contention - Poor performance



This presented itself as a general performance problem in a regional office. One of the application servers in particular was performing very poorly, with frequent application timeouts, Exchange was going offline, and VMs were becoming orphaned in vSphere. I logged in to a Windows server and saw that the CPU was running at 100%, yet in the vSphere performance view the server was consuming only approximately 300 MHz. This behaviour was repeated on all the other servers in the cluster.

In the DRS view I could see that the servers were receiving approximately 10% of their entitled resources, even though there were no limits or reservations set on any of the VMs. Examining the CPU counters in esxtop, I found that all of the servers had extremely high %costop values (~80-90%). This would normally indicate over-committed CPU resources on SMP VMs, as ESX co-stops individual vCPUs to prevent skew when some vCPUs make progress while others cannot because the physical CPUs are busy running other VMs. In our case that could not have been the cause, as we had more physical CPUs than vCPUs.
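
To make the co-stop mechanism a little more concrete, here is a rough Python sketch of the idea. The quantum and skew threshold are made-up numbers for illustration, not real ESX scheduler settings: the point is just that the scheduler tracks how far one vCPU of an SMP VM has run ahead of its sibling and holds the leader back once the skew gets too large, and that held-back time is what esxtop reports as %costop.

```python
# Toy model of co-stop for a 2-vCPU SMP VM. The quantum and skew
# threshold are illustrative values only, not real ESX scheduler settings.

QUANTUM_MS = 1        # one scheduling slice in this toy model
SKEW_LIMIT_MS = 3     # hold the leading vCPU back once skew exceeds this

def simulate(schedule):
    """schedule is a list of (vcpu0_ran, vcpu1_ran) booleans per quantum."""
    skew_ms = 0         # how far vCPU0 has run ahead of vCPU1
    costop_ms = 0       # time the leading vCPU spent held back
    for ran0, ran1 in schedule:
        if ran0 and not ran1:
            skew_ms += QUANTUM_MS
        elif ran1 and not ran0:
            skew_ms -= QUANTUM_MS
        if abs(skew_ms) > SKEW_LIMIT_MS:
            # The vCPU that ran ahead is co-stopped until the lagging
            # sibling catches up - this is the time esxtop shows as %costop.
            costop_ms += QUANTUM_MS
            skew_ms = 0
    return costop_ms

# One vCPU keeps running while the other is stuck waiting (e.g. on slow I/O):
print(simulate([(True, False)] * 20))   # prints the co-stopped time in ms
```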

During the troubleshooting I noticed that we periodically had huge latencies on the storage system, sometimes spiking to 6 seconds. The first strange thing was that latencies stayed within acceptable limits until IOPS rose above 600; the second was that the combined total of CIFS and NFS IOPS was rarely sustained above 400.
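
Correlating the latency spikes against total IOPS is what made both of those points stand out. A minimal sketch of that check, assuming the counters have been exported to a CSV (the file name, column names and the 50 ms latency threshold are assumptions for the example, not values from the array):

```python
import csv

# Hypothetical export of array/esxtop counters to CSV - the file name,
# column names and the 50 ms latency threshold below are assumptions
# made for this sketch, not values taken from the array.
THRESHOLD_LATENCY_MS = 50

with open("array_perf.csv") as f:
    for row in csv.DictReader(f):
        iops = float(row["total_iops"])
        latency_ms = float(row["latency_ms"])
        if latency_ms > THRESHOLD_LATENCY_MS:
            # In our case every one of these spikes lined up with
            # total IOPS climbing above roughly 600.
            print(f"{row['timestamp']}: {iops:.0f} IOPS, {latency_ms:.0f} ms")
```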

This had me stumped until I discovered that the storage array had been populated with 7.2K SATA disks instead of the 15K disks I expected. Immediately I saw why latencies were ramping up at half the IOPS we had planned for: with 8 usable disks we should only expect around 600 IOPS, instead of the 1200 we expected.
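
The arithmetic behind that, using common rule-of-thumb per-spindle figures (rough estimates, not measurements from this array), looks like this:

```python
# Back-of-the-envelope aggregate IOPS for the array, using common
# rule-of-thumb per-spindle figures (rough estimates, not measurements):
# ~75 IOPS for a 7.2K SATA disk, ~150 IOPS for a 15K disk.
USABLE_DISKS = 8

def aggregate_iops(per_disk_iops, disks=USABLE_DISKS):
    return per_disk_iops * disks

print("7.2K SATA:", aggregate_iops(75))    # ~600 IOPS - what we actually had
print("15K:      ", aggregate_iops(150))   # ~1200 IOPS - what we expected
```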

The second point was where the mysterious extra IOPS were coming from. After more investigation we found that the NetApp FAS2020 runs hidden aggregate-level snapshots, which were tipping us well over the 600 IOPS threshold. These hidden jobs were set to run every 3 hours, so we rescheduled them to run outside production hours.

The high %costop values can be attributed to the fact that the vCPUs had to wait for I/O completion: because I/O completion was taking an extended period of time, ESX was co-stopping the vCPUs, leading to extremely poor performance.
