Unable to power off a VM, hangs at 95%
I had a problem with a dev machine that had become completely unresponsive: we were unable to ping it and unable to control it through vSphere.
I tried to power it down, but the progress halted at 95%.
I used PuTTY to SSH into the host and tried to power it down from the command line using
vim-cmd vmsvc/getallvms (to get the vmid)
then
vim-cmd vmsvc/power.off XX
I then checked the power state with
vim-cmd vmsvc/power.getstate XX and found that the machine was still up.
I then tried to pull the rug from under the VM by running
esxcli vm process list (to obtain the world ID of the misbehaving VM)
esxcli vm process kill --type=soft --world-id=XXXX
I checked the power state again with
vim-cmd vmsvc/power.getstate XX - still up, so I tried a hard kill:
esxcli vm process kill --type=hard --world-id=XXXX
That failed again too, so it was last chance saloon:
esxcli vm process kill --type=force --world-id=XXXX
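The soft, hard, force sequence above is really just an escalation loop: try the gentlest kill, check the power state, and escalate only if the VM is still up. A minimal Python sketch of that pattern, where kill and vm_is_off are hypothetical stand-ins for the esxcli kill command and the vim-cmd power-state check, not real API calls:

```python
def kill_with_escalation(world_id, kill, vm_is_off,
                         levels=("soft", "hard", "force")):
    """Try each kill type in order, stopping as soon as the VM is down.

    `kill(world_id, level)` and `vm_is_off(world_id)` stand in for
    `esxcli vm process kill --type=<level> --world-id=<id>` and a
    `vim-cmd vmsvc/power.getstate` check.
    """
    for level in levels:
        kill(world_id, level)
        if vm_is_off(world_id):
            return level   # the level that finally worked
    return None            # still up after --type=force: host reboot territory
```

If even the force kill fails, the usual next step is rebooting the host, since at that point the VMM process itself is wedged.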
Thursday, November 22, 2012
Thursday, September 20, 2012
NetApp PowerShell command to find all snapshots older than 28 days
1 Download the DataONTAP PS module and extract it to c:\windows\system32\windowspowershell\1.0\modules
2 Import the NetApp PowerShell module
Import-Module dataontap
3 Define $28days as today's date minus 28 days
$28days = (get-date).adddays(-28)
4 Connect to the controller
connect-nacontroller "controllername"
5 View the list of snapshots that are older than 28 days
Get-Navol | Get-NaSnapshot | where-object { $_.name -like "smvi_*" -and $_.created -lt $28days }
6 See what snapshots will be deleted
Get-Navol | Get-NaSnapshot | where-object { $_.name -like "smvi_*" -and $_.created -lt $28days } | remove-nasnapshot -whatif
7 Delete the snapshots
Get-Navol | Get-NaSnapshot | where-object { $_.name -like "smvi_*" -and $_.created -lt $28days } | remove-nasnapshot
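The filter in steps 5-7 (keep only snapshots named smvi_* whose creation time is older than the cutoff) can be sketched in plain Python for illustration; the snapshot records here are made up, not real Get-NaSnapshot output:

```python
from datetime import datetime, timedelta

def old_smvi_snapshots(snapshots, days=28):
    """Return snapshots named smvi_* that are older than `days` days.

    `snapshots` is a list of (name, created) tuples, where `created` is a
    datetime - a stand-in for the objects Get-NaSnapshot emits.
    """
    cutoff = datetime.now() - timedelta(days=days)
    return [(name, created) for name, created in snapshots
            if name.startswith("smvi_") and created < cutoff]

snaps = [
    ("smvi_backup_1", datetime.now() - timedelta(days=40)),  # old -> delete
    ("smvi_backup_2", datetime.now() - timedelta(days=5)),   # recent -> keep
    ("manual_snap",   datetime.now() - timedelta(days=90)),  # wrong prefix -> keep
]
for name, created in old_smvi_snapshots(snaps):
    print(name)
```

The -WhatIf pass in step 6 plays the same role as inspecting this returned list before acting on it.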
Tuesday, August 7, 2012
VMware Site Recovery Manager 5 & Netapp SRA 2.0 Failure
(Unable to export the NAS device Ensure that the correct export rules are specified in the ontap_config.txt file)
VMware Site Recovery Manager and the SRA were installed and configured, sites and resource mappings were configured correctly, the NetApp protected-site volumes were mirroring correctly, the SRA 2.0 array manager was installed and configured correctly, and protection groups were defined and configured correctly.
When I attempted to run a recovery plan for any of our sites, I would receive an error on the "Recovery Steps" tab which stated: "Error - Failed to recover datastore 'Vol1'. Failed to create snapshots of replica devices. Failed to create snapshot of replica device /vol/Vol1m. SRA command 'testFailoverStart' failed for device '/vol/Vol1m'. Unable to export the NAS device Ensure that the correct export rules are specified in the ontap_config.txt file"
I checked the contents of the ontap_config.txt file. This file is used to define the R/W and root hosts for accessing the cloned export of the mirrored production volume, and I confirmed that the VMkernel IPs for the NFS VMkernel were listed.
I reset and reran the SRM test and examined the VMware-DR-XXX.log file.
Here I could see that the cloned export of the paging volume came online:
“--> 07-08-2012T10:34:35 Export /vol/testfailoverClone_nss_v10745371_volpagem
has root & r/w IP=10.10.10.1”
But the production volumes failed:
--> 07-08-2012T10:34:35 Checking existence of storage device /vol/Vol1m
--> 07-08-2012T10:34:35 Storage device /vol/Vol1m is a NFS export
--> 07-08-2012T10:34:35 Creating test Clone volume testfailoverClone_nss_v10745371_Vol1m
--> 07-08-2012T10:34:59 Mapping Export /vol/testfailoverClone_nss_v10745371_Vol1m
--> 07-08-2012T10:34:59 Modify the exportfs for path /vol/testfailoverClone_nss_v10745371_volm2
--> 07-08-2012T10:34:59 Modify failed with error: No such file or directory
In the log I could see that volpagem had the correct VMkernel IPs listed as R/W and root hosts, but for the other two production volumes the default "All hosts" was listed for both R/W and root hosts. After much searching I found this post:
http://communities.vmware.com/message/2051567
The key point being
“This error is caused by a flaw in the NetApp SRA 2.0. If you have an "-actual" statement in the /etc/exports file on the snapmirror destination filer the SRA will fail to create the flexvol-sharename. So if you are carefull and only use sharenames that equals volumenames for all shares (!) then you avoid the "-actual" statement and the SRA seems to work.
NetApp has confirmed this to be a bug in the SRA.”
I read this to mean that if the production export contains -actual it will cause the failover to fail. I confirmed that none of the production exports were actual-path exports, and I thought I had hit a brick wall until a colleague noticed that one of the unrelated exports contained an "-actual" statement. I confirmed the export was not in use and removed it.
I reran the test and it succeeded.
Wednesday, July 25, 2012
Powershell script to alert for missing snapshots
We had a problem with SMVI not taking backups and, to make matters worse, not alerting us to the fact that it was not taking backups. The software can only alert if the backup job fails or if it generates warnings, which is fine.
But what happens when the job doesn't start? The SMVI service doesn't stop and nothing alerts you to the fact that no backups are being taken... this was the second time this happened. The first time could be attributed to a random occurrence, but the second time it happens... well, then you're waiting to get caught with your pants around your ankles.
I made a short script, scheduled to run each evening, which would send a mail if there were no snapshots less than a day old.
#Get midnight today as the cutoff date and store it in the variable $NowDate
$NowDate = (Get-Date).Date
#Loop through c:\snapshotsourcelist.txt, assigning one line at a time to the variable $snappath
$file = [System.IO.File]::OpenText("c:\snapshotsourcelist.txt")
while (!$file.EndOfStream)
{
$snappath = $file.ReadLine()
#Recursively scan through all subdirectories of $snappath
$Snapshotlist = Get-ChildItem $snappath -Recurse |
#We are only interested in files and not directories
Where-Object { !$_.PSIsContainer } |
#Can't scan this one for some reason, so excluding it from the search criteria
Where-Object { $_.Name -notlike "*iegwydc01*" } |
#Return files which have a datestamp of today
Where-Object { $_.LastWriteTime -gt $NowDate }
#If the variable is empty then send a mail
if (!$Snapshotlist) {
Write-Host "Snapshot Alert , No Snapshots for $snappath were taken since 00:00 last night"
$SmtpClient = New-Object System.Net.Mail.SmtpClient
$SmtpClient.Host = "mailserver.domain.com"
$From = "ie-dlitalerts@domain.com"
$To = "Darragh@domain.com"
$Title = "Snapshot Alert , No Snapshots for $snappath"
$Body = "Snapshot Alert , No Snapshots for $snappath"
$SmtpClient.Send($From, $To, $Title, $Body)
}
}
$file.Close()
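The core check the script performs (walk each path from the list file and flag it if nothing under it was written since midnight) can be expressed as a standalone function. A Python illustration of the same logic, with the exclusion pattern carried over from the script:

```python
import os
from datetime import datetime, date

def paths_missing_snapshots(snap_paths, exclude="iegwydc01"):
    """Return the paths that contain no file modified since midnight today."""
    midnight = datetime.combine(date.today(), datetime.min.time()).timestamp()
    stale = []
    for root_path in snap_paths:
        fresh = False
        for dirpath, _dirnames, filenames in os.walk(root_path):
            for name in filenames:
                if exclude in name:   # skip the unscannable file, as in the script
                    continue
                full = os.path.join(dirpath, name)
                if os.path.getmtime(full) >= midnight:
                    fresh = True      # found a snapshot from today
                    break
            if fresh:
                break
        if not fresh:
            stale.append(root_path)   # no snapshot today -> alert on this path
    return stale
```

Each path this returns would then trigger the alert mail, as in the PowerShell version.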
Storage basics, IOPS penalty and RAID
Post from Yellow-Bricks showing the write penalty for various RAID implementations
http://www.yellow-bricks.com/2009/12/23/iops/
Nice series of blogs to cement the basics ... again http://vmtoday.com/2009/12/storage-basics-part-i-intro/
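The write penalty from the linked Yellow-Bricks post can be turned into a quick calculation: frontend IOPS = backend IOPS / (read fraction + write fraction × penalty), with a per-write penalty of roughly 2 for RAID 10, 4 for RAID 5 and 6 for RAID 6. A small Python sketch; the disk counts, per-disk IOPS figure and 50/50 workload mix are illustrative assumptions:

```python
# Common per-write backend I/O penalties (a read costs 1 backend I/O)
RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

def frontend_iops(backend_iops, raid_level, write_fraction):
    """Frontend IOPS a RAID set can deliver for a given read/write mix.

    Each read consumes 1 backend I/O; each write consumes `penalty`
    backend I/Os, so: frontend = backend / (reads + writes * penalty).
    """
    penalty = RAID_WRITE_PENALTY[raid_level]
    read_fraction = 1.0 - write_fraction
    return backend_iops / (read_fraction + write_fraction * penalty)

# Example: 8 x 15K disks at ~180 IOPS each, 50% writes (ballpark figures)
backend = 8 * 180
for level in ("RAID10", "RAID5", "RAID6"):
    print(level, round(frontend_iops(backend, level, 0.5)))
```

The point of the exercise: the same spindles deliver very different usable IOPS depending on RAID level once the workload contains writes.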
High %costop values - no CPU contention - Poor performance
It presented itself as general performance problems in a regional office. Specifically, one of the application servers was performing very poorly, with frequent application timeouts; Exchange was going offline and VMs were becoming orphaned in vSphere. I logged in to a Windows server and saw that the CPU was operating at 100%, yet in the performance view in vSphere the server was consuming approximately 300MHz. This behaviour was repeated on all other servers in the cluster.
In the DRS view I could see that the servers were receiving approximately 10% of their entitled resources, and there were no limits or reservations set on any of the VMs. On examining the CPU counters in ESXTOP I found that all of the servers had extremely high co-stop (%CSTP) values (~80% - 90%). This would normally be indicative of over-committed CPU resources on SMP VMs, as ESX co-stops individual vCPUs to prevent skew when some vCPUs make progress while others cannot because they are scheduled behind other VMs. In our case this could not have been the cause, as we had more physical CPUs than vCPUs.
During the troubleshooting I noticed that we periodically had huge latencies on the storage system, sometimes spiking to 6 seconds. The first strange thing was that the latencies were within acceptable limits until the IOPS rose above 600; the second was that the combined total of CIFS and NFS IOPS was rarely sustained above 400 IOPS.
This had me stumped until I discovered that the storage array had been populated with 7.2K SATA disks instead of the 15K disks I expected. Immediately I saw why we weren't seeing at least double the number of IOPS before latencies ramped up: with 8 usable disks we should see around 600 IOPS, instead of the 1200 we expected.
The second question was where the mysterious extra IOPS were coming from. After more investigation we found that the NetApp 2020 has hidden aggregate-level snapshots which were tipping us well over the 600 IOPS threshold. These hidden jobs were set to run every 3 hours, so we rescheduled them to run outside of production hours.
The high co-stop value can be attributed to the fact that the vCPUs had to wait for I/O completion, and as I/O completion was taking an extended period of time, ESX was co-stopping the vCPUs, leading to extremely poor performance.
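The 600-versus-1200 figures follow from commonly quoted rule-of-thumb numbers of roughly 75 random IOPS per 7.2K SATA spindle versus around 150 per 15K disk; these are ballpark values, not measurements. A quick Python check of the arithmetic:

```python
def raw_aggregate_iops(disk_count, iops_per_disk):
    # Raw random IOPS of a disk group, before any RAID write penalty
    return disk_count * iops_per_disk

disks = 8
sata_7k2 = raw_aggregate_iops(disks, 75)    # ~600 IOPS - what the array delivered
sas_15k = raw_aggregate_iops(disks, 150)    # ~1200 IOPS - what was expected
print(sata_7k2, sas_15k)
```

With the real ceiling at ~600 IOPS, the hidden aggregate snapshots only had to add a couple of hundred IOPS to push latencies into the multi-second range.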
Friday, July 6, 2012
PS Script to move users to an OU
AD functional account cleanup: I created this script to move a list of users from multiple OUs to a single OU where they will be mass disabled.
import-CSV c:\testdisableduseraccounts.csv -Header @("Name") | foreach-object { Get-ADUser $_.Name | Move-ADObject -Targetpath "ou=temp disabled,ou=disabled user accounts,dc=DC,dc=DC,dc=domain,dc=com" }
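The -Header @("Name") switch matters because the input file has no header row; Import-CSV supplies the column name itself. The same read pattern in Python, for illustration (the file name and contents below are made up):

```python
import csv

def read_user_names(path):
    """Read a headerless one-column CSV of account names.

    Passing fieldnames=["Name"] mirrors Import-CSV -Header @("Name"):
    the first line is treated as data, not as a header.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f, fieldnames=["Name"])
        return [row["Name"] for row in reader]
```

Each returned name would then feed the move operation, as Get-ADUser | Move-ADObject does in the one-liner above.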