Thursday, 20 February 2014
Had an interesting issue today: an insanely slow, brand-new server... or so I thought. First, a bit of background on the client. They are fairly large, with over 400 client access devices to maintain, not including servers, network equipment and so on. To support this the client has three servers. Each year one new server is purchased and the oldest one is thrown out, which keeps all of them in warranty and gives the client modern, powerful equipment to keep things running. In addition there are two other servers that are replaced every three years. These are treated differently because they are speciality servers that each do only one task.
So, I was building a new server for the client, in this case an HP ML350p Gen8 with Windows Server 2012 R2 Standard running as a hypervisor. Nothing special in that; I do this so regularly that it has become a routine build for me. What got me with this one, though, was that during testing I was getting insanely high ping latency, not only between the network and the virtual machines, but also between the hypervisor and its own virtual machines. Pings to virtual machines on other servers, on different LAN segments, were all responding normally in <1 ms.
My first thought was that something was wrong with the virtual machines and that I had butchered something in the migration, but they worked on other hypervisors without delays, which knocked that one on the head. Then I thought it might be a network location issue, but that does not make sense, because a ping from a hypervisor to its own guest does not go across the physical network. So it had to be the brand-new server.
OK, so what's new about this server? Well, it has newer processors, more memory, and faster HDDs with larger capacities; basically it was more a case of what wasn't different from the last server. I'm not going to go through the whole troubleshooting process, but basically it came down to the NICs. Fine, but what about them? After trial and error and, of course, every tech's most important tool, Google, I found the issue.
THE ISSUE IS VMQ, or Virtual Machine Queue, a setting inside the Broadcom NIC drivers, as shown below. Disable it and the issue clears instantly.
Pings and other indicators are now back down to <1 ms, which is what I expected to see in the first place.
The hardware and software affected by this were as follows:
HP ML350p Gen8
Windows Server 2012 R2 Standard
Onboard Broadcom quad-port NIC
Sunday, 2 February 2014
While working on a Hyper-V 3.0 deployment recently, I noticed that all of the virtual machines running on all newly built hypervisors were showing unhealthily high latency when pinging anything on the local network. Pinging the same destinations from the hypervisors resulted in perfectly reasonable 1 ms ping times, but pings from inside the virtual machines varied from 1 ms all the way past 200 ms.
I started digging, thinking initially that it was the clock synchronization issue that we used to see on AMD as well as Intel chips, specifically on HP DL servers, in the Windows Server 2003 days. The issue is not very well documented, but one of the few Microsoft articles that attempts to explain it is here. The solution to that issue was to add the /USEPMTIMER switch to the boot.ini file. It used to fix negative ping times on AMD chips as well as lengthy ping times on Intel chips.
Windows Server 2012 obviously does not have a boot.ini file, so the way to add this switch is to issue the following command in an elevated command prompt or PowerShell session:
bcdedit.exe /set USEPLATFORMCLOCK on
(This explains the right side of the screenshot above – boot configuration is shown “before” and “after” running this command).
Unfortunately, applying this configuration at the hypervisor level and rebooting the server made no difference to ping times from the VMs (guest partitions).
Virtual Machine Queues
Virtual machine queues were introduced in Windows Server 2008 R2. The purpose of this feature is to improve the network performance of virtual machines that receive a lot of inbound traffic, by giving them more direct access to the hardware NIC. To quote directly from Microsoft:
“When VMQ is enabled, a dedicated queue is established on the physical network adapter for each virtual network adapter that has requested a queue. As packets arrive for a virtual network adapter, the physical network adapter places them in that network adapter’s queue. When packets are indicated up, all the packet data in the queue is delivered directly to the virtual network adapter. Packets arriving for virtual network adapters that don’t have a dedicated queue, as well as all multicast and broadcast packets, are delivered to the virtual network in the default queue. The virtual network handles routing of these packets to the appropriate virtual network adapters as it normally would.”
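The dispatch rule in that quote can be illustrated with a toy model. The sketch below is only a simplified Python illustration of the rule as quoted, not how a NIC or its driver is actually implemented; the class name, method names and MAC addresses are all made up for the example.

```python
from collections import defaultdict


def is_multicast(mac):
    """A MAC address is multicast (or broadcast) when the low bit of its first octet is set."""
    return int(mac.split(":")[0], 16) & 1 == 1


class ToyVmqNic:
    """Hypothetical model of a physical NIC that sorts inbound packets into queues."""

    def __init__(self):
        self.dedicated = set()            # vNIC MACs that have requested their own queue
        self.queues = defaultdict(list)   # queue name -> payloads waiting in that queue

    def request_queue(self, vnic_mac):
        """A virtual network adapter asks for a dedicated queue."""
        self.dedicated.add(vnic_mac.lower())

    def receive(self, dest_mac, payload):
        """Place an inbound packet in a queue and return the queue's name."""
        dest = dest_mac.lower()
        if not is_multicast(dest) and dest in self.dedicated:
            queue = dest        # unicast to a vNIC that has its own queue
        else:
            queue = "default"   # no dedicated queue, or multicast/broadcast traffic
        self.queues[queue].append(payload)
        return queue


nic = ToyVmqNic()
nic.request_queue("00:15:5d:00:00:01")            # made-up vNIC MAC
print(nic.receive("00:15:5d:00:00:01", "to vm1"))  # lands in the dedicated queue
print(nic.receive("00:15:5d:00:00:02", "to vm2"))  # no queue requested: default queue
print(nic.receive("ff:ff:ff:ff:ff:ff", "arp"))     # broadcast: default queue
```

In this model the virtual switch would then be responsible for routing whatever sits in the default queue, as the quote describes.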
When the hypervisors were built, the VMQ features of the network card on the physical host were enabled to achieve better VM network performance. Disabling performance features such as VMQ, TCP Chimney, and Receive Side Scaling showed that VMQ was the root cause of the high ping latency. As soon as VMQ was disabled at the parent partition's NIC driver level, VM ping times came down to a steady <1 ms reading.
The hardware/software affected by this issue:
HP DL 360p Gen8 server chassis
4-port 1Gbps Broadcom NIC (HP Ethernet 1Gb 4-port 331FLR Adapter)
Windows Server 2012 Datacenter Edition
Broadcom’s NIC driver dated October 26 2012, version 220.127.116.11 (link to HP site)
Bottom line: always perform at least basic QA of your builds before putting them into production. Some hardware (and more likely drivers) may cause performance issues instead of providing a performance boost - especially in early releases.
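As one possible form of such a basic QA check, here is a minimal Python sketch that scans captured ping output for reply times and flags anything above a threshold. The function names, the 5 ms threshold and the sample address are assumptions for illustration only, and a reply reported as "time<1ms" is counted as 1 ms (its upper bound).

```python
import re

# Matches "time=213ms", "time<1ms" and also "time=0.4 ms" style reply times.
_TIME_RE = re.compile(r"time[=<](\d+(?:\.\d+)?)\s*ms", re.IGNORECASE)


def parse_ping_times(output):
    """Return the reply times (in ms) found in ping output, in order of appearance."""
    return [float(value) for value in _TIME_RE.findall(output)]


def latency_ok(output, threshold_ms=5.0):
    """True only if at least one reply was seen and none exceeded the threshold."""
    times = parse_ping_times(output)
    if not times:
        return False  # no replies at all is also a failed check
    return max(times) <= threshold_ms


# Made-up sample resembling what a guest on an affected host might show.
sample = """Reply from 192.168.1.10: bytes=32 time<1ms TTL=128
Reply from 192.168.1.10: bytes=32 time=213ms TTL=128
Request timed out."""

print(parse_ping_times(sample))
print(latency_ok(sample))
```

The text fed to these functions could be captured from a ping run inside each new VM and from the hypervisor itself, so that a gap like the one described above shows up before the build goes into production.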