The incident that caused last week's outage on our main virtual server cluster was fairly technical. Our CTO sent out outage notices and explanations, but those were also on the technical side.
Here is a more basic explanation, in layman’s terms, of what went wrong and the steps we are taking to prevent a similar incident.

Background

To understand what went wrong, it helps to know the basic layout of the virtual server cluster. The cluster consists of three front-end Dell physical machines and, as the back-end storage, a dual-head NAS with hot-swappable spare disks.

Front-end Server Details

At any given point in time, each virtual server resides on one of the front-end physical machines. Virtual servers can move from one front-end physical machine to another while live, without interruption or dropped packets. In fact, this happens automatically for customers who have purchased High Availability and Dynamic Resource Scheduling (HA/DRS) services for their virtual servers. The HA feature automatically reboots a virtual server on another physical front-end machine if a hardware failure occurs on the machine where it was running. DRS moves live virtual servers, without service interruption, from one physical front-end machine to another if the load on the original machine increases, ensuring that a virtual server configured with DRS receives optimal front-end hardware resources.
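
To make the HA and DRS behaviors concrete, here is a minimal sketch in Python of the two ideas described above. It is purely illustrative and is not VMware's actual logic; the host names, load figures, and threshold are hypothetical.

    # Hypothetical CPU load on each front-end physical machine (0.0 to 1.0).
    HOSTS = {"esx1": 0.90, "esx2": 0.35, "esx3": 0.40}

    def ha_restart(vm, failed_host):
        """HA: reboot the virtual server on a surviving host after a hardware failure."""
        survivors = {h: load for h, load in HOSTS.items() if h != failed_host}
        target = min(survivors, key=survivors.get)
        print(f"HA: restarting {vm} on {target}")

    def drs_balance(vm, current_host, threshold=0.80):
        """DRS: live-migrate the virtual server if its current host is under heavy load."""
        if HOSTS[current_host] > threshold:
            target = min(HOSTS, key=HOSTS.get)
            print(f"DRS: live-migrating {vm} from {current_host} to {target} without interruption")

    ha_restart("web-vm", failed_host="esx1")   # HA reboots the VM onto the least-loaded survivor
    drs_balance("db-vm", current_host="esx1")  # DRS moves the live VM off the overloaded host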

Back-end Storage Details

All storage on the cluster resides on a NexentaStor dual-head NAS cluster with Seagate SAS disks. Storage on the NAS is accessed via the Network File System (NFS) protocol. The cluster can easily sustain more than 10,000 I/O operations per second and is designed for performance and reliability. Each head unit has its own volume (containing virtual servers and all their files) and stands ready to automatically take over the other head unit's volume in case of a single head failure. When the cluster was originally configured in early 2010, deduplication was turned on. Deduplication is a storage feature that can improve disk utilization by storing only unique blocks of data instead of redundant copies of identical data. The decision to turn deduplication on was made after researching white papers and blog posts, which indicated that deduplication provided a 10-20% savings in disk utilization and disk I/O without any noticeable performance overhead.
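
For readers who want a concrete picture of deduplication, the short Python sketch below shows the basic idea: each block of data is hashed, and an identical block is stored only once no matter how many files contain it. This is a toy illustration rather than how NexentaStor actually implements the feature; the block size and file names are made up.

    import hashlib

    BLOCK_SIZE = 4096  # hypothetical block size

    class DedupStore:
        def __init__(self):
            self.blocks = {}     # block hash -> block data, stored once
            self.file_maps = {}  # file name -> list of block hashes

        def write_file(self, name, data):
            hashes = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                digest = hashlib.sha256(block).hexdigest()
                self.blocks.setdefault(digest, block)  # store only if not already present
                hashes.append(digest)
            self.file_maps[name] = hashes

        def read_file(self, name):
            return b"".join(self.blocks[h] for h in self.file_maps[name])

    store = DedupStore()
    store.write_file("vm1.vmdk", b"A" * 8192)  # two identical blocks
    store.write_file("vm2.vmdk", b"A" * 4096)  # a duplicate of an existing block
    print(len(store.blocks))                   # 1 unique block stored for 12 KB of writes
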
NFS, the file protocol used by the NAS, employs something called a duplicate request cache to preserve data integrity. The duplicate request cache stores a copy of all non-idempotent requests (ones that modify data on disk, such as WRITE, CREATE, and REMOVE). New non-idempotent requests are compared against the duplicate request cache to ensure that each request is executed only once, avoiding data corruption.
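
As a rough illustration of that idea (an assumption-laden sketch, not the NAS's actual code), a duplicate request cache can be pictured like this: replies to non-idempotent requests are remembered, so a retransmitted request is answered from the cache rather than executed a second time, and the cache can hold only so many entries before old ones are evicted.

    NON_IDEMPOTENT = {"WRITE", "CREATE", "REMOVE"}

    class DuplicateRequestCache:
        def __init__(self, max_entries=1024):
            self.max_entries = max_entries
            self.cache = {}  # (client_id, transaction_id) -> saved reply

        def handle(self, client_id, xid, op, execute):
            if op not in NON_IDEMPOTENT:
                return execute()            # safe to re-run, no caching needed
            key = (client_id, xid)
            if key in self.cache:
                return self.cache[key]      # retransmission: replay the saved reply
            reply = execute()
            if len(self.cache) >= self.max_entries:
                self.cache.pop(next(iter(self.cache)))  # evict the oldest entry
            self.cache[key] = reply
            return reply

    drc = DuplicateRequestCache(max_entries=2)
    drc.handle("clientA", 1, "WRITE", lambda: "wrote block 1")
    # The retransmitted request gets the cached reply; the write is not executed twice.
    print(drc.handle("clientA", 1, "WRITE", lambda: "wrote block 1 again"))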

Outage Overview

The issues that occurred on February 15, 2011, were on the NAS storage cluster, not on the front-end machines. This is why HA/DRS on the front-end servers did not prevent an outage.
At approximately 11:30am, the duplicate request cache on one of the NAS controllers filled up, degrading performance and causing one of the two NAS storage volumes to repeatedly go offline and come back online (the front-end physical machines were getting retry errors, timing out on the retries, and then trying to remount). At that point we attempted to use the NAS tools to move the affected volume over to the second head unit so that all virtual servers could stay online while we examined the issues with the first head unit. Importing this volume on the second head unit took far longer than we anticipated. We first attempted to speed up the import by forcing it, but forcing the import did not noticeably speed it up, and all virtual servers on that storage volume remained offline until the full volume import completed a couple of hours later.
So why did it take so long to transfer one storage volume from one NAS head unit to the other? Although the volume holds a lot of data, the import still took longer than expected. It turns out that the data deduplication feature that saved 10-20% of disk utilization also significantly slowed down the import process. As the volume was importing, the deduplication table was also loading. At this time it appears that the long import (and therefore the long downtime) was caused by the extra time it took to load the data deduplication table.
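
As a toy model of why that matters during a failover (our assumed sequence of steps, not NexentaStor's actual import code, and "volume1" is just a placeholder name), the sketch below shows that a deduplicated volume has one extra, potentially very large piece of state to load before it can come back online.

    def import_volume(volume_name, dedup_enabled):
        """Toy model of importing a storage volume onto the second head unit."""
        steps = [f"read metadata for {volume_name}"]
        if dedup_enabled:
            # With deduplication on, the deduplication table must also be loaded,
            # which is the extra time we believe caused the long import.
            steps.append("load the data deduplication table")
        steps.append(f"bring {volume_name} online and re-export it over NFS")
        return steps

    for step in import_volume("volume1", dedup_enabled=True):
        print(step)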

Steps We Have Taken to Prevent Future Incidents

  1. We have increased the size of the duplicate request cache. The cache filling up was the trigger event for this incident, and we expect the larger cache to prevent similar NFS performance issues in the future.
  2. We are turning off all data deduplication on the NAS. This will take some time, because turning off deduplication on the volume as a whole does not remove the data that has already been deduplicated. To fully remove deduplication from the VMware virtual disks that currently use it, each disk must be moved to a new volume and then back to its original volume. Moving individual VMware disks this way does not interrupt service, but to be on the safe side we are performing these moves during evenings and on weekends.
The virtual server cluster has proven to be highly reliable. Last week was the first wholesale service interruption since its initial configuration in early 2009. Again, we apologize for last week's outage. Rest assured that we are doing all we can to deliver cost-effective servers and services with the high performance your business depends on.