The Achilles’ Heel of Performance Isolation in the Cloud

I just read an interesting article, “Has Amazon EC2 Become Oversubscribed?”, which describes one company’s relatively large-scale usage of EC2 over a three-year period. Apparently, over the first 18 months, EC2 small instances provided sufficient performance for their largely I/O-bound network servers. Recently, however, they have increasingly run into problems with “noisy neighbors”. They speculate that they happen to be co-located with other virtual machines that use so much CPU that the small instances are unable to get their fair share. They moved to large instances to avoid the noisy neighbor problem.

More recently, however, they have been finding unacceptable local network performance even on large instances, with ping times ranging from hundreds of milliseconds to even seconds in some cases. This increase in ping time is likely due to overload on the end hosts, with the hypervisor unable to keep up with even the network load imposed by pings. (The issue is highly unlikely to be in the network because switches deployed in the data center do not have that kind of buffering.)

The conclusion from the article, along with the associated comments, is that Amazon has not sufficiently provisioned EC2 and that is the cause of the overload.

While this is purely speculation on my part, I believe that underprovisioning of the cloud compute infrastructure is unlikely to be the sole cause of the problem. Amazon describes very explicitly the amount of computing power associated with each instance type, and it is fairly straightforward to configure the hypervisor to allocate a fixed share of the CPU to each virtual machine instance. I am assuming that Amazon has set the CPU schedulers to reserve the appropriate portion of each machine’s physical CPU (and memory) for each VM.
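To make the idea of a fixed share concrete, here is a minimal sketch of weighted fair-share CPU allocation in Python. It is only an illustration under my own assumptions, not a description of how EC2 or any particular hypervisor scheduler is implemented. The function name `weighted_fair_share`, the VM names, and the weights and demands are all hypothetical.

```python
def weighted_fair_share(weights, demands, capacity=1.0):
    """Split `capacity` CPU among VMs in proportion to their weights.

    A VM that wants less than its proportional slice gets exactly what it
    asks for, and the leftover is redistributed among the remaining VMs
    (a work-conserving policy). All inputs are hypothetical.
    """
    eps = 1e-12
    alloc = {vm: 0.0 for vm in weights}
    active = {vm for vm in weights if demands[vm] > 0}
    remaining = capacity

    while active and remaining > eps:
        total_weight = sum(weights[vm] for vm in active)
        # VMs whose whole demand fits inside their proportional slice.
        satisfied = {
            vm for vm in active
            if demands[vm] <= remaining * weights[vm] / total_weight + eps
        }
        if not satisfied:
            # Everyone left wants more than their slice: split by weight.
            for vm in active:
                alloc[vm] = remaining * weights[vm] / total_weight
            break
        for vm in satisfied:
            alloc[vm] = demands[vm]
            remaining -= demands[vm]
        active -= satisfied

    return alloc


# Hypothetical example: one mostly idle VM and two CPU-hungry ones.
example = weighted_fair_share(
    weights={"a": 1, "b": 1, "c": 2},
    demands={"a": 0.1, "b": 1.0, "c": 1.0},
)
```

With these made-up inputs, the idle VM gets the 0.1 it asks for and the other two split the remaining 0.9 in a 1:2 ratio. The point is that for pure CPU demand, this kind of policy is easy to state and easy to enforce.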

For CPU-bound VMs, the hypervisor scheduler is quite efficient at allocating resources according to administrator-specified levels. However, the Achilles’ heel of scheduling in VM environments is I/O. More particularly, the hypervisor typically has no way to account for the work performed on behalf of individual VMs in either the hypervisor or (likely the bigger culprit) in driver domains responsible for things like network I/O or disk I/O. If a particular instance performs significant network communication (either externally or to other EC2 hosts), the corresponding system calls will first go into the guest’s own kernel. That kernel likely has a virtual device driver for either the disk or the NIC. However, for protection, the virtual device driver cannot have access to the actual physical device. Hence, the kernel driver must transfer control to the hypervisor, which in turn likely transfers control to a device driver running in a separate domain.

The work done in the driver domain on behalf of a particular VM is difficult to account for. In fact, this work is typically not “billed” back to the original domain. So, a virtual machine can effectively mount a denial-of-service attack (whether malicious or not) on other co-located VMs simply by performing significant I/O. With colleagues at HP Labs, we wrote a couple of papers investigating this exact issue a few years back.
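The accounting gap can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions of my own, not a measurement of EC2 or of any real hypervisor: a single CPU, a driver domain that always runs ahead of the guests, a fixed CPU cost per packet, and an equal split of whatever CPU is left. The name `simulate_interval` and all the numbers are hypothetical.

```python
def simulate_interval(packets_per_vm, cost_per_packet, capacity=1.0):
    """Model one scheduling interval on a single CPU.

    Assumptions (all hypothetical):
      * the driver domain runs first and spends `cost_per_packet` CPU
        seconds on every packet, whichever VM it belongs to;
      * the CPU left over is split equally among the guest VMs;
      * the scheduler bills each VM only for time spent in the guest.
    Returns {vm: {"billed": ..., "actual": ...}} in CPU seconds.
    """
    total_packets = sum(packets_per_vm.values())
    driver_time = min(capacity, total_packets * cost_per_packet)
    guest_share = (capacity - driver_time) / len(packets_per_vm)

    report = {}
    for vm, packets in packets_per_vm.items():
        # Driver-domain time attributable to this VM's own packets.
        if total_packets:
            driver_for_vm = driver_time * packets / total_packets
        else:
            driver_for_vm = 0.0
        report[vm] = {
            "billed": guest_share,                  # what the scheduler sees
            "actual": guest_share + driver_for_vm,  # what the VM really cost
        }
    return report


# Hypothetical workload: one VM pushes 40,000 packets, the other none.
report = simulate_interval(
    packets_per_vm={"noisy": 40_000, "quiet": 0},
    cost_per_packet=10e-6,  # assumed 10 microseconds of driver CPU per packet
)
```

In this made-up scenario both VMs are billed for roughly 0.3 of the CPU, yet the I/O-heavy VM actually consumes about 0.7 once its driver-domain work is counted, and the quiet VM ends up with less than the half it was nominally promised.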

As mentioned above, however, without access to the actual workloads on EC2, it is impossible to know whether the hypervisor scheduler is really the culprit. It will be interesting to see whether Amazon has a response.

4 Responses to “The Achilles’ Heel of Performance Isolation in the Cloud”

  1. 1 Etherealmind January 18, 2010 at 2:43 am

    I suspect that Amazon may be telling the truth. The problem may not be in the EC2 server cloud but in the network that supports the server infrastructure. Problems with the switching/routing core would also cause these problems but no-one seems to be considering this in the debate.

    Server folks rarely see the whole problem.

    “if your only tool is a hammer then every problem looks like a nail” – Abraham Maslow

    • 2 aminvahdat January 18, 2010 at 7:21 am

      Thanks for the comment. I am at least as much a networking person as a server person. In some sense, the problem would be much more interesting if it were in the network, especially since I believe that the network is the “missing ingredient” in the cloud virtualization story. If latencies in the hundreds of ms to seconds are caused by the network, then this would add ammunition to the case for better support of the network hardware, especially for large-scale services running in the cloud. The subject of a future post…

  2. 3 Todd Deshane February 2, 2010 at 5:46 pm

    We did a study at Clarkson University on Performance Isolation back in 2007. “Quantifying the Performance Isolation Properties of Virtualization Systems”

    Here is a link to the paper: isolation_ExpCS_FINALSUBMISSION.pdf


Amin Vahdat is a Professor in Computer Science and Engineering at UC San Diego.

January 2010
