I just read an interesting article, “Has Amazon EC2 Become Oversubscribed?” It describes one company’s relatively large-scale use of EC2 over a three-year period. Apparently, for the first 18 months, EC2 small instances provided sufficient performance for their largely I/O-bound network servers. Recently, however, they have increasingly run into problems with “noisy neighbors”: they speculate that they are co-located with other virtual machines consuming so much CPU that their small instances cannot get their fair share. They recently moved to large instances to avoid the noisy neighbor problem.
More recently, however, they have seen unacceptable local network performance even on large instances, with ping times ranging from hundreds of milliseconds to, in some cases, entire seconds. This increase in ping time is likely due to overload on the end hosts, with the hypervisor unable to keep up with even the network load imposed by pings. (The issue is highly unlikely to be in the network itself, because the switches deployed in data centers do not have that kind of buffering.)
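Spikes like these are easy to quantify directly from raw ping output. Here is a minimal sketch (the sample output and addresses are illustrative, not measurements from EC2) that parses per-packet RTTs and computes simple statistics; a long tail with sub-millisecond medians and multi-hundred-millisecond outliers is the signature of an overloaded end host rather than a congested switch:

```python
import re

def parse_ping_rtts(output):
    """Extract per-packet RTTs (in ms) from standard `ping` output lines
    like: '64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.31 ms'."""
    return [float(m.group(1)) for m in re.finditer(r"time=([\d.]+) ms", output)]

def percentile(values, p):
    """Nearest-rank percentile of a list of RTT samples."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, int(round(p / 100.0 * len(ordered))) - 1))
    return ordered[k]

# Illustrative sample: mostly sub-millisecond RTTs with one huge outlier.
sample = """\
64 bytes from 10.0.0.5: icmp_seq=1 ttl=64 time=0.31 ms
64 bytes from 10.0.0.5: icmp_seq=2 ttl=64 time=0.28 ms
64 bytes from 10.0.0.5: icmp_seq=3 ttl=64 time=412.77 ms
64 bytes from 10.0.0.5: icmp_seq=4 ttl=64 time=0.35 ms
"""
rtts = parse_ping_rtts(sample)
print(min(rtts), percentile(rtts, 50), max(rtts))
```

In practice one would feed this the output of `ping -c N <neighbor>` collected over time and watch the high percentiles, not the average.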
The conclusion of the article, and of the associated comments, is that Amazon has not sufficiently provisioned EC2 and that this underprovisioning is the cause of the overload.
While this is purely speculation on my part, I believe that underprovisioning of the cloud compute infrastructure is unlikely to be the sole cause of the problem. Amazon publishes very explicit descriptions of the computing power associated with each instance type, and it is fairly straightforward to configure the hypervisor to allocate a fixed share of the CPU to each virtual machine instance. I am assuming that Amazon has set its CPU schedulers to reserve the appropriate portion of each machine’s physical CPU (and memory) for each VM.
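To see why fixed shares should protect against purely CPU-bound neighbors, consider this toy model of proportional-share allocation (the function and numbers are my own illustration, not Xen's actual credit scheduler): each VM is entitled to capacity in proportion to its weight, and unused entitlement is redistributed to VMs that still have demand. A neighbor demanding ten times its share cannot eat into another VM's reservation:

```python
def allocate_cpu(demands, weights, capacity=1.0):
    """Proportional-share CPU allocation: each VM is entitled to
    capacity * weight / total_weight; entitlement left unused by
    satisfied VMs is redistributed to VMs with remaining demand."""
    alloc = {vm: 0.0 for vm in demands}
    remaining = dict(demands)
    spare = capacity
    active = set(demands)
    while spare > 1e-9 and active:
        w = sum(weights[vm] for vm in active)
        used = 0.0
        next_active = set()
        for vm in active:
            share = spare * weights[vm] / w
            take = min(share, remaining[vm])
            alloc[vm] += take
            remaining[vm] -= take
            used += take
            if remaining[vm] > 1e-9:
                next_active.add(vm)
        if used < 1e-12:
            break
        spare -= used
        active = next_active
    return alloc

# One physical CPU, equal weights: a "noisy" VM demanding 10x the
# machine, and a small instance wanting only 40% of a CPU.
alloc = allocate_cpu(demands={"noisy": 10.0, "small": 0.4},
                     weights={"noisy": 1, "small": 1})
print(alloc)  # small gets its full 0.4; noisy gets the remaining 0.6
```

Under this model a CPU-hungry neighbor alone cannot explain the reported starvation, which is what points the finger at I/O instead.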
For CPU-bound VMs, the hypervisor scheduler is quite efficient at allocating resources according to administrator-specified levels. The Achilles' heel of scheduling in VM environments, however, is I/O. In particular, the hypervisor typically has no way to account for the work performed on behalf of individual VMs, either in the hypervisor itself or (likely the bigger culprit) in the driver domains responsible for network and disk I/O. If a particular instance performs significant network communication (either externally or to other EC2 hosts), the corresponding system calls first go into the local kernel, which likely has a virtual device driver for the disk or the NIC. For protection, however, that virtual driver cannot access the physical device directly. The kernel driver must therefore transfer control to the hypervisor, which in turn likely hands off to a device driver running in a separate driver domain.
The work done in the driver domain on behalf of a particular VM is difficult to account for; in fact, it is typically not “billed” back to the originating domain. A virtual machine can therefore effectively mount a denial-of-service attack (whether malicious or not) on co-located VMs simply by performing significant I/O. With colleagues at HP Labs, we wrote a couple of papers investigating this exact issue a few years back:
- Enforcing Performance Isolation Across Virtual Machines in Xen, Diwaker Gupta, Ludmila Cherkasova, and Amin Vahdat, Proceedings of the ACM/IFIP/USENIX Middleware Conference, November 2006.
- Comparison of the Three CPU Schedulers in Xen, Diwaker Gupta, Lucy Cherkasova, and Amin Vahdat, ACM SIGMETRICS Performance Evaluation Review (PER) 35(2):42-51, September 2007.
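The accounting gap described above can be sketched in a few lines. In this toy model (the per-packet cost and all the numbers are invented for illustration, not measurements of EC2 or Xen), each guest is billed only for its own vCPU time, while the CPU the driver domain burns processing packets on a guest's behalf appears nowhere in that guest's bill:

```python
# Toy accounting model: per-VM vCPU time is billed to the VM, but
# driver-domain (dom0) CPU spent on a VM's I/O is billed to dom0.
PER_PACKET_COST = 0.00002  # hypothetical dom0 CPU-seconds per packet

def true_footprint(vcpu_time, packets):
    """CPU actually consumed on behalf of the VM over the interval,
    including the unbilled driver-domain work."""
    return vcpu_time + packets * PER_PACKET_COST

# Two co-located guests over a 1-second interval (illustrative numbers):
# 'iobound' uses little vCPU time but drives heavy network I/O;
# 'neighbor' is a plain CPU-bound guest.
billed = {"iobound": 0.10, "neighbor": 0.40}   # what the scheduler sees
packets = {"iobound": 20000, "neighbor": 100}

actual = {vm: true_footprint(billed[vm], packets[vm]) for vm in billed}
print(actual)
# To the scheduler, 'iobound' looks like the lighter guest (0.10 vs.
# 0.40 CPU-s), yet its true footprint exceeds the neighbor's.
```

Because the scheduler only ever sees the billed column, it keeps granting the I/O-heavy guest its “fair” share while dom0 does the rest of that guest's work, and every other VM on the machine pays the price in latency.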
As mentioned above, however, without access to the actual workloads on EC2, it is impossible to know whether the hypervisor scheduler is really the culprit. It will be interesting to see whether Amazon responds.