What do you do when somebody comes to live next door and spoils the whole neighborhood? They leave, you leave or you learn to live with it.
What does that have to do with cloud computing? I will give you an example.
This site is currently hosted on an Amazon EC2 virtual machine (VM). In all likelihood, there are other VMs hosted on the same physical machine. Let’s call these other VMs neighbors.
Over the past few weeks, this site was really slow on several occasions, each time for an extended period of around 20 minutes. Closer inspection revealed that a lot of the CPU time was reported as ‘stolen’. See Adrian Otto’s article on stolen CPU time for more explanation. Monitoring over a longer period with tools such as New Relic led to the hypothesis that another VM on the same physical machine was running CPU-bound batch jobs. That is a bad neighbor.
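For readers who want to check this on their own Linux VM, here is a minimal sketch of how the stolen share can be estimated. It assumes the standard /proc/stat layout, where the eighth counter on the aggregate cpu line is steal time. The function names are my own for illustration, and this is not how New Relic or vmstat compute their figures internally.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of CPU time reported as stolen between two /proc/stat snapshots."""
    start = parse_cpu_line(before)
    end = parse_cpu_line(after)
    deltas = [e - s for s, e in zip(start, end)]
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted in user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    steal = deltas[7] if len(deltas) > 7 else 0
    return 100.0 * steal / total


def sample_steal_percent(interval_seconds=5.0):
    """Read /proc/stat twice on a live Linux machine (assumed to exist)."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval_seconds)
    with open("/proc/stat") as f:
        after = f.read()
    return steal_percent(before, after)
```

Calling sample_steal_percent() every minute or so and logging the result should be enough to see whether slow periods coincide with a high stolen share.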
I don’t want to live with this, and I cannot force this neighbor out. So I have to move. This turned out to be easier than I thought. On Amazon EC2 my VM is running with EBS storage, which means its ‘hard disk’ exists even if the VM is gone. The VM instance also has a so-called Elastic IP address, which is independent of the physical machine.
So I stopped the VM and started it again. It then landed on a different physical machine with different neighbors. Users saw only a brief interruption, nothing worse than what we were going through anyway.
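The stop and start can also be scripted. The following is a rough sketch, assuming an EBS-backed instance and a boto3 EC2 client. The function name and the instance ID in the usage comment are placeholders of mine. A stop followed by a start usually places the instance on different hardware, but that is not guaranteed, and a plain reboot does not move it.

```python
def relocate_instance(ec2_client, instance_id):
    """Stop an EBS-backed instance and start it again, waiting at each step.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    ec2_client.stop_instances(InstanceIds=[instance_id])
    # Block until the instance is fully stopped; starting earlier would fail.
    ec2_client.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2_client.start_instances(InstanceIds=[instance_id])
    # Block until the instance is reported as running again.
    ec2_client.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Usage (requires boto3 and AWS credentials; the instance ID is hypothetical):
#   import boto3
#   relocate_instance(boto3.client("ec2"), "i-0123456789abcdef0")
```

Passing the client in as an argument keeps the function easy to try out against a stand-in object before pointing it at a real account.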
With proper load balancing, users would not notice it at all. Netflix is reportedly doing the same thing: killing VMs that have landed in bad neighborhoods.
2 Comments on “Bad neighbors in the cloud”
Christ Leijtens, 21 June 2012 at 09:24
Doesn’t the article by Adrian Otto on stolen CPU time, as reported in vmstat output, say the opposite? That the time is only stolen from you when you have nothing running and leave your CPU entitlement idle?
pve, 21 June 2012 at 09:37
The semantics are not fully clear to me. However, when 100% of your CPU time is stolen, there is little performance left in any case!