Thursday, 11 July 2013

What is Co-Scheduling?

Proportional Share
ESX uses a proportional share algorithm to schedule virtual machine worlds to run on physical CPUs.  A world is more or less a process.  Because of all the resource management capabilities that ESX provides, it is not always easy to decide who gets to run first, and for how long.  If all things were equal, you would just create a list and have every process execute in turn.  Reservations, limits, and shares add another dimension to that decision.  So ESX factors in that information and creates something called an entitlement.  With that, you now have a world, and a world has an entitlement.  If a world is entitled to a resource and has not yet consumed it, it has priority to be scheduled to run.
As that world runs, its run time has to be accounted for, to keep track of how much of the entitlement has been consumed.  That accounting is called charging.  In other words, a vm is entitled to resources, and is charged for consumption of those resources.  The vm in essence prepays for “talk time” with its entitlements.  When it talks, it is charged for that time, and its balance goes down.  If it hasn’t consumed all of its time, it gets priority when it wants to make a call.  The more talk time remaining, the greater its priority.
Simple enough, I suppose, but what happens when a vm has more than a single vCPU to worry about?  This is where co-scheduling comes into play.
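To make the talk-time analogy concrete, here is a minimal sketch of entitlement and charging. It is only an illustration: the World class, the millisecond units, and the pick_next helper are assumptions of mine, not ESX internals, and the real scheduler derives entitlements from reservations, limits, and shares in ways this toy ignores.

```python
from dataclasses import dataclass


@dataclass
class World:
    """Hypothetical stand-in for a schedulable world (not an ESX structure)."""
    name: str
    entitlement_ms: float      # assumed: CPU time the world is entitled to
    charged_ms: float = 0.0    # CPU time already consumed ("charged")

    @property
    def balance_ms(self) -> float:
        # Remaining "talk time": entitlement minus what has been charged.
        return self.entitlement_ms - self.charged_ms

    def charge(self, ran_ms: float) -> None:
        # Account for run time against the entitlement.
        self.charged_ms += ran_ms


def pick_next(worlds):
    """Pick the world with the most unconsumed entitlement."""
    return max(worlds, key=lambda w: w.balance_ms)


worlds = [World("vm-a", 300.0), World("vm-b", 200.0)]

first = pick_next(worlds)    # vm-a: balance 300 vs. 200
first.charge(150.0)          # vm-a runs for 150 ms and is charged for it
second = pick_next(worlds)   # vm-b: balance 200 vs. vm-a's remaining 150
```

After vm-a is charged for its run time, its balance drops below vm-b's, so vm-b gets priority on the next pick.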
Simply put, if you create a virtual machine with more than 1 vCPU, ESX will make every attempt to schedule those worlds to run together.  That is co-scheduling in its simplest terms.  There are additional considerations that have to be made to keep the guest OS happy when you use multiple processors.  The processor access has to be symmetric, hence the Symmetric MultiProcessing in vSMP.  If only one vCPU in a two vCPU configuration is making progress, the difference in progress between the two is called skew.  Now, I say progress, instead of simple runtime, because a vCPU can make progress when it uses CPU, or when it halts or idles.
When skew gets beyond a certain threshold, the guest OS will think one of its processors has died.  If you’re lucky, the OS will mark the processor bad and keep chugging along.  Most of us who have dealt with the physical server world know, though, that when an OS loses a CPU you will usually not be lucky, and the OS will crash and burn.  There’s a reason that almost every admin knows the meaning of a BSOD/PSOD or a kernel panic.  How the skew is managed has changed over ESX scheduler iterations, from Strict, to Relaxed, to “even more Relaxed”.
Strict Co-Scheduling
Strict co-scheduling was employed with ESX 2.x.  Under strict co-scheduling, the skew is cumulative per vCPU of an SMP virtual machine, meaning the skew grows when a vCPU does not make progress relative to any other vCPU in the same vm.  If the skew becomes greater than a set threshold, the entire virtual machine stops processing.  This vm will co-stop and will not be scheduled again, or co-start, until there are enough physical CPUs available to schedule all of the vCPUs simultaneously.  This is an attempt to shrink the skew, and if nothing else, to not allow the skew to grow any larger.  It also ends up causing CPU fragmentation: a 2 vCPU vm cannot run when only 1 pCPU is idle and available, which causes a scheduling delay until a second physical CPU is ready to go.
This “co-stop penalty” is larger for larger SMP virtual machines than for smaller ones, because a greater number of physical CPUs need to be available at the same time.  So, a 2 vCPU virtual machine will typically be scheduled sooner than a 4 vCPU virtual machine.  You can see how large this penalty is with esxtop.  The %CSTP counter under CPU represents the percentage of time the world spent ready to go, but co-descheduled or co-stopped.
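The strict rules above can be sketched in a few lines. This is a toy model under stated assumptions: the threshold value, the tick-based accounting, and the function names are all hypothetical and chosen for illustration, not taken from ESX.

```python
SKEW_THRESHOLD = 3  # hypothetical threshold, in scheduler ticks


def update_skew(skew, progressed):
    """Grow the skew of every vCPU that stalled while a sibling progressed."""
    if not any(progressed):
        return list(skew)  # nobody progressed, so nobody fell behind
    return [s + (0 if p else 1) for s, p in zip(skew, progressed)]


def must_co_stop(skew):
    """Strict model: the whole vm stops once any vCPU's skew passes the threshold."""
    return max(skew) > SKEW_THRESHOLD


def can_co_start_strict(num_vcpus, idle_pcpus):
    """Strict model: every vCPU needs its own idle pCPU at the same time."""
    return idle_pcpus >= num_vcpus


# vCPU 0 keeps running while vCPU 1 makes no progress for four ticks.
skew = [0, 0]
for _ in range(4):
    skew = update_skew(skew, [True, False])

stopped = must_co_stop(skew)                  # skew is [0, 4], over the threshold
fragmented = not can_co_start_strict(2, 1)    # 2 vCPU vm, only 1 idle pCPU
```

The last line is the fragmentation case: one idle pCPU is sitting there, but the 2 vCPU vm cannot use it. The same check shows why a 4 vCPU vm waits longer, since it needs four idle pCPUs at once.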
Relaxed Co-Scheduling
Relaxed co-scheduling came along with ESX 3.x and made things quite a bit better.  Previously, all vCPUs had to be scheduled together.  Under relaxed co-scheduling, once a skew threshold is reached the virtual machine still stops, but it only needs enough pCPUs available to allow the vCPUs with high skew to co-start together.  So, in a 4 vCPU virtual machine, if only 2 vCPUs have accumulated enough skew, only those are required to co-start together, and only 2 pCPUs need to be available for those vCPUs to make progress.  ESX will still make every attempt to co-start all of the vCPUs together, but it is no longer required to do so.  It will take what it can get.
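As a rough sketch of the relaxed rule, the co-start check only has to count the vCPUs whose skew is over the threshold. The threshold value and function names here are hypothetical, and the skew numbers are made up for the 4 vCPU example.

```python
SKEW_THRESHOLD = 3  # hypothetical threshold, in scheduler ticks


def pcpus_needed_relaxed(skew):
    """Relaxed model: one pCPU per vCPU whose skew is over the threshold."""
    return sum(1 for s in skew if s > SKEW_THRESHOLD)


def can_co_start_relaxed(skew, idle_pcpus):
    """The vm may co-start once the skewed vCPUs can all run together."""
    return idle_pcpus >= pcpus_needed_relaxed(skew)


# 4 vCPU vm where only vCPU 1 and vCPU 3 have accumulated enough skew.
skew = [0, 5, 1, 4]
needed = pcpus_needed_relaxed(skew)   # 2, not 4 as under strict co-scheduling
```

With two idle pCPUs this vm can co-start, where the strict check would have kept it waiting for four.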
ESX4 Even More Relaxed Co-Scheduling
ESX4 went even further and changed the skew detection to further reduce time spent de-scheduled or in co-stop.  Instead of a cumulative skew per vCPU, skew is now measured for each vCPU as the difference in progress between that vCPU and the slowest vCPU.  The virtual machine no longer has to co-stop all of its vCPUs and then co-start the skewed vCPUs together.  Instead, if any vCPU individually exceeds a threshold, that vCPU alone is stopped until the slower vCPUs have caught up.  At that point the vCPU can be started again by itself.  Again, ESX will attempt to co-schedule all vCPUs together, but it can now individually stop and start vCPUs as needed.
This gives ESX many more opportunities to schedule both small and large SMP virtual machines than it had previously, and those virtual machines now perform much better than they did before.
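The ESX4 measurement can be sketched like this. Again, this is a toy under assumptions: the progress counters, the threshold value, and the function names are hypothetical and only illustrate the idea of measuring each vCPU against the slowest one.

```python
SKEW_THRESHOLD = 3  # hypothetical threshold, in units of progress


def per_vcpu_skew(progress):
    """Skew of each vCPU is how far it is ahead of the slowest vCPU."""
    slowest = min(progress)
    return [p - slowest for p in progress]


def vcpus_to_stop(progress):
    """Only the vCPUs that are too far ahead are stopped, individually."""
    skews = per_vcpu_skew(progress)
    return [i for i, s in enumerate(skews) if s > SKEW_THRESHOLD]


progress = [10, 6, 5, 9]
skews = per_vcpu_skew(progress)       # [5, 1, 0, 4]
stopped = vcpus_to_stop(progress)     # [0, 3]

# Once the slower vCPUs catch up, nothing needs to stay stopped.
caught_up = vcpus_to_stop([10, 8, 8, 9])   # []
```

vCPUs 1 and 2 keep running the whole time, which is the difference from the earlier schemes, where the whole vm would have stopped.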
Ultimately, remember that there is a penalty to using vSMP.  Not every application benefits from being given an additional processor to run on, and it takes overhead on the part of ESX to track and manage that additional processor.  Balance that cost against the benefit to your application, and use vSMP sparingly.