Proportional Share
ESX
uses a proportional share algorithm to schedule virtual machine worlds to run
on physical CPUs. A world is more or less a process. Thanks to all
of the resource management capabilities that ESX provides, it is not always so
easy to decide who gets to run first, and for how long. Typically, if all things are equal, you would just create a list and have every process execute in turn. Well, reservations, limits, and shares add another dimension to that decision. So, ESX factors in that information and
creates something called an entitlement. With that, you now have a world,
and a world has an entitlement. If a world is entitled to a resource, and
has not yet consumed it, then it now has priority to be scheduled to run. As
and when that world runs, its run time has to be accounted for, so as to keep track of how much of the entitlement has been consumed. That accounting is called charging. In other words, a vm is entitled to resources, and is charged
for consumption of those resources. The vm in essence prepays for “talk
time” with its entitlements. As and when it talks, it is charged for that
time, and its balance goes down. If it hasn’t consumed all of its time,
it gets priority when it wants to make a call. The higher the talk time
remaining, the greater its priority.
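To make that prepaid talk time idea a bit more concrete, here is a minimal Python sketch of the priority rule, under some loud assumptions: the World class, the quantum, and the numbers are all made up for illustration, and a real ESX entitlement also factors in shares, reservations, and limits, which this toy model skips entirely.

from dataclasses import dataclass

@dataclass
class World:
    """Illustrative stand-in for an ESX world (roughly, a process)."""
    name: str
    entitlement: float    # CPU time the world is entitled to, in ms (made up)
    charged: float = 0.0  # CPU time already consumed and charged against it

    @property
    def balance(self) -> float:
        # Remaining "talk time": entitlement minus what has been charged.
        return self.entitlement - self.charged

def pick_next(worlds: list[World]) -> World:
    # Proportional-share idea: the world with the most unconsumed
    # entitlement gets priority to run next.
    return max(worlds, key=lambda w: w.balance)

def run(world: World, quantum: float) -> None:
    # Charging: account for the time the world actually ran.
    world.charged += quantum

worlds = [World("vm-a", entitlement=100.0), World("vm-b", entitlement=50.0)]
for _ in range(6):
    w = pick_next(worlds)
    run(w, quantum=10.0)
    print(w.name, "ran; balance now", w.balance)

Run it and you can watch vm-a, with the larger entitlement, get picked repeatedly until its remaining balance drops to vm-b's level, which is the proportional share behavior in miniature.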
Simple
enough, I suppose, but what happens when a vm has more than a single vCPU to
worry about? This is where co-scheduling comes into play.
Co-Scheduling
Simply
put, if you create a virtual machine with more than 1 vCPU, ESX will make every
attempt to schedule those worlds to run together. That is co-scheduling
in its simplest terms. There are additional considerations that have to
be made to keep the guest OS happy when you use multiple processors. The
processor access has to be symmetric, hence the Symmetric MultiProcessing in
vSMP. If you have a single vCPU making progress out of a two vCPU
configuration, that difference in progress is called skew. Now, I say
progress, instead of simple runtime, because a vCPU can make progress when it
uses CPU, or when it halts or idles.
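Since progress and skew come up again and again below, here is a tiny Python sketch of how you could track them under those definitions. The names, states, and numbers are illustrative, not ESX internals.

# Progress accrues while a vCPU runs or idles/halts; it does not accrue
# while the vCPU is ready but de-scheduled. All values here are made up.
progress = {"vcpu0": 0.0, "vcpu1": 0.0}   # accumulated progress, in ms

def account(vcpu: str, delta: float, state: str) -> None:
    if state in ("run", "idle"):
        progress[vcpu] += delta

def skew(vcpu: str) -> float:
    # Skew: how far this vCPU's progress has pulled ahead of its
    # slowest sibling in the same vm.
    return progress[vcpu] - min(progress.values())

account("vcpu0", 10.0, "run")     # vcpu0 makes progress
account("vcpu1", 10.0, "ready")   # vcpu1 waits to be scheduled: no progress
print(skew("vcpu0"))              # 10.0 ms of skew has built up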
When
skew gets beyond a certain threshold, the guest OS will think one of its processors has died. If you’re lucky, the OS will mark the processor bad, and keep chugging along. If you’re not lucky (and most of us who have dealt with the physical server world know that when an OS loses a CPU, you will not be lucky), the OS will crash and burn. There’s a reason that
most every admin knows the meaning of a BSOD/PSOD or a kernel panic. How
the skew is managed has changed over ESX scheduler iterations, from Strict, to
Relaxed, to “even more Relaxed.”
Strict Co-Scheduling
Strict
co-scheduling was employed with ESX 2.x. Under strict co-scheduling, the
skew is cumulative per vCPU of an SMP virtual machine, meaning the skew
grows when a vCPU does not make progress relative to any other vCPU in the same
vm. If the skew becomes greater than a set threshold, the entire virtual
machine stops processing. This vm will co-stop and will not be scheduled
again, or co-start, until there are enough physical CPUs available to schedule
all of the vCPUs simultaneously. This is an attempt to shrink the skew,
and if nothing else, to not allow the skew to grow any larger. This also
ends up causing CPU fragmentation, where a 2 vCPU vm cannot run because only 1 pCPU is idle and available to process, causing a scheduling delay until a second physical CPU is ready to go.
This
“co-stop penalty” is larger for larger SMP virtual machines than for smaller ones,
because a greater number of physical CPUs need to be ready to run. So, a
2 vSMP virtual machine will be scheduled faster than a 4 vSMP virtual
machine. You can see how large this penalty is with esxtop. The
%CSTP counter under CPU represents the % of time the world spent ready to go,
but co-descheduled or co-stopped.
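A rough Python sketch of the strict behavior, with a made-up threshold and a toy VM model, might look like this (nothing here reflects actual ESX code or its real skew limit):

SKEW_LIMIT_MS = 1.5   # made-up threshold

class VM:
    def __init__(self, vcpus: int):
        self.progress = [0.0] * vcpus   # per-vCPU progress, in ms
        self.co_stopped = False

    def max_skew(self) -> float:
        return max(self.progress) - min(self.progress)

    def check_skew(self) -> None:
        # Strict rule: if the skew crosses the threshold, the entire vm co-stops.
        if self.max_skew() > SKEW_LIMIT_MS:
            self.co_stopped = True

    def try_co_start(self, idle_pcpus: int) -> bool:
        # The vm may only co-start when enough physical CPUs are free to run
        # every vCPU at once -- which is where CPU fragmentation comes from.
        if self.co_stopped and idle_pcpus >= len(self.progress):
            self.co_stopped = False
        return not self.co_stopped

vm = VM(vcpus=2)
vm.progress = [5.0, 3.0]     # vCPU0 has pulled 2 ms ahead of vCPU1
vm.check_skew()
print(vm.co_stopped)         # True: 2.0 > 1.5, the whole vm co-stops
print(vm.try_co_start(1))    # False: only 1 pCPU idle, 2 are needed
print(vm.try_co_start(2))    # True: both vCPUs can co-start together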
Relaxed Co-Scheduling
Then
along came relaxed co-scheduling with ESX 3.x and made things quite a bit
better. Whereas previously, all vCPUs had to be scheduled together, under
relaxed co-scheduling, once a skew threshold is reached, the virtual machine
stops, but now only has to have enough pCPUs available to allow the vCPUs with
high skew to co-start together. So, in a 4 vCPU virtual machine, if only
2 vCPUs had enough skew, only those are required to co-start together, so only
2 pCPUs need to be available for those vCPUs to make progress. ESX will
still make every attempt to co-start all of the vCPUs together, but it is no
longer required to do so. It will take what it can get.
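In the same toy model, the relaxed rule only needs enough idle pCPUs for the lagging vCPUs, not all of them. Again, the threshold and numbers are invented for illustration:

SKEW_LIMIT_MS = 1.5

def lagging_vcpus(progress: list[float]) -> list[int]:
    # Skew here: how far each vCPU trails the furthest-ahead sibling.
    lead = max(progress)
    return [i for i, p in enumerate(progress) if lead - p > SKEW_LIMIT_MS]

def can_co_start(progress: list[float], idle_pcpus: int) -> bool:
    # Relaxed rule: only the skewed vCPUs must start together, so the vm
    # only needs that many idle pCPUs (strict would need one per vCPU).
    return idle_pcpus >= len(lagging_vcpus(progress))

progress = [10.0, 10.0, 4.0, 4.0]            # 4 vCPU vm, two vCPUs fell behind
print(lagging_vcpus(progress))               # [2, 3]
print(can_co_start(progress, idle_pcpus=2))  # True: 2 idle pCPUs are enough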
ESX4 Even More Relaxed Co-Scheduling
ESX4
went even further and changed the skew detection to further reduce time spent
de-scheduled or in co-stop. Instead of a cumulative skew per vCPU, skew is now measured for each vCPU as the difference in progress between that vCPU and the slowest vCPU. The virtual machine in this case does not have to co-stop all of its vCPUs and co-start the skewed vCPUs together. Instead, if any vCPU individually exceeds the threshold, that vCPU alone will be stopped until the slower vCPUs have caught up, and it can then be started by itself. Again, ESX will attempt to co-schedule all vCPUs
together, but it can now individually stop and start vCPUs as needed.
This
gives ESX many more opportunities to schedule small and large SMP virtual machines than it had previously, and those virtual machines now perform much
better than they had before.
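Sticking with the same toy model, the ESX4 refinement could be sketched like this: skew is tracked per vCPU against the slowest sibling, and only the vCPU that runs too far ahead is stopped on its own (threshold and numbers are still invented):

SKEW_LIMIT_MS = 1.5

def per_vcpu_skew(progress: list[float]) -> list[float]:
    # Each vCPU's skew is its lead over the slowest vCPU in the vm.
    slowest = min(progress)
    return [p - slowest for p in progress]

def vcpus_to_stop(progress: list[float]) -> list[int]:
    # Only the individual vCPUs that got too far ahead are de-scheduled,
    # letting the slow ones catch up; no whole-vm co-stop is needed.
    return [i for i, s in enumerate(per_vcpu_skew(progress)) if s > SKEW_LIMIT_MS]

progress = [9.0, 5.0, 5.5]
print(per_vcpu_skew(progress))   # [4.0, 0.0, 0.5]
print(vcpus_to_stop(progress))   # [0]: only vCPU0 stops until the others catch up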
Ultimately,
remember that there is a penalty to using vSMP. Not every application
benefits from being given an additional processor to run on, and it takes
overhead on the part of ESX to track and manage that additional
processor. Balance that cost with the benefit to your application, and
use vSMP sparingly.