Sunday, 22 June 2014

Reliable Memory in vSphere 5.5

This is a patented new technology from Dell in which the hypervisor and the system hardware work together to place the hypervisor in a more redundant section of memory. Dell servers have shipped with a variety of tricks to protect against memory faults, such as Memory Page Retire, which dynamically removes a page from usable memory if it encounters an error. To get better reliability than that, however, you had to enable the memory mirroring options in the BIOS.
Of course, memory mirroring is just like RAID 1 on disk: you get half the usable space. And on these E5-2600s there aren’t a ton of DIMM sockets to start with (and right now 32 GB DIMMs are 4x the price of 16 GB DIMMs), so RAM capacity is at a premium even without mirroring. Reliable Memory Technology essentially mirrors just a part of the memory address space and places the hypervisor and all its processes there, so that even if RAM errors elsewhere take VMs down, the hypervisor stays up. Think “controlled emergency landing” and not “crash landing.” And you don’t lose half your RAM.
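To put rough numbers on that trade-off, here is a minimal sketch of the capacity arithmetic. The DIMM count, DIMM size, and the size of the mirrored region are all hypothetical values picked for illustration, not figures from Dell or VMware.

```python
GIB = 1024 ** 3

def usable_memory(total_bytes, mirrored_bytes):
    """Usable bytes when `mirrored_bytes` of usable space has a second copy.

    Every mirrored byte consumes two bytes of physical RAM, so the usable
    total shrinks by exactly the size of the mirrored region.
    """
    if mirrored_bytes * 2 > total_bytes:
        raise ValueError("mirrored region cannot exceed half of physical RAM")
    return total_bytes - mirrored_bytes

# Hypothetical host: 16 DIMMs of 16 GiB each = 256 GiB physical.
total = 16 * 16 * GIB

# Full mirroring (the RAID 1 case): half of physical RAM is the copy.
full_mirror = usable_memory(total, total // 2)

# Partial mirroring: assume only an 8 GiB region for the hypervisor is mirrored.
partial_mirror = usable_memory(total, 8 * GIB)

print(full_mirror // GIB, "GiB usable with full mirroring")
print(partial_mirror // GIB, "GiB usable with partial mirroring")
```

With these made-up numbers, full mirroring leaves 128 GiB usable while mirroring only the 8 GiB region leaves 248 GiB.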
Anyhow, that’s pretty cool, especially since it’s retroactive to all the 12th generation servers. 
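If you want to check whether a host actually exposes a reliable memory region, the `esxcli hardware memory get` command in vSphere 5.5 reports it alongside physical memory. Below is a small sketch that parses that kind of report. The sample text is illustrative only and was not captured from a real host, so treat the exact field layout as an assumption; `parse_memory_report` is a hypothetical helper name.

```python
import re

# Illustrative sample only; assumed layout of `esxcli hardware memory get` output.
SAMPLE_OUTPUT = """\
   Physical Memory: 137402052608 Bytes
   Reliable Memory: 0 Bytes
   NUMA Node Count: 2
"""

def parse_memory_report(text):
    """Turn 'Name: <integer> [unit]' lines into a {name: int} dictionary."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(\d+)", line)
        if match:
            fields[match.group(1).strip()] = int(match.group(2))
    return fields

report = parse_memory_report(SAMPLE_OUTPUT)

# A non-zero value would mean the hardware exposes a reliable region.
has_reliable_memory = report.get("Reliable Memory", 0) > 0
print("Reliable memory available:", has_reliable_memory)
```

In this sample the reliable memory figure is zero, which is what you would expect on hardware without the feature or with it disabled.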

Memory reliability, also known as error isolation, allows ESXi to stop using parts of memory when it determines that a failure might occur, as well as when a failure did occur.

When enough corrected errors are reported at a particular address, ESXi stops using this address to prevent the corrected error from becoming an uncorrected error.
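The idea is easy to picture as a per-page error counter with a cutoff. The sketch below is a toy model of that behaviour, not how the VMkernel actually implements it: the class name, the method names, and the threshold of 3 are all hypothetical, since the real threshold is not stated here.

```python
from collections import Counter

class PageRetirementTracker:
    """Toy model: retire a page once it has reported enough corrected errors."""

    def __init__(self, threshold=3):  # hypothetical threshold
        self.threshold = threshold
        self.corrected = Counter()    # corrected-error count per page address
        self.retired = set()          # pages that are no longer used

    def report_corrected_error(self, page):
        """Record a corrected error; return True if the page is now retired."""
        if page not in self.retired:
            self.corrected[page] += 1
            if self.corrected[page] >= self.threshold:
                self.retired.add(page)
        return page in self.retired

    def report_uncorrected_error(self, page):
        """An uncorrected error retires the page immediately."""
        self.retired.add(page)

    def is_usable(self, page):
        return page not in self.retired

tracker = PageRetirementTracker(threshold=3)
results = [tracker.report_corrected_error(0x1000) for _ in range(3)]
print(results)                    # third corrected error crosses the threshold
print(tracker.is_usable(0x1000))  # the page is retired, so no longer usable
```

The point of retiring on corrected errors is to act while the data is still intact, before the same location produces an error that cannot be corrected.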

Memory reliability provides better VMkernel reliability despite corrected and uncorrected errors in RAM. It also enables the system to avoid using memory pages that might contain errors.
The procedure given below applies only when the host is not running on hardware that supports Reliable Memory.
Correct an Error Isolation Notification
With memory reliability, VMkernel stops using pages that receive an error isolation notification.
The user receives an event in the vSphere Client when VMkernel recovers from an uncorrectable memory error, when VMkernel retires a significant percentage of system memory due to a large number of correctable errors, or when a large number of pages cannot be retired.
  Vacate the host.
  Migrate the virtual machines.
  Run memory-related hardware tests.