High availability
High availability is a set of automatic features designed to plan for, and safely recover from, issues that take down Citrix Hypervisor servers or make them unreachable. For example, physically disrupted networking or host hardware failures.
Overview
High availability ensures that when a host becomes unreachable or unstable, the VMs running on that host are safely shut down and restarted on another host. This safe shutdown and restart prevents the VMs from being started (manually or automatically) on a new host while still running on the original host. If the original host later recovers with its VMs still running, two instances of the same VM could be running on different hosts, with a corresponding high probability of VM disk corruption and data loss.
High availability can also recover administrative control of a pool when the pool coordinator becomes unreachable or unstable, restoring control automatically without any manual intervention.
Optionally, high availability can also automate the process of restarting VMs on hosts that are known to be in a good state, without manual intervention. These VMs can be scheduled for restart in groups to allow time for services to start. This allows infrastructure VMs to be started before their dependent VMs (for example, a DHCP server before its dependent SQL server).
Warnings:
Use high availability along with multipathed storage and bonded networking. Configure multipathed storage and bonded networking before attempting to set up high availability. Customers who do not set up multipathed storage and bonded networking can see unexpected host reboot behavior (self-fencing) when there is infrastructure instability.
All graphics solutions (NVIDIA vGPU, Intel GVT-d, Intel GVT-G, and vGPU pass-through) can be used in an environment that uses high availability. However, VMs that use these graphics solutions cannot be protected with high availability. These VMs can be restarted on a best-effort basis while there are hosts with the appropriate free resources.
Overcommitting
A pool is overcommitted when the VMs that are currently running cannot be restarted elsewhere following a user-defined number of host failures.
Overcommitting can happen if there is not enough free memory across the pool to run those VMs following a failure. However, there are also more subtle changes that can make high availability guarantees unsustainable: changes to Virtual Block Devices (VBDs) and networks can affect which VMs can be restarted on which hosts. Citrix Hypervisor cannot check all potential actions and determine whether they violate high availability demands. However, an asynchronous notification is sent if high availability becomes unsustainable.
Citrix Hypervisor dynamically maintains a failover plan which details what to do when a set of hosts in a pool fail at any given time. An important concept to understand is the host failures to tolerate value, which is defined as part of the high availability configuration. The value of host failures to tolerate determines the number of failures that are allowed without any loss of service. For example, consider a resource pool that consists of 64 hosts with host failures to tolerate set to 3. In this case, the pool calculates a failover plan that allows any three hosts to fail and their VMs to be restarted on other hosts. If a plan cannot be found, the pool is considered to be overcommitted. The plan is dynamically recalculated based on VM lifecycle operations and movement. If changes, for example the addition of new VMs to the pool, cause your pool to become overcommitted, alerts are sent (either through Citrix Hypervisor Center or email).
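As a sketch of how this is driven from the CLI (the pool UUID below is a placeholder), you can ask the pool for the largest number of host failures for which a plan currently exists, and then set the tolerated value:

```
# Compute the maximum number of host failures the current plan can tolerate
xe pool-ha-compute-max-host-failures-to-tolerate

# Set the number of host failures to tolerate (placeholder pool UUID)
xe pool-param-set uuid=<pool-uuid> ha-host-failures-to-tolerate=3
```

If the value you set exceeds the number of failures the computed plan can cover, the pool is considered overcommitted.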
Overcommitment warning
If any attempts to start or resume a VM cause the pool to be overcommitted, a warning alert is displayed. This warning appears in Citrix Hypervisor Center and is also available as a message instance through the management API. If you have configured an email address, a message can also be sent to the email address. You can then cancel the operation, or proceed anyway. Proceeding causes the pool to become overcommitted. The amount of memory used by VMs of different priorities is displayed at the pool and host levels.
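From the CLI, these alerts can be reviewed as message instances; a minimal sketch:

```
# List message instances generated for the pool,
# including overcommitment alerts
xe message-list
```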
Host fencing
Sometimes, a server can fail due to the loss of network connectivity or when a problem with the control stack is encountered. In such cases, the Citrix Hypervisor server self-fences to ensure the VMs are not running on two servers simultaneously. When a fence action is taken, the server restarts immediately and abruptly, causing all VMs running on it to be stopped. The other servers detect that the VMs are no longer running and the VMs are restarted according to the restart priorities assigned to them. The fenced server enters a reboot sequence, and when it has restarted it tries to rejoin the resource pool.
Note:
Hosts in clustered pools can also self-fence when they cannot communicate with more than half the other hosts in the resource pool. For more information, see Clustered pools.
Configuration requirements
To use the high availability feature, you need:
Citrix Hypervisor pool (this feature provides high availability at the server level within a single resource pool).
Note:
We recommend that you enable high availability only in pools that contain at least three Citrix Hypervisor servers. For more information, see CTX129721 - High Availability Behavior When the Heartbeat is Lost in a Pool.
Shared storage, including at least one iSCSI, NFS, or Fibre Channel LUN of size 356 MB or greater - the heartbeat SR. The high availability mechanism creates two volumes on the heartbeat SR:
4 MB heartbeat volume: Used to provide a heartbeat.
256 MB metadata volume: To store pool coordinator metadata to be used if there is a pool coordinator failover.
Notes:
For maximum reliability, we recommend that you use a dedicated NFS or iSCSI storage repository as your high availability heartbeat disk. Do not use this storage repository for any other purpose.
If your pool is a clustered pool, your heartbeat SR must be a GFS2 SR.
Storage attached using either SMB or iSCSI when authenticated using CHAP cannot be used as the heartbeat SR.
When using a NetApp or EqualLogic SR, manually provision an NFS or iSCSI LUN on the array to use as the heartbeat SR.
Static IP addresses for all hosts.
Warning:
If the IP address of a server changes while high availability is enabled, high availability assumes that the host's network has failed. The change in IP address can fence the host and leave it in an unbootable state. To remedy this situation, disable high availability using the host-emergency-ha-disable command, reset the pool coordinator using the pool-emergency-reset-master command, and then re-enable high availability.

For maximum reliability, we recommend that you use a dedicated bonded interface as the high availability management network.
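The recovery sequence described in the warning can be sketched as follows (addresses and UUIDs are placeholders):

```
# On the affected host, disable high availability locally
xe host-emergency-ha-disable force=true

# Point a member host at the new pool coordinator
xe pool-emergency-reset-master master-address=<coordinator-address>

# When the pool is healthy again, re-enable high availability
xe pool-ha-enable heartbeat-sr-uuids=<heartbeat-sr-uuid>
```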
For a VM to be protected by high availability, it must be agile. This means the VM:
Must have its virtual disks on shared storage. You can use any type of shared storage. An iSCSI, NFS, or Fibre Channel LUN is only required for the storage heartbeat; it can also be used for virtual disk storage if you prefer.
Can use live migration
Does not have a connection to a local DVD drive configured
Has its virtual network interfaces on pool-wide networks
Note:
When high availability is enabled, we strongly recommend using a bonded management interface on the servers in the pool and multipathed storage for the heartbeat SR.
If you create VLANs and bonded interfaces from the CLI, they might not be plugged in and active despite being created. In this situation, a VM can appear not to be agile and is not protected by high availability. You can use the CLI xe pif-plug command to bring up the VLAN and bond PIFs so that the VM can become agile. You can also determine precisely why a VM is not agile by using the xe diagnostic-vm-status CLI command. This command analyzes the VM's placement constraints, and you can take remedial action if necessary.
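For example (UUIDs are placeholders):

```
# Bring up a VLAN or bond PIF that was created but not plugged
xe pif-plug uuid=<pif-uuid>

# Analyze the placement constraints of a VM that appears not to be agile
xe diagnostic-vm-status uuid=<vm-uuid>
```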
Restart configuration settings
Virtual machines can be considered protected, best-effort, or unprotected by high availability. The value of ha-restart-priority defines whether a VM is treated as protected, best-effort, or unprotected. The restart behavior for VMs in each of these categories is different.
Protected
High availability guarantees to restart a protected VM that goes offline or whose host goes offline, provided the pool isn’t overcommitted and the VM is agile.
If a protected VM cannot be restarted when a server fails (for example, if the pool was overcommitted at the time of the failure), high availability attempts to start the VM again when extra capacity becomes available in the pool. An attempt that previously failed might now succeed.
ha-restart-priority value: restart
Best-effort
If the host of a best-effort VM goes offline, high availability attempts to restart the best-effort VM on another host. It makes this attempt only after all protected VMs have been successfully restarted. High availability makes only one attempt to restart a best-effort VM. If this attempt fails, high availability does not make further attempts to restart the VM.
ha-restart-priority value: best-effort
Unprotected
If an unprotected VM or the host it runs on is stopped, high availability does not attempt to restart the VM.
ha-restart-priority value: an empty string
Note:
High availability never stops or migrates a running VM to free resources for a protected or best-effort VM to be restarted.
If the pool experiences server failures and the number of tolerable failures drops to zero, the protected VMs are not guaranteed to restart. In such cases, a system alert is generated. If another failure occurs, all VMs that have a restart priority set behave according to the best-effort behavior.
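The ha-restart-priority value described above is set per VM from the CLI; for example (VM UUIDs are placeholders):

```
# Protected: high availability guarantees a restart
xe vm-param-set uuid=<vm-uuid> ha-restart-priority=restart

# Best-effort: one restart attempt, after all protected VMs
xe vm-param-set uuid=<vm-uuid> ha-restart-priority=best-effort

# Unprotected: an empty string disables restart attempts
xe vm-param-set uuid=<vm-uuid> ha-restart-priority=""
```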
Start order
The start order is the order in which Citrix Hypervisor high availability attempts to restart protected VMs when a failure occurs. The value of the order property for each of the protected VMs determines the start order.

The order property of a VM is used by high availability and also by other features that start and shut down VMs. Any VM can have the order property set, not just the VMs marked as protected for high availability. However, high availability uses the order property for protected VMs only.

The value of the order property is an integer. The default value is 0, which is the highest priority. Protected VMs with an order value of 0 are restarted first by high availability. The higher the value of the order property, the later in the sequence the VM is restarted.
You can set the value of the order property of a VM by using the command-line interface:
xe vm-param-set uuid=VM_UUID order=int
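For example, to restart a hypothetical DHCP server VM before a VM that depends on it, give the infrastructure VM the lowest order value and verify the setting (UUIDs are placeholders):

```
# Infrastructure VM: restarted first
xe vm-param-set uuid=<dhcp-vm-uuid> order=0

# Dependent VM: restarted later in the sequence
xe vm-param-set uuid=<sql-vm-uuid> order=1

# Verify the setting
xe vm-param-get uuid=<dhcp-vm-uuid> param-name=order
```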