High availability 编辑

High availability is a set of automatic features designed to plan for, and safely recover from issues which take down Citrix Hypervisor servers or make them unreachable. For example, during physically disrupted networking or host hardware failures.

Overview

High availability ensures that when a host becomes unreachable or unstable, VMs running on that host are shut down and restarted on another host. Shutting down and restarting VMs on another host avoids the VMs being started (manually or automatically) on a new host. At some point later, the original host is recovered. This scenario can lead two instances of the same VM running on different hosts, and a corresponding high probability of VM disk corruption and data loss.

When the pool coordinator becomes unreachable or unstable, high availability can also recover administrative control of a pool. High availability ensures that administrative control is restored automatically without any manual intervention.

Optionally, high availability can also automate the process of restarting VMs on hosts which are known to be in a good state without manual intervention. These VMs can be scheduled for restart in groups to allow time to start services. It allows infrastructure VMs to be started before their dependent VMs (For example, a DHCP server before its dependent SQL server).

Warnings:

Use high availability along with multipathed storage and bonded networking. Configure multipathed storage and bonded networking before attempting to set up high availability. Customers who do not set up multipathed storage and bonded networking can see unexpected host reboot behavior (Self-Fencing) when there is an infrastructure instability.

All graphics solutions (NVIDIA vGPU, Intel GVT-d, Intel GVT-G, and vGPU pass-through) can be used in an environment that uses high availability. However, VMs that use these graphics solutions cannot be protected with high availability. These VMs can be restarted on a best-effort basis while there are hosts with the appropriate free resources.

Overcommitting

A pool is overcommitted when the VMs that are currently running cannot be restarted elsewhere following a user-defined number of host failures.

Overcommitting can happen if there is not enough free memory across the pool to run those VMs following a failure. However, there are also more subtle changes which can make high availability guarantees unsustainable: Changes to Virtual Block Devices (VBDs) and networks can affect which VMs can be restarted on which hosts. Citrix Hypervisor cannot check all potential actions and determine if they cause violation of high availability demands. However, an asynchronous notification is sent if high availability becomes unsustainable.

Citrix Hypervisor dynamically maintains a failover plan which details what to do when a set of hosts in a pool fail at any given time. An important concept to understand is the host failures to tolerate value, which is defined as part of high availability configuration. The value of host failures to tolerate determines the number of failures that is allowed without any loss of service. For example, consider a resource pool that consists of 64 hosts and the tolerated failures is set to 3. In this case, the pool calculates a failover plan that allows any three hosts to fail and restart the VMs on other hosts. If a plan cannot be found, then the pool is considered to be overcommitted. The plan is dynamically recalculated based on VM lifecycle operations and movement. If changes, for example the addition on new VMs to the pool, cause your pool to become overcommitted, alerts are sent (either through Citrix Hypervisor Center or email).

Overcommitment warning

If any attempts to start or resume a VM cause the pool to be overcommitted, a warning alert is displayed. This warning appears in Citrix Hypervisor Center and is also available as a message instance through the management API. If you have configured an email address, a message can also be sent to the email address. You can then cancel the operation, or proceed anyway. Proceeding causes the pool to become overcommitted. The amount of memory used by VMs of different priorities is displayed at the pool and host levels.

Host fencing

Sometimes, a server can fail due to the loss of network connectivity or when a problem with the control stack is encountered. In such cases, the Citrix Hypervisor server self-fences to ensure the VMs are not running on two servers simultaneously. When a fence action is taken, the server restarts immediately and abruptly, causing all VMs running on it to be stopped. The other servers detect that the VMs are no longer running and the VMs are restarted according to the restart priorities assigned to them. The fenced server enters a reboot sequence, and when it has restarted it tries to rejoin the resource pool.

Note:

Hosts in clustered pools can also self-fence when they cannot communicate with more than half the other hosts in the resource pool. For more information, see Clustered pools.

Configuration requirements

To use the high availability feature, you need:

  • Citrix Hypervisor pool (this feature provides high availability at the server level within a single resource pool).

    Note:

    We recommend that you enable high availability only in pools that contain at least three Citrix Hypervisor servers. For more information, see CTX129721 - High Availability Behavior When the Heartbeat is Lost in a Pool.

  • Shared storage, including at least one iSCSI, NFS, or Fibre Channel LUN of size 356 MB or greater - the heartbeat SR. The high availability mechanism creates two volumes on the heartbeat SR:

    4 MB heartbeat volume: Used to provide a heartbeat.

    256 MB metadata volume: To store pool coordinator metadata to be used if there is a pool coordinator failover.

    Notes:

    For maximum reliability, we recommend that you use a dedicated NFS or iSCSI storage repository as your high availability heartbeat disk. Do not use this storage repository for any other purpose.

    If your pool is a clustered pool, your heartbeat SR must be a GFS2 SR.

    Storage attached using either SMB or iSCSI when authenticated using CHAP cannot be used as the heartbeat SR.

    When using a NetApp or EqualLogic SR, manually provision an NFS or iSCSI LUN on the array to use as the heartbeat SR.

  • Static IP addresses for all hosts.

    Warning:

    If the IP address of a server changes while high availability is enabled, high availability assumes that the host’s network has failed. The change in IP address can fence the host and leave it in an unbootable state. To remedy this situation, disable high availability using the host-emergency-ha-disable command, reset the pool coordinator using pool-emergency-reset-master, and then re-enable high availability.

  • For maximum reliability, we recommend that you use a dedicated bonded interface as the high availability management network.

For a VM to be protected by high availability, it must be agile. It means the VM:

  • Must have its virtual disks on shared storage. You can use any type of shared storage. iSCSI, NFS, or Fibre Channel LUN is only required for the storage heartbeat and can be used for virtual disk storage.

  • Can use live migration

  • Does not have a connection to a local DVD drive configured

  • Has its virtual network interfaces on pool-wide networks

Note:

When high availability is enabled, we strongly recommend using a bonded management interface on the servers in the pool and multipathed storage for the heartbeat SR.

If you create VLANs and bonded interfaces from the CLI, then they might not be plugged in and active despite being created. In this situation, a VM can appear to be not agile and it is not protected by high availability. You can use the CLI pif-plug command to bring up the VLAN and bond PIFs so that the VM can become agile. You can also determine precisely why a VM is not agile by using the xe diagnostic-vm-status CLI command. This command analyzes its placement constraints, and you can take remedial action if necessary.

Restart configuration settings

Virtual machines can be considered protected, best-effort, or unprotected by high availability. The value of ha-restart-priority defines whether a VM is treated as protected, best-effort, or unprotected. The restart behavior for VMs in each of these categories is different.

Protected

High availability guarantees to restart a protected VM that goes offline or whose host goes offline, provided the pool isn’t overcommitted and the VM is agile.

If a protected VM cannot be restarted when a server fails, high availability attempts to start the VM when there is extra capacity in a pool. Attempts to start the VM when there is extra capacity might now succeed.

ha-restart-priority Value: restart

Best-effort

If the host of a best-effort VM goes offline, high availability attempts to restart the best-effort VM on another host. It makes this attempt only after all protected VMs have been successfully restarted. High availability makes only one attempt to restart a best-effort VM. If this attempt fails, high availability does not make further attempts to restart the VM.

ha-restart-priority Value: best-effort

Unprotected

If an unprotected VM or the host it runs on is stopped, high availability does not attempt to restart the VM.

ha-restart-priority Value: Value is an empty string

Note:

High availability never stops or migrates a running VM to free resources for a protected or best-effort VM to be restarted.

If the pool experiences server failures and the number of tolerable failures drops to zero, the protected VMs are not guaranteed to restart. In such cases, a system alert is generated. If another failure occurs, all VMs that have a restart priority set behave according to the best-effort behavior.

Start order

The start order is the order in which Citrix Hypervisor high availability attempts to restart protected VMs when a failure occurs. The values of the order property for each of the protected VMs determines the start order.

The order property of a VM is used by high availability and also by other features that start and shut down VMs. Any VM can have the order property set, not just the VMs marked as protected for high availability. However, high availability uses the order property for protected VMs only.

The value of the order property is an integer. The default value is 0, which is the highest priority. Protected VMs with an order value of 0 are restarted first by high availability. The higher the value of the order property, the later in the sequence the VM is restarted.

You can set the value of the order property of a VM by using the command-line interface:

xe vm-param-set uuid=VM_UUID order=int
<!--NeedCopy-->

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据

词条统计

浏览:48 次

字数:13342

最后编辑:8年前

编辑次数:0 次

    我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
    原文