
Version 9.7 of the vCloud Director (VCD) appliance introduces an embedded PostgreSQL database in a High Availability (HA) configuration.

This article describes how to recover from a primary or standby database failure in a High Availability Cluster.

1. Desired State of the High Availability Cluster
The current state of the cluster can be checked on the “vCloud Director Appliance Management” page, which is available at https://primary_cell_ip_address:5480.

If you are able to log in to the console of a cell, you can also get the current state with the following command.

sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show
 ID    | Name  | Role    | Status    | Upstream | Location | Connection string
-------+-------+---------+-----------+----------+----------+-----------------------------------------------
 3889  | vcd04 | standby |   running | vcd02    | default  | host=192.168.194.14 user=repmgr dbname=repmgr
 5066  | vcd01 | standby |   running | vcd02    | default  | host=192.168.194.11 user=repmgr dbname=repmgr
 28278 | vcd02 | primary | * running |          | default  | host=192.168.194.12 user=repmgr dbname=repmgr
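
If you want to check the desired state from a script instead of reading the table by eye, a small helper along the following lines might be enough. This is only a sketch: the function name `cluster_is_healthy` is my own, and it assumes that "healthy" means exactly one primary marked `* running`, at least one standby marked `running`, and no node in any other state, based on the output format shown above.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not part of the appliance):
# reads the table printed by "repmgr cluster show" on stdin and
# returns exit status 0 only if the cluster looks like the desired state.
cluster_is_healthy() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Data rows are the ones with a numeric node ID in the first column;
        # this skips the header and the separator line.
        trim($1) ~ /^[0-9]+$/ {
            role = trim($3); status = trim($4)
            if (role == "primary" && status == "* running") primaries++
            else if (role == "standby" && status == "running") standbys++
            else bad++
        }
        END {
            # Assumption: one running primary, at least one running standby,
            # and nothing failed or unreachable.
            ok = (primaries == 1 && standbys >= 1 && bad + 0 == 0)
            exit(ok ? 0 : 1)
        }
    '
}

# Possible usage on a cell console:
#   sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show \
#       | cluster_is_healthy && echo "HA cluster is in the desired state"
```

The helper only inspects text, so it can be tried safely against saved output before pointing it at a live cell.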

 

2. Example with failed components of the HA Cluster
Overview from the “vCloud Director Appliance Management” page:

Output from the cell console:

sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show
 ID    | Name  | Role    | Status        | Upstream | Location | Connection string
-------+-------+---------+---------------+----------+----------+-----------------------------------------------
 3889  | vcd04 | standby |   running     | vcd02    | default  | host=192.168.194.14 user=repmgr dbname=repmgr
 14714 | vcd01 | primary | - failed      |          | default  | host=192.168.194.11 user=repmgr dbname=repmgr
 27985 | vcd03 | standby | ? unreachable | vcd01    | default  | host=192.168.194.13 user=repmgr dbname=repmgr

As the example above shows, the primary node and one standby node of the HA cluster have failed.
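
To pull just the problem nodes out of such output, something like the following could help. Again this is a sketch under assumptions: `failed_nodes` is a name I made up, and the marker characters it strips (`*`, `-`, `?`) are simply the ones visible in the outputs in this article.

```shell
#!/bin/sh
# Sketch (hypothetical helper): reads "repmgr cluster show" output on stdin
# and prints "ID name role state" for every node that is not running.
failed_nodes() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Only look at data rows (numeric node ID in the first column).
        trim($1) ~ /^[0-9]+$/ {
            status = trim($4)
            # Strip the leading marker seen in the examples ("*", "-" or "?").
            sub(/^[*?-] */, "", status)
            if (status != "running")
                print trim($1), trim($2), trim($3), status
        }
    '
}

# Possible usage on a cell console:
#   sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show | failed_nodes
```

The node ID in the first column of each printed line is the value that the unregister commands further down expect in `--node-id`.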

3. Recommended way to recover a primary node

  1. Log in as root to the appliance management user interface of a running standby cell, https://standby_cell_ip_address:5480.
  2. Promote one of the standby nodes that are still running to be the new primary node.
  3. Check that the promotion was successful and that the vCloud Director Web UI at https://new_primary_cell_ip_address/cloud is available.
  4. Remove the failed appliance from vCenter.
  5. Unregister the failed node from the HA Cluster. This step requires cell console access and is a prerequisite for redeploying the failed appliance under the same name and address.
    sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr primary unregister --node-id=14714
    INFO: node vcd01 (ID: 14714) was successfully unregistered

    sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show
     ID    | Name  | Role    | Status        | Upstream | Location | Connection string
    -------+-------+---------+---------------+----------+----------+-----------------------------------------------
     3889  | vcd04 | standby |   running     | vcd02    | default  | host=192.168.194.14 user=repmgr dbname=repmgr
     28278 | vcd02 | primary | * running     |          | default  | host=192.168.194.12 user=repmgr dbname=repmgr
     27985 | vcd03 | standby | ? unreachable | vcd01    | default  | host=192.168.194.13 user=repmgr dbname=repmgr
  6. Deploy a new standby appliance. After you have completed step 5, you can deploy the new standby appliance with the same name and address as the failed and removed appliance.
    After a successful deployment, the newly deployed standby appliance should appear in the HA cluster status again.

    sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show
     ID    | Name  | Role    | Status        | Upstream | Location | Connection string
    -------+-------+---------+---------------+----------+----------+-----------------------------------------------
     3889  | vcd04 | standby |   running     | vcd02    | default  | host=192.168.194.14 user=repmgr dbname=repmgr
     5066  | vcd01 | standby |   running     | vcd02    | default  | host=192.168.194.11 user=repmgr dbname=repmgr
     27985 | vcd03 | standby | ? unreachable | vcd01    | default  | host=192.168.194.13 user=repmgr dbname=repmgr
     28278 | vcd02 | primary | * running     |          | default  | host=192.168.194.12 user=repmgr dbname=repmgr
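
The final check in step 6 (is the redeployed node back as a running standby?) can also be scripted. The following is a rough sketch, not an official tool; `node_is_running_standby` is a hypothetical name, and it relies on the column layout of the `cluster show` output shown in this article.

```shell
#!/bin/sh
# Sketch (hypothetical helper): reads "repmgr cluster show" output on stdin.
# $1 is the node name as shown in the "Name" column.
# Exit status 0 means the node is listed as a standby in state "running".
node_is_running_standby() {
    awk -F'|' -v name="$1" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        trim($2) == name && trim($3) == "standby" && trim($4) == "running" {
            found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Possible usage after redeploying vcd01:
#   sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show \
#       | node_is_running_standby vcd01 && echo "vcd01 is back as a standby"
```

A node that is still listed as unreachable, such as vcd03 in the output above, makes the helper return a non-zero status.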

4. Recommended way to recover a standby node

  1. Remove the failed appliance from vCenter.
  2. Unregister the failed node from the HA Cluster. This step requires cell console access and is a prerequisite for redeploying the failed appliance under the same name and address.
    sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr standby unregister --node-id=27985
    INFO: connecting to local standby
    INFO: connecting to primary database
    NOTICE: unregistering node 27985

    sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show
     ID    | Name  | Role    | Status    | Upstream | Location | Connection string
    -------+-------+---------+-----------+----------+----------+-----------------------------------------------
     3889  | vcd04 | standby |   running | vcd02    | default  | host=192.168.194.14 user=repmgr dbname=repmgr
     5066  | vcd01 | standby |   running | vcd02    | default  | host=192.168.194.11 user=repmgr dbname=repmgr
     28278 | vcd02 | primary | * running |          | default  | host=192.168.194.12 user=repmgr dbname=repmgr
  3. Deploy a new standby appliance. After you have completed step 2, you can deploy the new standby appliance with the same name and address as the failed and removed appliance.
    After a successful deployment, the newly deployed standby appliance should appear in the HA cluster status again.
    In my case, I had already deployed a new standby appliance with a different name and address, just to check whether that works as well, so my final output was the following.

    sudo -n -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show
     ID    | Name  | Role    | Status    | Upstream | Location | Connection string
    -------+-------+---------+-----------+----------+----------+-----------------------------------------------
     3889  | vcd04 | standby |   running | vcd02    | default  | host=192.168.194.14 user=repmgr dbname=repmgr
     5066  | vcd01 | standby |   running | vcd02    | default  | host=192.168.194.11 user=repmgr dbname=repmgr
     28278 | vcd02 | primary | * running |          | default  | host=192.168.194.12 user=repmgr dbname=repmgr
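
Sections 3 and 4 differ only in which unregister subcommand is used: `primary unregister` for a failed primary and `standby unregister` for a failed standby. The sketch below picks the matching command from the `cluster show` output and prints it without executing anything. The function name `unregister_command` and the `REPMGR` variable are my own inventions; the command line it prints is the one used in this article.

```shell
#!/bin/sh
# Path to repmgr as used throughout this article.
REPMGR=/opt/vmware/vpostgres/current/bin/repmgr

# Sketch (hypothetical helper): reads "repmgr cluster show" output on stdin.
# $1 is the node name from the "Name" column.
# Prints the unregister command matching the role of the node;
# it never runs the command itself.
unregister_command() {
    awk -F'|' -v name="$1" -v repmgr="$REPMGR" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        trim($2) == name {
            role = trim($3)
            if (role == "primary" || role == "standby") {
                printf "sudo -n -u postgres %s %s unregister --node-id=%s\n",
                       repmgr, role, trim($1)
                found = 1
            }
        }
        # Non-zero exit status if the node name was not found in the table.
        END { exit(found ? 0 : 1) }
    '
}

# Possible usage on a cell console:
#   sudo -n -u postgres "$REPMGR" cluster show | unregister_command vcd03
```

Printing instead of executing is deliberate: unregistering the wrong node is easy to do and annoying to undo, so it seems safer to review the command before running it, and only after a failed primary has been replaced by a promoted standby.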

     

You may also refer to the following documentation from VMware:

VMware documentation: how to recover from a primary database failure

VMware documentation: how to check the status of cells in a High Availability database cluster

If you have further questions, I am happy to answer them in the comments.
