Valid since:
XL Deploy 7.1.0

As of XL Deploy 7.1.0, you can configure XL Deploy in a clustered active/hot-standby mode. Running XL Deploy in this mode ensures that you have a Highly Available (HA) XL Deploy.

Active/hot-standby configuration

Tip: If you do not want to use active/hot-standby mode, you can set up failover handling as described in Configure failover for XL Deploy.

Requirements

Using XL Deploy in active/hot-standby mode requires the following:

  • You must meet the standard system requirements for XL Deploy.

    Important: Active/hot-standby is not supported on Microsoft Windows.

  • The XL Deploy repository must be stored in an external database, using the instructions in this topic. If you are using the default XL Deploy configuration (which stores the repository in an embedded Derby database), you must migrate all of your data to an external database before you can start to use active/hot-standby.

  • You must use a load balancer that supports hot-standby and health checking through HTTP status codes. This topic describes how to set up the HAProxy load balancer.

  • You must have a shared filesystem (such as NFS) that both the active and standby XL Deploy nodes can reach. This will be used to store binary data (artifacts). As of XL Deploy 8.0, artifacts can alternatively be stored in the database, by setting xl.repository.artifacts.type = db in xl-deploy.conf, as described in this document.

  • The time on both XL Deploy nodes must be synchronized through an NTP server.

  • The firewall must allow traffic on port 2552 in the default setup. For more information, see optional cluster settings.

Limitation on HTTP session sharing and resiliency

In active/hot-standby mode, only one XL Deploy node is “active” at any given time. The nodes use a health REST endpoint (/deployit/ha/health) to tell the load balancer which node is currently the active one. The load balancer should always route users to the active node; calling a standby node directly will result in incorrect behavior.

XL Deploy does not share HTTP sessions among nodes. If the active XL Deploy node becomes unavailable:

  • All users will be logged out and will lose any work that was not stored in the database.
  • Any deployment or control tasks that were running on the previously active node must be recovered manually. Tasks that were running are not automatically visible on the newly active node, because taking them over automatically could lead to data corruption in split-brain scenarios.

Active/Hot-standby setup procedure

The initial active/hot-standby setup is:

  • A load balancer
  • A database server
  • Two XL Deploy servers

To set up an active/hot-standby cluster, you must manually configure each XL Deploy instance before starting:

  • Provide the correct database driver
  • Modify two configuration files:
    • system.conf
    • repository.conf (XL Deploy up to 7.6) or xl-deploy.conf (XL Deploy 8.0 and up)

Step 1 Configure external databases

Set up a database to be used as a repository in XL Deploy. The following external databases are recommended:

  • MySQL
  • PostgreSQL
  • Oracle 11g or 12c

The following SQL privileges are required, where applicable (an example grant script is shown after the list):

  • REFERENCES
  • INDEX
  • CREATE
  • DROP
  • SELECT, INSERT, UPDATE, DELETE
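
For example, on MySQL the repository database and user could be created as follows. This is a sketch: the database name, user, password, and host pattern are placeholders, and the exact statements differ per database vendor.

    # Hypothetical example for MySQL: create the repository database and a user
    # with the privileges listed above.
    mysql -u root -p <<'SQL'
    CREATE DATABASE xldrepo CHARACTER SET utf8;
    CREATE USER 'xld'@'%' IDENTIFIED BY 'my_password';
    GRANT REFERENCES, INDEX, CREATE, DROP, SELECT, INSERT, UPDATE, DELETE
      ON xldrepo.* TO 'xld'@'%';
    SQL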

Provide JDBC drivers

Place the JAR file containing the JDBC driver of the selected database in the XL_DEPLOY_SERVER_HOME/lib directory (an example copy command follows the table). Download the JDBC database drivers:

{:.table .table-striped}
| Database   | JDBC drivers | Notes   |
| ---------- | ------------ | ------- |
| MySQL      | [Connector/J 5.1.30 driver download](http://dev.mysql.com/downloads/connector/j/) | None. |
| Oracle     | [JDBC driver downloads](http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html) | For Oracle 12c, use the 12.1.0.1 driver (`ojdbc7.jar`). It is recommended that you only use the thin drivers. For more information, see the [Oracle JDBC driver FAQ](http://www.oracle.com/technetwork/topics/jdbc-faq-090281.html). |
| PostgreSQL | [PostgreSQL JDBC driver](https://jdbc.postgresql.org/download.html)| None. |
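
For example, for PostgreSQL (the driver file name matches the Docker example later in this topic; use the version you actually downloaded):

    # Copy the downloaded JDBC driver into the lib directory of every XL Deploy node,
    # assuming XL_DEPLOY_SERVER_HOME points to the installation directory.
    cp postgresql-42.2.4.jar "$XL_DEPLOY_SERVER_HOME/lib/"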

Step 2 Configure the repository database in XL Deploy

When active/hot-standby is enabled, the repository database must be shared among all nodes. Ensure that every node has access to the shared repository database and is set up to connect to the same database using the same credentials. Artifacts can be stored in a separate location on a file system. This location must be shared among the XL Deploy nodes, for example by using NFS or by mounting the same volume.

XL Deploy version 8.0 and later

Important: The database must be created with the correct permissions before starting XL Deploy. For example, the user must have CRUD access.

Repository configuration in XL Deploy 8.0 and later requires changes in xl-deploy.conf, as described here.

For example:

xl {
    artifacts {
        type = "db"

        ### alternatively (make sure to share this file location across nodes e.g. via NFS):
        # type = "file"
        # root = "repository/artifacts"
    }

    database {
        db-driver-classname = "org.postgresql.Driver"
        db-url = "jdbc:postgresql://localhost:5432/xldrepo"
        db-username = "username"
        db-password = "password"
    }
}

Proceed to Step 3 Set up the cluster.

XL Deploy version 7.6 and earlier

Repository configuration in XL Deploy version 7.6 and earlier requires changes in repository.conf, as described here.

  1. Add the xl.repository.configuration property to the repository.conf configuration file, which uses the HOCON format. This property identifies the predefined repository configuration that you have chosen. Supported values are:

    {:.table .table-striped}
    | Value | Description |
    | ----- | ----------- |
    | `default` | Default (single instance) configuration that uses an embedded Apache Derby database. |
    | `mysql-standalone` | Single instance configuration that uses a MySQL database. |
    | `mysql-cluster` | Cluster-ready configuration that uses a MySQL database. |
    | `oracle-standalone` | Single instance configuration that uses an Oracle database. |
    | `oracle-cluster` | Cluster-ready configuration that uses an Oracle database. |
    | `postgresql-standalone` | Single instance configuration that uses a PostgreSQL database. |
    | `postgresql-cluster` | Cluster-ready configuration that uses a PostgreSQL database. |
  2. Add the following parameters to the xl.repository.persistence section of repository.conf:

    {:.table .table-striped}
    | Parameter | Description |
    | --------- | ----------- |
    | `jdbcUrl` | JDBC URL that describes the database connection details; for example, `"jdbc:oracle:thin:@oracle.hostname.com:1521:SID"`. |
    | `username` | User name to use when logging into the database. |
    | `password` | Password to use when logging into the database (after setup is complete, the password will be encrypted and stored in secured format). |
    | `maxPoolSize` | Database connection pool size; the suggested value is 20. |
  3. Set xl.repository.jackrabbit.artifacts.location to a shared filesystem (such as NFS) location that all nodes can access. This is required for the storage of binary data (artifacts).

  4. Set xl.repository.cluster.nodeId to a unique value on each node. The value of xl.repository.cluster.nodeId is used to distinguish entries in the database for each running Jackrabbit instance.

This is an example of the xl.repository configuration for a stand-alone database:

xl {
  repository {
    # placeholder for repository configuration overrides
    configuration = XLD_CONFIGURATION
    jackrabbit.artifacts.location = XLD_SHARED_LOCATION
    cluster.nodeId = XLD_HOSTNAME

    persistence {
      jdbcUrl = XLD_DB_REPOSITORY_URL
      username = XLD_DB_USER
      password = XLD_DB_PASS
      maxPoolSize = 20
    }
  }
}

Step 3 Set up the cluster

Additional active/hot-standby configuration settings must be provided in the XL_DEPLOY_SERVER_HOME/conf/system.conf file, which also uses the HOCON format.

In this file, on each node:

  1. Enable clustering by setting cluster.mode to hot-standby.
  2. Provide database access to register active nodes to a membership table by adding a cluster.membership configuration containing the following keys:

    {:.table .table-striped}
    | Parameter | Description |
    | --------- | ----------- |
    | `jdbc.url` | JDBC URL that describes the database connection details; for example, `"jdbc:oracle:thin:@oracle.hostname.com:1521:SID"`. |
    | `jdbc.username` | User name to use when logging into the database. |
    | `jdbc.password` | Password to use when logging into the database (after setup is complete, the password will be encrypted and stored in secured format). |

You can set up XL Deploy to reuse the same database URL, username, and password for both the cluster membership information and the repository configuration (in xl-deploy.conf or repository.conf, depending on your version).

This is a simple example:

cluster {
  mode = hot-standby

  membership {
    jdbc {
      url = "jdbc:mysql://db/xldrepo?useSSL=false"
      username = <my_username>
      password = <my_password>
    }
  }
}

For additional system.conf information, see Optional cluster settings.

Step 4 Set up the first node

  1. Open a command prompt and point to XL_DEPLOY_SERVER_HOME.
  2. Execute ./bin/run.sh -setup for Linux-based systems, or run.cmd -setup for Microsoft Windows.
  3. Follow the on-screen instructions.

Step 5 Prepare each node in the cluster

  1. Compress the distribution that you created in Step 3 Set up the cluster into a ZIP file.
  2. Copy the ZIP file to all other nodes and unzip it on each one (see the example commands after this list).
  3. In XL Deploy 7.6 or earlier, edit the xl.repository.cluster.nodeId setting in the XL_DEPLOY_SERVER_HOME/conf/repository.conf file on each node so that each node has a unique value.
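
A minimal sketch of steps 1 and 2, assuming two Linux nodes with SSH access (the hostname xld-node2, the installation path, and the archive name are placeholders):

    # On the configured node: package the installation directory.
    cd "$XL_DEPLOY_SERVER_HOME/.."
    zip -r xl-deploy-server.zip "$(basename "$XL_DEPLOY_SERVER_HOME")"

    # Copy the archive to the other node and unpack it there.
    scp xl-deploy-server.zip xld-node2:/opt/
    ssh xld-node2 'cd /opt && unzip xl-deploy-server.zip'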

Note: You do not need to run the server setup command on each node.

Step 6 Set up the load balancer

To use active/hot-standby, you must use a load balancer in front of the XL Deploy servers. The load balancer must check the /deployit/ha/health endpoint with a GET or a HEAD request to verify that the node is up.

This endpoint will return:

  • A 503 HTTP status code if this node is running as standby (non-active) node.
  • A 204 HTTP status code if this is the active node. All user traffic should be sent to this node.

Note: Performing a simple TCP check or GET operation on / is not sufficient. That will only determine if the node is running and will not indicate if the node is in standby mode.
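
To check manually which node is currently active, you can query the health endpoint directly, for example with curl (the hostnames and port are placeholders):

    # Prints 204 for the active node and 503 for a standby node.
    curl -s -o /dev/null -w "%{http_code}\n" http://xld-node1:4516/deployit/ha/health
    curl -s -o /dev/null -w "%{http_code}\n" http://xld-node2:4516/deployit/ha/health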

For example, using HAProxy, you can add the following configuration:

backend default_service
  option httpchk GET /deployit/ha/health HTTP/1.0
  server xl-deploy1 <XL_DEPLOY_1-ADDRESS>:4516 check inter 2000 rise 2 fall 3
  server xl-deploy2 <XL_DEPLOY_2-ADDRESS>:4516 check inter 2000 rise 2 fall 3

Step 7 Start the nodes

Starting with the first node that you configured, start XL Deploy on each node. Ensure that each node is fully up and running before starting the next one.
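
For example, on a Linux node (a sketch assuming the standard directory layout and start script):

    # Start XL Deploy in the background on this node; wait until it is fully up
    # before starting the next node.
    cd "$XL_DEPLOY_SERVER_HOME"
    nohup ./bin/run.sh > xl-deploy.out 2>&1 &
    tail -f xl-deploy.out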

Handling unreachability and network partitions (“split-brain scenarios”)

Transient network failure can affect XL Deploy nodes that are in hot-standby mode. These nodes will continually check the reachability of other nodes by sending messages and expecting responses.

The cluster may detect nodes as unreachable for various reasons:

  • A network experiences a temporary outage or congestion (for example, a network partition).
  • An XL Deploy node crashes or is overloaded and cannot respond in time.

It is not possible to distinguish between these occurrences from inside the cluster. One of the strategies described below must be chosen to decide how to deal with unreachable members.

Note: None of these strategies apply for graceful shutdown of XL Deploy nodes. During a graceful shutdown, the cluster will be properly notified by the leaving member and immediate handover will take place.

Strategy: Auto-remove unreachable nodes from cluster

By default, XL Deploy uses this strategy to remove unreachable nodes from the cluster:

If one or more nodes are unreachable for 15 seconds, the cluster marks these nodes as down, assuming that they crashed. If the active node is affected, one of the standby nodes is activated automatically. The timeout of 15 seconds can be adjusted by setting cluster.akka.cluster.auto-down-unreachable-after in system.conf; see Optional cluster settings below.
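
For example, to give the cluster 30 seconds instead of 15 before a standby node takes over, you could add the following to system.conf (a sketch; see Optional cluster settings below):

    cluster {
      akka.cluster {
        # Wait longer before marking an unreachable node as down and
        # activating a standby node.
        auto-down-unreachable-after = 30 seconds
      }
    }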

Important: If a network partition occurs and persists for longer than the configured timeout, the formerly active node remains up while a second active node is elected. This is called a split-brain scenario because two nodes are active at the same time. If the load balancer can still reach both active nodes, it may route traffic to either of them, which can result in incorrect behavior or data loss.

Strategy: Keep the oldest member

When nodes cannot detect the original active node for a period of time, they shut down to prevent two nodes being active at the same time. This strategy prevents data loss. If a long-lasting network partition occurs or the active node crashes unexpectedly, this can result in most or all cluster nodes shutting down.

Note: The graceful shutdown of XL Deploy nodes will continue to be handled using managed handover and a single new active node will be selected automatically.

To use this strategy, add the following settings in the cluster section of system.conf (and see Optional cluster settings below):

  akka.cluster {
    custom-downing.down-removal-margin = 10s
    custom-downing.stable-after = 10s
    downing-provider-class = "com.xebialabs.xlplatform.cluster.full.downing.OldestLeaderAutoDowningProvider"
  }

Strategy: Keep the majority

When a network partition occurs, the partition with the majority of nodes stays active and the nodes in the minority partition shut down. This strategy requires an odd total number of nodes (that is, at least three) and guarantees that the majority of nodes stays active. If the currently active node is in the minority partition, an unnecessary handover can take place.

To use this strategy, add the following settings in the cluster section of system.conf (and see Optional cluster settings below):

  akka.cluster {
    custom-downing.down-removal-margin = 10s
    custom-downing.stable-after = 10s
    downing-provider-class = "com.xebialabs.xlplatform.cluster.full.downing.MajorityLeaderAutoDowningProvider"
  }

Strategy: Manual

If none of the above strategies apply to your scenario, you can use the manual strategy. To use this strategy, add this setting in the cluster section of system.conf (and see Optional cluster settings below):

    akka.cluster {
        auto-down-unreachable-after = off
    }

Manual monitoring and intervention are required; both can be done using JMX MBeans. The MBean akka.Cluster has an attribute named Unreachable that lists the cluster members that cannot be reached from the current node. Members are identified by IP address and Akka communication port number (2552 by default; see Optional cluster settings below). If there are unreachable members, a log line will also appear periodically in the XL Deploy logs while the situation persists.

Note: In this strategy, the reachability of members has no influence on whether nodes become active, remain in standby, or go down. If the active node goes down (for example, due to a crash rather than a network partition), no other node will become active automatically. This results in longer apparent downtime of the overall active/hot-standby XL Deploy system.

If a cluster member is reported as unreachable on any (other) node, an operator must verify whether that member is actually down or whether the report is only caused by a network partition.

If the node is down, the operator must remove it (de-register it) from the cluster manually. The same akka.Cluster MBean has a down(...) operation that requires the IP address and communication port number (as shown in the Unreachable attribute). This must be done on only a single node of the remaining cluster; the information spreads throughout the cluster automatically. Once all unreachable members have been removed from the cluster, automatic election of a new active member will take place if necessary.

Sample system.conf configuration

This is a sample system.conf configuration for one node that uses a MySQL repository database to store its cluster membership information. Note: The XL Deploy repository is configured in xl-deploy.conf or repository.conf (depending on your XL Deploy version as described above) but it can reuse the same database settings.

    task {
      recovery-dir = work
      step {
        retry-delay = 5 seconds
        execution-threads = 32
      }
    }

    satellite {

      // satellite settings omitted in this example

    }

    cluster {
      mode = hot-standby

      membership {
        jdbc {
          url = "jdbc:mysql://db/xldrepo?useSSL=false"
          username = <my_username>
          password = <my_password>
        }
      }

      akka.cluster {
        auto-down-unreachable-after = off
        custom-downing.down-removal-margin = 10s
        custom-downing.stable-after = 10s
        downing-provider-class = "com.xebialabs.xlplatform.cluster.full.downing.MajorityLeaderAutoDowningProvider"
      }
    }

Optional cluster settings

You can optionally configure the following additional settings in the cluster section of system.conf:

{:.table .table-striped}
| Parameter | Description | Default value |
| --------- | ----------- | ------------- |
| `name` | The hot-standby management Akka cluster name. | `xld-hotstandby-cluster` |
| `membership.jdbc.driver` | The database driver class name; for example, `oracle.jdbc.OracleDriver`. | Determined from the database URL |
| `membership.heartbeat` | How often a node should write liveness information into the database. | 10 seconds |
| `membership.ttl` | How long liveness information remains valid. | 60 seconds |
| `akka.remote.netty.tcp.hostname` | The hostname or IP address that XL Deploy uses to communicate with the cluster. | Auto-determined |
| `akka.remote.netty.tcp.port` | The port number that XL Deploy uses to communicate with the cluster. | 2552 |
| `akka.cluster.auto-down-unreachable-after` | The amount of time that passes before the Akka cluster determines that a node has gone down. | 15 seconds |
| `akka.cluster.downing-provider-class` | The strategy to use for handling network partitions. | `akka.cluster.AutoDowning` |
| `akka.cluster.custom-downing.down-removal-margin` | The amount of time before the handover activates. | None |
| `akka.cluster.custom-downing.stable-after` | How much time must pass before the network is marked stable. | None |

Notes on these settings:

  • The heartbeat and ttl settings are relevant for cluster bootstrapping. A newly starting node will search in the database to find live nodes and try to join the cluster with the given name running on those nodes.

  • The akka.remote.netty.tcp.hostname setting determines which IP address the node uses to register with the cluster. Use this setting when the node is on multiple networks or when it broadcasts its presence as 127.0.0.1:2552; in that case, other nodes may not be able to see it, which can result in multiple active nodes (a configuration sketch follows these notes). Check your deployit.log for entries like this:

    2018-08-24 13:08:06.799 [xld-hotstandby-cluster-akka.actor.default-dispatcher-17] {sourceThread=xld-hotstandby-cluster-akka.actor.default-dispatcher-17, akkaSource=akka.tcp://xld-hotstandby-cluster@127.0.0.1:2552/user/$b, sourceActorSystem=xld-hotstandby-cluster, akkaTimestamp=13:08:06.797UTC} INFO c.x.x.c.m.ClusterDiscoveryActor - Starting node registration

  • The auto-down-unreachable-after setting is used in the default unreachability handling strategy. It defines the time the cluster needs to decide that a node is down. If the active node has been unreachable for this long, a standby node is activated. A smaller value makes the hot-standby takeover happen sooner, but transient network issues can then cause a takeover while the original node is still alive. A longer value makes the cluster more resilient against transient network failures, but the takeover takes more time when a crash occurs. For more information, see Strategy: Auto-remove unreachable nodes from cluster. If you want manual control over the activation process, change the setting to off; see Strategy: Manual.

  • The downing-provider-class setting specifies which automatic strategy to use for handling network partitions. This setting is optional. The possible values are:
    • com.xebialabs.xlplatform.cluster.full.downing.OldestLeaderAutoDowningProvider; see Strategy: Keep the oldest member.
    • com.xebialabs.xlplatform.cluster.full.downing.MajorityLeaderAutoDowningProvider; see Strategy: Keep the majority.

    For both strategies, stable-after and down-removal-margin are required settings.

  • The stable-after and down-removal-margin settings determine when the network partition handling strategy starts. When a network partition arises, there is a period during which nodes fluctuate between reachable and unreachable states; after some time the situation stabilizes. The stable-after setting determines how long to wait for additional reachable/unreachable notifications before considering the current situation stable. The down-removal-margin setting is an additional timeout in which proper handover is arranged. The recommended value for both settings for clusters of up to ten nodes is 10s.
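
For example, to pin the address and port that a multi-homed node uses for cluster communication (the IP address shown is a placeholder), you could add the following to the cluster section of system.conf:

    cluster {
      akka.remote.netty.tcp {
        hostname = "10.0.1.15"   # an address that the other cluster nodes can reach
        port     = 2552          # default; make sure the firewall allows it
      }
    }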

Note: After the first run, passwords in the configuration file will be encrypted and replaced with Base64-encoded values.

Using the official XL Deploy Docker images

The official XL Deploy Docker images can be configured to run in active/hot-standby mode by providing file mounts with configuration files containing custom settings.

Important Before running XL Deploy:

  1. In the directory that you will mount into the container, ensure that the conf folder contains a valid license, a system.conf file, and an xl-deploy.conf file with the hot-standby settings described above.
  2. Ensure that the hotfix/lib directory contains the configured database driver.

To run the official XL Deploy Docker images in active/hot-standby mode:

  1. Configure each image to have the same volume mounts for the conf, repository, and hotfix/lib directories. For more information about configuring Docker images, see Use the XL Deploy Docker images.
  2. Start and boot a single Docker container.
  3. After the container has fully booted, start additional containers.

Example

Use the following filesystem structure:

    /somewhere
      |- xld-repo
      |- xld-libs
      |    |- postgresql-42.2.4.jar
      |- xld-conf
           |- deployit-license.lic
           |- system.conf
           |- xl-deploy.conf
  1. Start the first container:
  docker run -d -p 4516:4516 \
      -v /somewhere/xld-conf:/opt/xl-deploy-server/conf \
      -v /somewhere/xld-libs:/opt/xl-deploy-server/hotfix/lib \
      -v /somewhere/xld-repo:/opt/xl-deploy-server/repository \
      -e ADMIN_PASSWORD=T0pS3cret \
      --name xld xebialabs/xl-deploy:8.1
  2. After the first container has fully booted, start additional standby containers by using the command above: map container port 4516 to a different host port and assign each container a unique name (see the full command after this list).

    Example: -p 4517:4516 --name xld2
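
Putting this together, a second (standby) container on the same host could be started like this (same mounts and image as in the first command; only the host port and container name change):

    docker run -d -p 4517:4516 \
        -v /somewhere/xld-conf:/opt/xl-deploy-server/conf \
        -v /somewhere/xld-libs:/opt/xl-deploy-server/hotfix/lib \
        -v /somewhere/xld-repo:/opt/xl-deploy-server/repository \
        -e ADMIN_PASSWORD=T0pS3cret \
        --name xld2 xebialabs/xl-deploy:8.1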

Sample haproxy.cfg configuration

This is a sample haproxy.cfg configuration exposing proxy statistics on port 1936 (with credentials stats/stats). Change the bottom two lines to match your setup. Ensure that your configuration is hardened before using it in a production environment. For more information, see haproxy-dconv.

    global
      log 127.0.0.1 local0
      log 127.0.0.1 local1 notice
      log-send-hostname
      maxconn 4096
      pidfile /var/run/haproxy.pid
      user haproxy
      group haproxy
      daemon
      stats socket /var/run/haproxy.stats level admin
      ssl-default-bind-options no-sslv3
      ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:DHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:AES128-GCM-SHA256:AES128-SHA256:AES256-GCM-SHA384:AES256-SHA256
    defaults
      balance roundrobin
      log global
      mode http
      option redispatch
      option httplog
      option dontlognull
      option forwardfor
      timeout connect 5000
      timeout client 50000
      timeout server 50000
    listen stats
      bind :1936
      mode http
      stats enable
      timeout connect 10s
      timeout client 1m
      timeout server 1m
      stats hide-version
      stats realm Haproxy\ Statistics
      stats uri /
      stats auth stats:stats
    frontend default_port_80
      bind :80
      reqadd X-Forwarded-Proto:\ http
      maxconn 4096
      default_backend default_service
    backend default_service
      option httpchk HEAD /deployit/ha/health HTTP/1.0
      server xl-deploy1 <XL_DEPLOY_1-ADDRESS>:4516 check inter 2000 rise 2 fall 3
      server xl-deploy2 <XL_DEPLOY_2-ADDRESS>:4516 check inter 2000 rise 2 fall 3