# High Availability

> How automatic failover keeps your database online

## Overview

Rivestack HA clusters use [Patroni](https://github.com/patroni/patroni) for automatic failover and cluster management. When a primary node fails, a replica is promoted automatically, typically within 10-30 seconds, with no manual intervention required.

<Info>Automatic failover requires an HA cluster (2+ nodes). The [Free](/pricing) and [Solo](/pricing) plans run a single node and do not include failover. Solo is a good fit when you want dedicated, backed-up PostgreSQL but don't yet need multi-node resilience.</Info>

## Architecture

Each HA cluster consists of:

* **1-3 PostgreSQL nodes** — one primary and up to two replicas (automatic failover requires at least one replica)
* **2 load balancers** — route connections to the correct node with a floating VIP (Virtual IP)
* **etcd** — distributed consensus store for leader election
* **Patroni** — cluster manager that handles replication, failover, and health checks

```
                    ┌──────────────┐
   Clients ────────►│Load Balancer │
                    │  (VIP: 5432) │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
         ┌─────────┐ ┌─────────┐ ┌─────────┐
         │ Primary  │ │ Replica │ │ Replica │
         │ (node 1) │ │ (node 2)│ │ (node 3)│
         └─────────┘ └─────────┘ └─────────┘
              │            ▲            ▲
              └──── streaming replication──┘
```

## How failover works

1. **Detection** — Patroni continuously monitors the health of each PostgreSQL node. If the primary becomes unresponsive, Patroni detects the failure within seconds.

2. **Leader election** — Patroni uses etcd to coordinate leader election. The healthiest replica with the least replication lag is selected as the new primary.

3. **Promotion** — The selected replica is promoted to primary. It begins accepting writes immediately.

4. **Routing** — The load balancer detects the topology change and routes new connections to the new primary. Existing connections to the old primary are terminated.

5. **Recovery** — When the failed node comes back online, it automatically rejoins the cluster as a replica and begins streaming from the new primary.
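
The selection rule in step 2 can be illustrated with a toy model. This is only a sketch of the idea "healthy replica with the least lag wins"; it is not Patroni's actual election code, and the `Replica` type and `pick_failover_candidate` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    name: str
    healthy: bool
    lag_bytes: int  # un-replayed WAL, in bytes


def pick_failover_candidate(replicas: list[Replica]) -> Optional[Replica]:
    """Return the healthy replica with the least lag, or None if none qualifies."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.lag_bytes)


replicas = [
    Replica("node2", healthy=True, lag_bytes=4096),
    Replica("node3", healthy=True, lag_bytes=512),
]
candidate = pick_failover_candidate(replicas)
print(candidate.name if candidate else "no candidate")  # prints "node3"
```

In a real cluster the decision is coordinated through etcd so that only one node can hold the leader role at a time; the sketch covers only the ranking step.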

## Failover expectations

| Metric                  | Value                                                |
| ----------------------- | ---------------------------------------------------- |
| **Detection time**      | \~5-10 seconds                                       |
| **Promotion time**      | \~5-15 seconds                                       |
| **Total failover time** | \~10-30 seconds                                      |
| **Data loss**           | Zero (synchronous commit with streaming replication) |

<Info>During failover, active connections to the old primary will be dropped. Your application should implement connection retry logic to handle brief interruptions.</Info>
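
A minimal sketch of such retry logic follows, using exponential backoff with a small random jitter. The function name `connect_with_retry` and all default values are illustrative assumptions, not Rivestack settings. With the defaults shown, the delays are roughly 1, 2, 4, 8, and 15 seconds, which adds up to about 30 seconds and so spans the failover window in the table above.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, OSError),
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, backing off exponentially between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 10% jitter so
            # that many clients do not reconnect at the same instant.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))
    # Only reached when the loop never ran.
    raise ValueError("max_attempts must be at least 1")
```

Assuming the psycopg driver, a call might look like `connect_with_retry(lambda: psycopg.connect(dsn), retry_on=(psycopg.OperationalError,))`. Other drivers raise different exception types, so adjust `retry_on` accordingly.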

## Replication

Rivestack uses PostgreSQL **streaming replication** to keep replicas in sync with the primary:

* WAL (Write-Ahead Log) records are streamed from primary to replicas in real time.
* Replicas are used for automatic failover.
* Replication lag is monitored and visible in the dashboard Metrics tab.

### Replication lag

Monitor replication lag for each replica in the dashboard:

* **Lag time** — How far behind the replica is (in seconds)
* **Lag bytes** — Size of un-replayed WAL data

Under normal conditions, replication lag is sub-second. High write throughput or network issues can temporarily increase lag.
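
For intuition about what a byte-based lag figure measures: PostgreSQL identifies WAL positions by LSN, a 64-bit value printed as two hexadecimal numbers separated by a slash. The sketch below converts two LSNs to byte positions and subtracts them, which mirrors what the server-side function `pg_wal_lsn_diff` computes. Whether the dashboard derives **Lag bytes** exactly this way is an assumption, and the helper names are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000148' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replica_replay_lsn))


print(lag_bytes("0/3000148", "0/3000000"))  # prints 328 (0x148 bytes)
```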

## Node roles

| Role        | Reads         | Writes | Count     |
| ----------- | ------------- | ------ | --------- |
| **Primary** | Yes           | Yes    | Exactly 1 |
| **Replica** | Failover only | No     | 0-2       |

## Connection routing

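All connections go through port `5432`. The load balancer automatically routes traffic to the current primary, so no client-side routing configuration is needed. After a failover, new connections reach the new primary through the same endpoint; connections that were open to the old primary are dropped and must be re-established.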
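
Because there is a single endpoint, the client only needs one connection string. The sketch below builds a libpq-style URI with a connect timeout and TCP keepalives, which help a client notice a dead connection sooner. The parameter names are standard libpq options; the function name, the host, and the chosen values are placeholders for illustration, not Rivestack recommendations.

```python
from urllib.parse import quote, urlencode


def build_dsn(host: str, user: str, password: str, dbname: str) -> str:
    """Build a libpq connection URI that always targets port 5432."""
    params = {
        "connect_timeout": "5",       # give up on a connection attempt after 5 s
        "keepalives": "1",            # enable TCP keepalives
        "keepalives_idle": "30",      # idle seconds before the first probe
        "keepalives_interval": "10",  # seconds between probes
        "keepalives_count": "3",      # failed probes before the link is dropped
    }
    # Percent-encode credentials so characters like '@' or '/' stay valid.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:5432/{quote(dbname, safe='')}?{urlencode(params)}"
    )


# "db.example.com" is a placeholder, not a real Rivestack hostname.
print(build_dsn("db.example.com", "app", "p@ss", "appdb"))
```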

## Best practices

<AccordionGroup>
  <Accordion title="Implement connection retry logic">
    During failover, connections to the old primary are terminated. Your application should:

    * Catch connection errors and retry after a short initial delay (1-3 seconds).
    * Use exponential backoff for subsequent retries.

    Most PostgreSQL drivers can also handle reconnection automatically when configured to do so.
  </Accordion>

  <Accordion title="Use at least 2 nodes">
    A single-node cluster has no failover target. Run at least 2 nodes in production to ensure automatic recovery from node failures.
  </Accordion>

  <Accordion title="Monitor replication lag">
    Check the Metrics tab regularly. Sustained replication lag may indicate that the replica needs more resources or that there are network issues.
  </Accordion>

  <Accordion title="Test failover">
    Verify your application handles failover gracefully by observing behavior during planned maintenance windows.
  </Accordion>
</AccordionGroup>
