This page explains how to install a 2-node promotable (single-master) cluster on Fedora 19 using PostgreSQL synchronous streaming replication which is implemented since version 9.1.
== Preliminary Note ==
=== Features ===
* Fail-over of Master
** If Master breaks down, RA detects this fault, stops Master, and promotes replica.
* Switching between asynchronous and synchronous
** If replica breaks down or LAN has some troubles, the transaction including Write operation will be stopped during the setting of synchronous replication. This means the stop of service. Therefore, RA switches dynamically from the synchronous to the asynchronous replication for prevented from stopping service.
* Automated discrimination of data old and new when initial starts
** When Pacemaker is started at the same time in the initial starts, RA compares the data of each node using "last xlog replay location" or "last receive replay location" to check which node has the newest data. The node which has the newest data will be Master.
* Load-balancing of read
** Because replica can process the Read only-transaction, the load-balancing of Read is possible by applying another virtual IP for the Read operation.
=== Terms ===
These terms are used in this document.
* Master,replica
** Corresponding to Promoted/Unpromoted roles of the promotable clone in Pacemaker
* PRI
** PostgreSQL works as Primary(Master). The request of Read/Write can be processed in PRI as well as usual PostgreSQL, and data for the replication can be transmitted. Master of Pacemaker is basically corresponding to this state. At a synchronous replication, The transaction is stopped when the response from Standby PostgreSQL(HS) is lost.
* HS
** PostgreSQL works as Hot Standby. Only the request of Read is available. The data of the replication can be received from PRI. Since PostgreSQL cannot change the state from PRI to HS directly, the state may not be consistent with Unpromoted role of Pacemaker after demoting though the state changes to Pacemaker with Master->replica.
* async mode
** Make the asynchronous replication of PostgreSQL HA cluster with RA.
* sync mode
** Make the synchronous replication of PostgreSQL HA cluster with RA. If parameter "rep_mode" is sync, HS works as the sync mode as a normal. When Master(PRI) detects HS is breaking down on monitor, Master switches replication mode to async mode automatically.
* replication mode
** The async mode and the sync mode are generically called a replication mode.It can be set by "rep_mode" parameter.
=== Software version ===
We developed pgsql RA using pacemaker 1.0.11 with heartbeat stack and linux distribution compatible with RHEL6.
It will probably work on other distributions but I recommend using newer pacemaker and replacing pgsql RA(https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql) if your resource-agents < 3.9.6.
=== Network Topology ===
It's recommended that LANs(for service, pacemaker, replication, stonith) are separated and redundant.
But we use these topology in this document to explain simply.
* node1
** eth0 : 192.168.0.1 : LAN for service
** eth1 : 192.168.1.1 : LAN for Pacemaker
** eth2 : 192.168.2.1 : LAN for replication
* node2
** eth0 : 192.168.0.2 : LAN for service
** eth1 : 192.168.1.2 : LAN for Pacemaker
** eth2 : 192.168.2.2 : LAN for replication
virtual IP (vip)
* virtual IP1 : 192.168.0.3 : vip for eth0 (DB client connects this IP to access PostgreSQL(Master))
* virtual IP2 : 192.168.2.3 : vip for eth2 (replica connects this IP to replicate)
[[File:Pg-rep-adressing-schema.png|300px|thumb|center|IP Adressing Schema]]
[[File:Pg-rep-failover.png|600px|thumb|center|Fail-over]]
=== PostgreSQL ===
Use PostgreSQL 9.1 or later.
=== Parameters of pgsql RA ===
The following parameters are added for replication.
* rep_mode
** choice from async or sync to use replication."async" is used for async mode only, "sync" is used for switching between sync mode and async mode.The following parameter node_list master_ip, and restore_command is necessary at async or sync modes(*).
* node_list(*)
** The list of PRI and HS node names. Specifies a space-separated list of all node name (result of the uname -n command).
* master_ip(*)
** HS connects to this IP. It equals virtual IP2 in this documents.
* restore_command(*)
** restore_command specified in recovery.conf file when starting with HS.
* repuser
** The user of replication which HS connects to PRI. Default is "postgres". Please use .pgpass file to set password.
* primary_conninfo_opt
** RA generates recovery.conf file to start PostgreSQL as HS. host,port,user and application name of primary_conninfo are automatically set by RA. If you would like to set some additional parameters, you can specify them here.
* tmpdir
** the rep_mode_conf and xlog_note.* and PGSQL.lock files are created in this directory. Default is /var/lib/pgsql/tmp directory. If the directory doesn't exist, RA makes it automatically.
* xlog_check_count
** The count of cheking last_xlog_replay_location or last_xlog_receive_location is specified to compare data. Default is 3(times). It is counted at monitor interval.
* stop_escalate_in_slave
** Number of shutdown retries (using -m fast) before resorting to -m immediate in replica state. In Master sate, you can use "stop_escalate" which is provided since early times.
* restart_on_promote
** RA restarts PostgreSQL on promote instead of promote to prevent from increasing Timeline ID of PostgreSQL since HS can't connect PRI if Timeline ID is different. Default is false and you should copy data from PRI to align Timeline ID after promoting.
== Installation ==
In this documents, we use Fedora 19.
=== OS (Fedora 19) (both nodes) ===
We don't explain details.
* disable selinux
** edit /etc/selinux/config
* disable firewall(iptables)
** systemctl stop firewalld.service
** systemctl disable firewalld.service
* set node name "node1" and "node2".
** edit /etc/sysconfig/network
* stop and disable NetworkManager and set up IPs as above.
** systemctl stop NetworkManager.service
** systemctl disable NetworkManager.service
** edit /etc/sysconfig/network-scripts/ifcfg-ethX
=== Packages (both nodes) ===
```
# yum -y install postgresql-server pacemaker corosync pcs
```
=== Replacement of pgsql RA (both nodes) ===
Replace pgsql RA because the RA of resource-agents-3.9.5-2.fc19.x86_64 has an bug
when using it as a promotable clone.
```
# wget https://raw.github.com/ClusterLabs/resource-agents/a6f4ddf76cb4bbc1b3df4c9b6632a6351b63c19e/heartbeat/pgsql
# cp pgsql /usr/lib/ocf/resource.d/heartbeat/
# chmod 755 /usr/lib/ocf/resource.d/heartbeat/pgsql
```
Note that please don't use newest pgsql RA because it needs new function ocf_local_nodename() that Fedora 19 doesn't have.
=== PostgreSQL (node1 only) ===
```
# su - postgres
$ mkdir /var/lib/pgsql/pg_archive
$ cd /var/lib/pgsql/data
$ initdb
```
Edit postgresql.conf.
* Point of postgresql.conf setting
** If there is "synchronous_standby_names" parameter, please delete it.
** Fixed IP cannot be written in listen_address.
** Replication_timeout is detection time for the replication cuts it, and wal_receiver_status_interval is an interval when HS tries connecting to PRI. To shorten detection, you should set this value to small.
** When using rep_mode=sync, RA adds "include" into postgresql.conf to switch replication mode. If you want to switch to rep_mode=async from sync, you need to delete it manually.
The main set part as follows. Please refer to the manual of PostgreSQL for other parameter.
Check the starting with the PostgreSQL unit, and the replication is possible.
```
listen_addresses = '*'
wal_level = hot_standby
synchronous_commit = on
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/pg_archive/%f'
max_wal_senders=5
wal_keep_segments = 32
hot_standby = on
restart_after_crash = off
replication_timeout = 5000
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
synchronous_commit = on
restart_after_crash = off
hot_standby_feedback = on
```
Edit pg_hba.conf.
Be careful. This explanation is not considered about the security.
```
host all all 127.0.0.1/32 trust
host all all 192.168.0.0/16 trust
host replication all 192.168.0.0/16 trust
```
start PostgreSQL on node1
```
$ pg_ctl -D /var/lib/pgsql/data start
```
=== PostgreSQL (node2 only) ===
Copy data from node1 to node2
```
# su - postgres
$ rm -rf /var/lib/pgsql/data/*
$ pg_basebackup -h 192.168.2.1 -U postgres -D /var/lib/pgsql/data -X stream -P
$ mkdir /var/lib/pgsql/pg_archive
```
Create /var/lib/pgsql/data/recovery.conf to confirm replication.
```
standby_mode = 'on'
primary_conninfo = 'host=192.168.2.1 port=5432 user=postgres application_name=node2'
restore_command = 'cp /var/lib/pgsql/pg_archive/%f %p'
recovery_target_timeline = 'latest'
```
Start PostgreSQL on node2
```
$ pg_ctl -D /var/lib/pgsql/data/ start
```
=== Confirm PostgreSQL replication success (node1 only) ===
```
# su - postgres
$ psql -c "select client_addr,sync_state from pg_stat_replication;"
client_addr | sync_state
---------------+------------
192.168.2.2 | async
```
=== Stop PostgreSQL (both nodes) ===
```
$ pg_ctl -D /var/lib/pgsql/data stop
$ exit
```
=== Corosync (both nodes) ===
Create /etc/corosync/corosync.conf
```
quorum {
provider: corosync_votequorum
expected_votes: 2
}
aisexec {
user: root
group: root
}
service {
name: pacemaker
ver: 0
}
totem {
version: 2
secauth: off
interface {
ringnumber: 0
bindnetaddr: 192.168.1.0
mcastaddr: 239.255.1.1
}
}
logging {
to_syslog: yes
}
```
Start corosync
```
# systemctl start corosync.service
```
You can see this log in /var/log/messages when you succeed in starting corosync.
```
Starting Corosync Cluster Engine (corosync): [ OK ]
```
=== Pacemaker (both nodes) ===
Clear current settings if it exists.
```
# rm -f /var/lib/pacemaker/cib/cib*
```
Start pacemaker
```
# systemctl start pacemaker.service
```
Check status
```
# crm_mon -Afr -1
Last updated: Mon Jul 8 09:46:27 2013
Last change: Mon Jul 8 09:46:27 2013 via crmd on node1
Stack: corosync
Current DC: node1 (402696384) - partition with quorum
Version: 1.1.9-3.fc19-781a388
2 Nodes configured, unknown expected votes
0 Resources configured.
Online: [ node1 node2 ]
Full list of resources:
Node Attributes:
* Node node1:
* Node node2:
Migration summary:
* Node node1:
* Node node2:
```
Make configuration file(config.pcs) for pcs command.
In this sample configuration, "vip-master" means virtual IP1 and "vip-rep" means virtual IP2,
and we use restart_on_promote="true" to explain operations simply.
If you use false, you should start pacemaker on node1 only and laod configuration.
After that you should copy data from node1 to node2 and start pacemaker on node2 to
align Timeline ID.
```
pcs cluster cib pgsql_cfg
pcs -f pgsql_cfg property set no-quorum-policy="ignore"
pcs -f pgsql_cfg property set stonith-enabled="false"
pcs -f pgsql_cfg resource defaults resource-stickiness="INFINITY"
pcs -f pgsql_cfg resource defaults migration-threshold="1"
pcs -f pgsql_cfg resource create vip-master IPaddr2 \
ip="192.168.0.3" \
nic="eth0" \
cidr_netmask="24" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="block"
pcs -f pgsql_cfg resource create vip-rep IPaddr2 \
ip="192.168.2.3" \
nic="eth2" \
cidr_netmask="24" \
meta migration-threshold="0" \
op start timeout="60s" interval="0s" on-fail="stop" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="ignore"
pcs -f pgsql_cfg resource create pgsql pgsql \
pgctl="/usr/bin/pg_ctl" \
psql="/usr/bin/psql" \
pgdata="/var/lib/pgsql/data/" \
rep_mode="sync" \
node_list="node1 node2" \
restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
master_ip="192.168.2.3" \
restart_on_promote='true' \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="4s" on-fail="restart" \
op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \
op promote timeout="60s" interval="0s" on-fail="restart" \
op demote timeout="60s" interval="0s" on-fail="stop" \
op stop timeout="60s" interval="0s" on-fail="block" \
op notify timeout="60s" interval="0s"
pcs -f pgsql_cfg resource master msPostgresql pgsql \
master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
pcs -f pgsql_cfg resource group add master-group vip-master vip-rep
pcs -f pgsql_cfg constraint colocation add master-group with Master msPostgresql INFINITY
pcs -f pgsql_cfg constraint order promote msPostgresql then start master-group symmetrical=false score=INFINITY
pcs -f pgsql_cfg constraint order demote msPostgresql then stop master-group symmetrical=false score=0
pcs cluster push cib pgsql_cfg
```
Load configuration
```
# sh config.pcs
```
Check status again.
```
# crm_mon -Afr -1
Last updated: Mon Jul 8 10:24:21 2013
Last change: Mon Jul 8 10:22:14 2013 via crm_attribute on node1
Stack: corosync
Current DC: node2 (419473600) - partition with quorum
Version: 1.1.9-3.fc19-781a388
2 Nodes configured, unknown expected votes
4 Resources configured.
Online: [ node1 node2 ]
Full list of resources:
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started node1
vip-rep (ocf::heartbeat:IPaddr2): Started node2
Clone Set: msPostgresql [pgsql] (promotable)
Promoted: [ node1 ]
Unpromoted: [ node2 ]
Node Attributes:
* Node node1:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000000009000080
+ pgsql-status : PRI
* Node node2:
+ master-pgsql : 100
+ pgsql-data-status : STREAMING|SYNC
+ pgsql-status : HS:sync
Migration summary:
* Node node1:
* Node node2:
```
== Operations ==
=== after fail-over ===
Kill PostgreSQL process at node1 to occur fail-over.
```
# killall -9 postgres
```
Check status
```
# crm_mon -Afr -1
Last updated: Mon Aug 26 19:12:33 2013
Last change: Mon Aug 26 19:12:32 2013 via crm_attribute on node2
Stack: corosync
Current DC: node2 (2566957248) - partition with quorum
Version: 1.1.9-3.fc19-781a388
2 Nodes configured, unknown expected votes
4 Resources configured.
Online: [ node1 node2 ]
Full list of resources:
Clone Set: msPostgresql [pgsql] (promotable)
Promoted: [ node2 ]
Stopped: [ node1 ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started node2
vip-rep (ocf::heartbeat:IPaddr2): Started node2
Node Attributes:
* Node node1:
+ master-pgsql : -INFINITY
+ pgsql-data-status : DISCONNECT
+ pgsql-status : STOP
* Node node2:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 000000000F000000
+ pgsql-status : PRI
Migration summary:
* Node node1:
pgsql: migration-threshold=1000000 fail-count=1000000 last-failure='Mon Aug 26 19:12:26 2013'
* Node node2:
Failed actions:
pgsql_start_0 (node=node1, call=157, rc=1, status=complete, last-rc-change=Mon Aug 26 19:12:27 2013
, queued=402ms, exec=0ms
): unknown error
```
Recovery node1 as replica.
Copy all data from node2 because the data of node1 may be inconsistent.
(at node1)
```
# su - postgres
$ rm -rf /var/lib/pgsql/data/
$ pg_basebackup -h 192.168.2.3 -U postgres -D /var/lib/pgsql/data -X stream -P
$ rm /var/lib/pgsql/tmp/PGSQL.lock
$ exit
# pcs resource cleanup msPostgresql
```
Check status again.
```
# crm_mon -Afr -1
Last updated: Mon Aug 26 19:21:17 2013
Last change: Mon Aug 26 19:21:17 2013 via crm_attribute on node2
Stack: corosync
Current DC: node2 (2566957248) - partition with quorum
Version: 1.1.9-3.fc19-781a388
2 Nodes configured, unknown expected votes
4 Resources configured.
Online: [ node1 node2 ]
Full list of resources:
Clone Set: msPostgresql [pgsql] (promotable)
Promoted: [ node2 ]
Unpromoted: [ node1 ]
Resource Group: master-group
vip-master (ocf::heartbeat:IPaddr2): Started node2
vip-rep (ocf::heartbeat:IPaddr2): Started node2
Node Attributes:
* Node node1:
+ master-pgsql : 100
+ pgsql-data-status : STREAMING|SYNC
+ pgsql-status : HS:sync
* Node node2:
+ master-pgsql : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 000000000F000000
+ pgsql-status : PRI
Migration summary:
* Node node1:
* Node node2:
```
== About node attributes ==
The RA defines the following states as a node attribute value of Pacemaker.
attribute can be seen in "crm_mon -A".
=== pgsql-status ===
A present state of PostgreSQL is displayed by the attribute value to which PRI or either HS node is displayed.
* STOP
** PostgreSQL has stopped.
* HS:alone
** It works as HS but HS doesn't connect PRI
* HS:connected
** it works as HS, and connected with PRI but HS isn't normal state of the replication (Data has not caught up with PRI).
* HS:async
** It works as HS, and the state is async mode. When RA is used in the rep_mode=async, it is possible to be promoted to Master.
* HS:sync
** It works as HS, and the state is sync mode. It is possible to be promoted to Master for the rep_mode=sync.
* PRI
** It operates by PRI.
=== pgsql-data-status ===
The transitional state of data is displayed. This state remains after stopping pacemaker.
When starting pacemaker next time, this state is used to judge whether my data is old or not.
* DISCONNECT
** Master changes other node state into DISCONNECT if Master can't detect connection of replication because of LAN failure or breakdown of replica and so on.
* {state}|{sync_state}
** Master changes other node state into {state}|{sync_state} if Master detects connection of replication.
** {state} and {sync_state} means state of replication which is retrieved using "select state and sync_state from pg_stat_replication" on Master.
** For example, INIT, CATCHUP, and STREAMING are displayed in {state} and ASYNC, SYNC are displayed in {sync_state}
* LATEST
** It's displayed when it's Master.
These states are the transitional state of final data, and it may be not consistent with the state of actual data.
For instance, During PRI, the state is "LATEST". But the node is stopped or down, this state "LATEST" is maintained if Master doesn't exist in other nodes. It never changes to "DISCONNECT" for oneself.
When other node newly is promoted, this new Master changes the state of old Master to "DISCONNECT". When any node can not become Master, this "LATEST" will be keeped.
=== pgsql-xlog-replay-loc ===
There is no Master node, it is displayed. RA decide to promote one node to Master comparing with the value of the last_replay_xlog_location or last_receive_xlog_replay_location among other node.
== appendix ==
=== sample settings for crm command ===
```
property \
no-quorum-policy="ignore" \
stonith-enabled="false" \
crmd-transition-delay="0s"
rsc_defaults \
resource-stickiness="INFINITY" \
migration-threshold="1"
ms msPostgresql pgsql \
meta \
master-max="1" \
master-node-max="1" \
clone-max="2" \
clone-node-max="1" \
notify="true"
clone clnPingCheck pingCheck
group master-group \
vip-master \
vip-rep
primitive vip-master ocf:heartbeat:IPaddr2 \
params \
ip="192.168.0.3" \
nic="eth0" \
cidr_netmask="24" \
op start timeout="60s" interval="0s" on-fail="stop" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="block"
primitive vip-rep ocf:heartbeat:IPaddr2 \
params \
ip="192.168.2.3" \
nic="eth2" \
cidr_netmask="24" \
meta \
migration-threshold="0" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="10s" on-fail="restart" \
op stop timeout="60s" interval="0s" on-fail="block"
primitive pgsql ocf:heartbeat:pgsql \
params \
pgctl="/usr/bin/pg_ctl" \
psql="/usr/bin/psql" \
pgdata="/var/lib/pgsql/data/" \
start_opt="-p 5432" \
rep_mode="sync" \
node_list="node1 node2" \
restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
master_ip="192.168.2.3" \
restart_on_promote="true" \
op start timeout="60s" interval="0s" on-fail="restart" \
op monitor timeout="60s" interval="4s" on-fail="restart" \
op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \
op promote timeout="60s" interval="0s" on-fail="restart" \
op demote timeout="60s" interval="0s" on-fail="stop" \
op stop timeout="60s" interval="0s" on-fail="block" \
op notify timeout="60s" interval="0s"
colocation rsc_colocation-1 inf: master-group msPostgresql:Master
order rsc_order-1 0: msPostgresql:promote master-group:start symmetrical=false
order rsc_order-2 0: msPostgresql:demote master-group:stop symmetrical=false
```
You can load as follows.
# crm configure load update [filename]
=== Q&A ===
* How do I force start Master although pgsql-data-status is "DISCONNECT"?
** `crm_attribute -l forever -N {Node Name} -n "pgsql-data-status" -v "LATEST"`
* What's PGSQL.lock file?
** The file is created on promote. And it's deleted on demote only if replica does not exist. If this file remains in a node, it means that the data may be inconsistent. Please copy all data from PRI and delete this lock file.
* How do we stop all servers?
** First, stop replica. After that stop Master. If you stop Master in first, PGSQL.lock file remains.