Sunday, 11 August 2024

Switchover and Switchback - PostgreSQL-13

 

Below are setup details and the same will be used in this demonstration.

Sr No.HostnameIPRole
1postgres@rac09-p192.168.1.43Master / Primary Server
2postgres@rac10-p192.168.1.44Standby / Secondary Server


Basic Crosscheck :

postgres=# show listen_addresses;
 listen_addresses
------------------
 *
(1 row)

postgres=# show wal_level;
 wal_level
-----------
 replica
(1 row)

postgres=# show max_wal_senders;
 max_wal_senders
-----------------
 10
(1 row)

postgres=# show max_replication_slots;
 max_replication_slots
-----------------------
 10
(1 row)

postgres=# show max_wal_size;
 max_wal_size
--------------
 1GB
(1 row)


postgres=# show wal_log_hints;
 wal_log_hints
---------------
 off
(1 row)
postgres=#
postgres=# show archive_mode;
 archive_mode
--------------
 on
(1 row)



postgres=# show archive_command;
                                archive_command
-------------------------------------------------------------------------------
 test ! -f /backup/PostgreSQL13_arch/%f && cp %p //backup/PostgreSQL13_arch/%f
(1 row)



1. Check whether replication process and the sync status:

-- Run on master server(192.168.1.43 )

postgres=# \x
Expanded display is on.
postgres=#
postgres=#
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+---------------------------------
pid              | 16331
usesysid         | 16385
usename          | repuser
application_name | walreceiver
client_addr      | 192.168.1.44
client_hostname  |
client_port      | 23210
backend_start    | 2024-08-11 13:06:18.175803+05:30
backend_xmin     |
state            | streaming
sent_lsn         | 0/4019BC0
write_lsn        | 0/4019BC0
flush_lsn        | 0/4019BC0
replay_lsn       | 0/4019BC0
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2024-08-11 18:50:57.943598+05:30




--Check the lag difference


postgres=# SELECT pg_wal_lsn_diff(sent_lsn, replay_lsn) from pg_stat_replication;
-[ RECORD 1 ]---+--
pg_wal_lsn_diff | 0



2. Check Replication Status on the secondary server:-

Now the replication is up and running. You can get the status of your replication from the pg_stat_wal_receiver table on the secondary server.

postgres=# SELECT "status", "last_msg_send_time", "slot_name", "sender_host" FROM pg_stat_wal_receiver;

-[ RECORD 1 ]------+---------------------------------
status             | streaming
last_msg_send_time | 2024-08-11 18:39:26.562249+05:30
slot_name          |
sender_host        | 192.168.1.43





postgres=# select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();
-[ RECORD 1 ]-----------------+---------------------------------
pg_last_wal_receive_lsn       | 0/4019BC0
pg_last_wal_replay_lsn        | 0/4019BC0
pg_last_xact_replay_timestamp | 2024-08-11 18:14:17.652363+05:30







3. Check the status of Recovery process on Slave database.



postgres=#  select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)




postgres=#
postgres=# select pg_current_wal_lsn();
ERROR:  recovery is in progress
HINT:  WAL control functions cannot be executed during recovery.
postgres=#
postgres=#
postgres=#


As our database is running in recovery process hence it is not allowing us to see the current WAL lsn number.





Now we will proceed with switchover

2. Shutdown the MASTER: [ ON MASTER SERVER 192.168.1.43 ]

[postgres@rac09-p ~]$ /usr/pgsql-13/bin/pg_ctl stop -D /var/lib/pgsql/13/data
waiting for server to shut down.... done
server stopped



2024-08-11 19:04:13.852 IST [24646] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.1.43" and accepting
                TCP/IP connections on port 5432?
2024-08-11 19:04:18.854 IST [24647] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.1.43" and accepting
                TCP/IP connections on port 5432?
2024-08-11 19:04:23.856 IST [24648] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.1.43" and accepting
                TCP/IP connections on port 5432?
2024-08-11 19:04:28.862 IST [24657] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.1.43" and accepting
                TCP/IP connections on port 5432?
2024-08-11 19:04:33.865 IST [24658] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
                Is the server running on host "192.168.1.43" and accepting
                TCP/IP connections on port 5432?





3. Promote slave as new master:[ RUN ON SLAVE SERVER 192.168.1.44 ]


[postgres@rac10-p ~]$
[postgres@rac10-p ~]$ /usr/pgsql-13/bin/pg_ctl promote -D /var/lib/pgsql/13/data
waiting for server to promote.... done
server promoted



2024-08-11 19:05:41.325 IST [13005] LOG:  received promote request
2024-08-11 19:05:41.325 IST [13005] LOG:  redo done at 0/5000028
2024-08-11 19:05:41.325 IST [13005] LOG:  last completed transaction was at log time 2024-08-11 18:14:17.652363+05:30
2024-08-11 19:05:41.328 IST [13005] LOG:  selected new timeline ID: 2
2024-08-11 19:05:41.361 IST [13005] LOG:  archive recovery complete
2024-08-11 19:05:41.367 IST [13003] LOG:  database system is ready to accept connections


We can see it has been opened with read , write with a new timeline id. But replication is broken. i.e currently we have only master server, no slave server.



So for the switchover to successful, we need to make old master as slave




4. On OLD MASTER( i.e NEW SLAVE 192.168.1.43)

 Create standby.signal on Old Primary

Create the standby.signal file on the old primary to ensure it starts as a standby when brought back online:


-- Create a standby.signal file(empty)

[postgres@rac09-p ~]$ touch /var/lib/pgsql/13/data/standby.signal


Edit postgresql.auto.conf on Old Primary - 192.168.1.43 -VIMP
[postgres@rac09-p data]$ cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
archive_mode = 'on'
archive_command = 'test ! -f /backup/PostgreSQL13_arch/%f && cp %p //backup/PostgreSQL13_arch/%f'
listen_addresses = '*'
primary_conninfo = 'user=repuser host=192.168.1.44 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'




5 . START NEW SLAVE ( 10.20.30.6):


/usr/pgsql-13/bin/pg_ctl start -D /var/lib/pgsql/13/data

[postgres@rac09-p data]$ /usr/pgsql-13/bin/pg_ctl start -D /var/lib/pgsql/13/data
waiting for server to start....2024-08-11 19:12:26.340 IST [32583] LOG:  redirecting log output to logging collector process
2024-08-11 19:12:26.340 IST [32583] HINT:  Future log output will appear in directory "log".
 done
server started


Monitor Logs

Keep an eye on the logs to monitor the status of the database. Use the following command to view the logs:
The logs will have something like “Starting as a standby”


2024-08-11 19:42:46.949 IST [33947] LOG: starting PostgreSQL 13.16 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22), 64-bit

2024-08-11 19:42:46.950 IST [33947] LOG: listening on IPv4 address "0.0.0.0", port 5432

2024-08-11 19:42:46.950 IST [33947] LOG: listening on IPv6 address "::", port 5432

2024-08-11 19:42:46.953 IST [33947] LOG: listening on Unix socket "/run/postgresql/.s.PGSQL.5432"

2024-08-11 19:42:46.955 IST [33947] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"

2024-08-11 19:42:46.958 IST [33949] LOG: database system was shut down in recovery at 2024-08-11 19:29:38 IST

2024-08-11 19:42:46.958 IST [33949] LOG: entering standby mode

2024-08-11 19:42:46.961 IST [33949] LOG: consistent recovery state reached at 0/50000A0

2024-08-11 19:42:46.961 IST [33949] LOG: invalid record length at 0/50000A0: wanted 24, got 0

2024-08-11 19:42:46.961 IST [33947] LOG: database system is ready to accept read only connections

2024-08-11 19:42:48.164 IST [33953] LOG: fetching timeline history file for timeline 2 from primary server

2024-08-11 19:42:48.167 IST [33953] LOG: started streaming WAL from primary at 0/5000000 on timeline 1

2024-08-11 19:42:48.168 IST [33953] LOG: replication terminated by primary server

2024-08-11 19:42:48.168 IST [33953] DETAIL: End of WAL reached on timeline 1 at 0/50000A0.

2024-08-11 19:42:48.169 IST [33949] LOG: new target timeline is 2

2024-08-11 19:42:48.170 IST [33953] LOG: restarted WAL streaming at 0/5000000 on timeline 2

2024-08-11 19:42:48.202 IST [33949] LOG: redo starts at 0/50000A0



6 . Check whether master-slave replication is working fine or not:

Verify switchover

Once the configurations are completed it’s imported to verify things. We will use few methods to verify the switchover process.

  1. Verify the switchover using the available views.
    First we will check if the server is receiving wals or not run below query on new slave:
    select * from pg_stat_wal_receiver;
    select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp()
    ;



postgres=# select * from pg_stat_wal_receiver;

-[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------

pid                   | 33953

status                | streaming

receive_start_lsn     | 0/5000000

receive_start_tli     | 2

written_lsn           | 0/50032A0

flushed_lsn           | 0/50032A0

received_tli          | 2

last_msg_send_time    | 2024-08-11 22:38:05.809966+05:30

last_msg_receipt_time | 2024-08-11 22:38:05.950297+05:30

latest_end_lsn        | 0/50032A0

latest_end_time       | 2024-08-11 22:34:35.554242+05:30

slot_name             |

sender_host           | 192.168.1.44

sender_port           | 5432

conninfo              | user=repuser host=192.168.1.44 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any



postgres=# select pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();

-[ RECORD 1 ]-----------------+---------------------------------

pg_last_wal_receive_lsn       | 0/50032A0

pg_last_wal_replay_lsn        | 0/50032A0

pg_last_xact_replay_timestamp | 2024-08-11 22:29:28.892188+05:30



Verify by matching the tables in both DB’s
Log into the new live (Primary) database and create a testing schema, tables and insert some records for the verify.

CREATE TABLE test_schema.test_table (
id serial PRIMARY KEY,
name varchar(100),
age int
);

INSERT INTO test_schema.test_table (name, age) VALUES
('Alice', 28),
('Bob', 32),
('Carol', 24);

Now log into the new Slave server and verify the inserted records, the records should be visible.

3. Check the new slave is Read-only or not (It should be read-only)

SELECT pg_is_in_recovery();
This query will return a single boolean value:
true if the server is a standby and false if it's the primary.


postgres=# SELECT pg_is_in_recovery();

-[ RECORD 1 ]-----+--

pg_is_in_recovery | t




Conclusion

By following these steps, you can successfully perform a switchover in PostgreSQL, changing the IP addresses involved and ensuring a seamless transition from the old primary to the new standby.

Remember that a well-managed PostgreSQL environment requires regular monitoring, maintenance, and backups. With this guide, you’re now equipped to build a robust and highly available PostgreSQL setup that can handle your organization’s critical data with confidence.





No comments:

Post a Comment

Master and Slave - Sync check - PostgreSQL

  1) Run the below Query on Primary:- SELECT     pid,     usename,     application_name,     client_addr,     state,     sync_state,     sen...