1) Check the log file
postgres=# show data_directory ;
data_directory
-------------------------
/u02/PostgreSQL/10/data
(1 row)
postgres=# show log_directory ;
log_directory
---------------
log
(1 row)
postgres=#
postgres=# show log_filename ;
log_filename
--------------------------------
postgresql-%Y-%m-%d_%H%M%S.log
(1 row)
postgres=#
postgres=# select pg_current_logfile();
pg_current_logfile
--------------------------------------
log/postgresql-2021-10-24_000000.log
(1 row)
postgres=# \q
-bash-4.2$ pwd
/u02/PostgreSQL/10/data/log
-bash-4.2$
-rw------- 1 postgres postgres 1676 Oct 24 19:41 postgresql-2021-10-24_000000.log
-bash-4.2$
-bash-4.2$ tail -500f postgresql-2021-10-24_000000.log
[unknown] 2021-10-24 10:30:30 IST [unknown] [unknown] 192.168.0.108(32647) LOG: connection received: host=192.168.0.108 port=32647
[unknown] 2021-10-24 10:30:31 IST replicator [unknown] 192.168.0.108(32647) LOG: replication connection authorized: user=replicator
[unknown] 2021-10-24 14:06:24 IST [unknown] [unknown] 192.168.0.108(32861) LOG: connection received: host=192.168.0.108 port=32861
[unknown] 2021-10-24 14:06:24 IST replicator [unknown] 192.168.0.108(32861) LOG: replication connection authorized: user=replicator
[unknown] 2021-10-24 19:39:34 IST [unknown] [unknown] [local] LOG: connection received: host=[local]
[unknown] 2021-10-24 19:39:34 IST postgres postgres [local] LOG: connection authorized: user=postgres database=postgres
psql 2021-10-24 19:39:40 IST postgres postgres [local] LOG: statement: show data_directory ;
psql 2021-10-24 19:39:40 IST postgres postgres [local] LOG: duration: 0.334 ms
psql 2021-10-24 19:39:50 IST postgres postgres [local] LOG: statement: show log_directory ;
psql 2021-10-24 19:39:50 IST postgres postgres [local] LOG: duration: 0.070 ms
[unknown] 2021-10-24 19:40:33 IST [unknown] [unknown] [local] LOG: connection received: host=[local]
[unknown] 2021-10-24 19:40:33 IST postgres postgres [local] LOG: connection authorized: user=postgres database=postgres
psql 2021-10-24 19:41:16 IST postgres postgres [local] LOG: statement: show log_filename ;
psql 2021-10-24 19:41:16 IST postgres postgres [local] LOG: duration: 0.214 ms
psql 2021-10-24 19:41:31 IST postgres postgres [local] LOG: statement: select pg_current_logfile();
psql 2021-10-24 19:41:31 IST postgres postgres [local] LOG: duration: 0.899 ms
PostgreSQL: Find Slow, Long-Running, and Blocked Queries
For active queries, PostgreSQL has a “statistics collector” subsystem that collates data on table, server, query, index, and connection activity. The database exposes information through a number of “pg_stat” predefined views and some lower-level statistic functions for power users.
Additionally, there are several system information functions that can extract session and system information on processes and queries that are currently ongoing.
The key tables and functions that I’ve found useful are:
- pg_stat_activity: A view with one row per server process, showing details of the running query for each.
- pg_locks: Information on current locks held within the database by open transactions, with one row per lockable object.
- pg_blocking_pids(): A function that can find the process IDs (PIDs) of sessions that are blocking the PostgreSQL server process of a supplied PID.
- pg_cancel_backend(): Function that cancels the currently running query by sending a SIGINT to a process ID.
- pg_terminate_backend(): Terminate a backend process completely (the query and usually the connection) on the database (uses SIGTERM instead of SIGINT).
List long-running and slow queries on PostgreSQL
Poorly written queries or poorly structured data can cause very long execution times on your database. Typically discovered through slow response times or sustained increases in database CPU, the pg_stat_activity view can help to find out what query is causing issues. The pg_stat_activity view contains details of all currently running queries, including user, connection, and timing details.
A simple select * from pg_stat_activity will provide a snapshot of what is happening on your PostgreSQL database, with one row per server process and these key columns:
- datname: The database name that the query is running on.
- pid: The process ID of the backend running the query. This PID can be passed to pg_terminate_backend() or pg_cancel_backend() if the query needs to be stopped.
- usename: The user name that is running / connected to this backend.
- client_addr: IP address of the connection to this backend. If this is empty/null, the row may refer to an internal connection or internal process.
- backend_start: Time when the backend process was started, or when the client connected.
- query_start: Time when the currently active query was started.
- state: The current status of the backend, with options:
- active: currently executing the query in the query column.
- idle: not executing anything, and waiting for a new command.
- idle in transaction: the backend is inside a transaction, but not currently doing any work. The related state “idle in transaction (aborted)” is the same, except that one of the statements in the transaction caused an error.
- fastpath function call: Executing a fast-path function.
- disabled: track_activities has been disabled for this backend.
- wait_event & wait_event_type: The event type that the backend is waiting for. This can be an extensive list. Watch for locks on database objects here.
- query: The text of the backend's most recent, or currently active, query. As per the documentation, this is truncated at 1024 bytes by default, changeable with “track_activity_query_size” in the server configuration.
- backend_type: There are a few different types, but for most connections from external clients this will be “client backend”.
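As a minimal sketch (assuming PostgreSQL 10 or later, where backend_type exists), pulling just the columns above gives a more readable snapshot than select *:
SELECT datname, pid, usename, client_addr, backend_start, query_start,
       state, wait_event_type, wait_event, backend_type, query
FROM pg_stat_activity
ORDER BY query_start;   -- oldest queries first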
Find queries running longer than 5 minutes
One useful addition is to show, for debugging purposes, all queries that have been running for longer than five minutes.
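A hedged sketch of such a query (the same idea as the pre-9.5 example later in this post; adjust the interval as needed):
SELECT pid, usename, datname,
       now() - query_start AS duration,
       state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'   -- only queries running longer than 5 minutes
ORDER BY duration DESC;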
Kill long-running PostgreSQL query processes
Where some queries look like they’re not going to finish, you can use the pid (process ID) from the pg_stat_activity or pg_locks views to terminate the running process.
pg_cancel_backend(pid) will attempt to gracefully cancel a running query process. pg_terminate_backend(pid) will immediately kill the running query process, but can have side effects on other queries running on your database server: the full connection may be reset when running pg_terminate_backend, so other queries on that connection can be affected. Use it as a last resort.
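For example, a hedged sketch that cancels every active query that has been running for over an hour for a hypothetical application user app_user (the user name and interval are illustrative only; review the matching rows before running anything like this):
SELECT pid, usename, now() - query_start AS duration, query,
       pg_cancel_backend(pid)          -- attempts a graceful cancel for each matching row
FROM pg_stat_activity
WHERE state = 'active'
  AND usename = 'app_user'             -- hypothetical user, substitute your own filter
  AND now() - query_start > interval '1 hour';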
Finding blocked processes and blocking queries
The pg_blocking_pids() function is a useful shortcut to find the database connections / sessions that are blocking another session. It returns a PostgreSQL array of the PIDs that are blocking the server process PID you supply. Typically, a server process blocks another if it holds a lock that the second process needs.
A useful query to show the blocked processes on your database, along with the actual query that is blocking them (in the spirit of the “How to detect query which holds the lock in Postgres?” Stack Overflow answer), is sketched below.
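This is not the exact query from that answer, but a minimal sketch built on pg_blocking_pids() that shows each blocked session next to the session(s) blocking it:
SELECT blocked.pid      AS blocked_pid,
       blocked.usename  AS blocked_user,
       blocked.query    AS blocked_query,
       blocking.pid     AS blocking_pid,
       blocking.usename AS blocking_user,
       blocking.query   AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));   -- pg_blocking_pids() requires 9.6+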
Viewing Locks on Tables – pg_locks
The pg_locks view provides information about the locks held by active processes in the database, with a single entry per lockable object (which can be tables, pages, transaction IDs, and other database objects). Locks are acquired as part of transactions, and are typically used where it is important that the data in a table or database object is not changed before the full transaction is completed.
A simple select * from pg_locks will immediately provide you with these key columns:
- locktype: The type of object that is locked: relation, page, object, transactionid, userlock, etc. (see the documentation for the full list).
- relation: OID (object ID) of the relation targeted by the lock. Join this to the pg_class OID to get the table names.
- page: Page number targeted by the lock within the relation.
- transactionid: The ID (xid) of the transaction targeted by the lock.
- pid: Process ID of the server process holding or waiting for this lock. Joining this column onto the pg_stat_activity view can provide all of the information above on the query / user / time etc.
- granted: True if the lock is held, false if the process is waiting for it.
Viewing locks with table names and queries
Joining the pg_locks view to pg_stat_activity for the query and to pg_class for the table names can be useful to gather more context on what is locking on your database at any point in time (for more, see “PostgreSQL find locks including the table name” on Stack Overflow):
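A minimal sketch of such a join (not the exact query from the Stack Overflow post):
SELECT a.pid,
       c.relname AS table_name,
       l.locktype,
       l.mode,
       l.granted,
       a.usename,
       a.state,
       a.query
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation       -- resolve the locked relation's name
JOIN pg_stat_activity a ON a.pid = l.pid    -- attach the session and query details
WHERE l.relation IS NOT NULL
ORDER BY l.granted, a.pid;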
How to find long running queries in PostgreSQL
As administrators, from time to time we need to investigate whether any query is running indefinitely on PostgreSQL. These long-running queries can impact database performance, and they are often stuck waiting on some background process.
Long running queries in PostgreSQL
1. The following query returns the top 10 currently running queries, longest-running first.
select current_timestamp - query_start as runtime, datname, usename, query FROM pg_stat_activity where state = 'active' order by 1 desc limit 10;
2. To get the list of queries that have been running for more than 1 minute in PostgreSQL:
select current_timestamp - query_start as runtime, datname, usename, query FROM pg_stat_activity where state = 'active' and current_timestamp - query_start > '1 min' order by 1 desc limit 10;
3. To get the list of queries that have been running for more than 5 minutes in PostgreSQL:
select current_timestamp - query_start as runtime, datname, usename, query FROM pg_stat_activity where state = 'active' and current_timestamp - query_start > '5 min' order by 1 desc limit 10;
Pre-9.5 versions:
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';
How to kill long-running processes:
1. First, try to cancel the query.
If you have the proper approvals to cancel the process, execute the following command, substituting the process ID for __pid__:
SELECT pg_cancel_backend(__pid__);
(where __pid__ is the pid of the query, as reported in pg_stat_activity)
2. If the query is still present on the PostgreSQL server after being cancelled, and you have the proper approvals to kill the process, execute the following command:
SELECT pg_terminate_backend(__pid__);
Query to check blocking in PostgreSQL Server
Syntax to check blocking PostgreSQL Server:
SELECT datname, usename, wait_event_type, wait_event, pg_blocking_pids(pid) AS blocked_by_pid, query FROM pg_stat_activity WHERE wait_event IS NOT NULL;
Output is:
-[ RECORD 1 ]---+-----------------------
datname         | postgres
usename         | james
wait_event_type | Lock
wait_event      | relation
blocked_by_pid  | {15356}
query           | select * from address;
As a PostgreSQL Database Administrator, there are the everyday expectations to check on backups, apply DDL changes, make sure the logs don’t have any game-breaking ERRORs, and answer panicked calls from developers whose reports are running twice as long as normal and who have a meeting in ten minutes.
Even with a good understanding of the health of managed databases, there will always be new cases and new issues popping up relating to performance and how the database “feels”. Whether it’s a panicked email, or an open ticket for “the database feels slow”, this common task can generally be followed with a few steps to check whether or not there is a problem with PostgreSQL, and what that problem may be.
This is by no means an exhaustive guide, nor do the steps need to be done in any specific order. Rather, it is a set of initial steps that can be taken to help find the common offenders quickly, as well as gain new insight as to what the issue may be. A developer may know how the application acts and responds, but the Database Administrator knows how the database acts and responds to the application, and together the issue can be found.
NOTE: The queries to be executed should be done as a superuser, such as ‘postgres’ or any database user granted the superuser permissions. Limited users will either be denied or have data omitted.
Step 0 - Information Gathering
Get as much information as possible from whoever says the database seems slow; specific queries, applications connected, timeframes of the performance slowness, etc. The more information they give the easier it will be to find the issue.
Step 1 - Check pg_stat_activity
The request may come in many different forms, but if “slowness” is the general issue, checking pg_stat_activity is the first step to understand just what’s going on. The pg_stat_activity view (documentation for every column in this view can be found here) contains a row for every server process / connection to the database from a client. There is a handful of useful information in this view that can help.
NOTE: pg_stat_activity has been known to change structure over time, refining the data it presents. Understanding of the columns themselves will help build queries dynamically as needed in the future.
Notable columns in pg_stat_activity are:
- query: a text column showing the query that’s currently being executed, waiting to be executed, or was last executed (depending on the state). This can help identify what query / queries a developer may be reporting are running slowly.
- client_addr: The IP address from which this connection and query originated. If it is empty (or null), the connection originated from localhost.
- backend_start, xact_start, query_start: These three provide a timestamp of when each started respectively. Backend_start represents when the connection to the database was established, xact_start is when the current transaction started, and query_start is when the current (or last) query started.
- state: The state of the connection to the database. ‘active’ means it’s currently executing a query, ‘idle’ means it’s waiting for further input from the client, and ‘idle in transaction’ means it’s waiting for further input from the client while holding an open transaction. (There are others, however their likelihood is rare; consult the documentation for more information.)
- datname: The name of the database the connection is currently connected to. In multiple database clusters, this can help isolate problematic connections.
- wait_event_type and wait_event: These columns will be null when a query isn’t waiting; if it is waiting, they will contain information on why, and exploring pg_locks can identify what it’s waiting on. (PostgreSQL 9.5 and earlier only have a boolean column called ‘waiting’: true if waiting, false if not.)
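As a starting point, a hedged sketch that surfaces the oldest transactions and queries first (long-open transactions are a frequent cause of “the database feels slow”; column names per PostgreSQL 9.6 and later):
SELECT pid, datname, usename, client_addr, state,
       wait_event_type, wait_event,
       now() - xact_start  AS xact_age,
       now() - query_start AS query_age,
       query
FROM pg_stat_activity
WHERE state <> 'idle'             -- ignore connections that are merely waiting for input
ORDER BY xact_start NULLS LAST;   -- oldest open transactions first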
1.1. Is the query waiting / blocked?
If there is a specific query or set of queries that are “slow” or “hung”, check whether they are waiting for another query to complete. Due to relation locking, other queries can lock a table and not let any other queries access or change data until that query or transaction is done.
PostgreSQL 9.5 and earlier:
SELECT * FROM pg_stat_activity WHERE waiting = TRUE;
PostgreSQL 9.6:
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;
PostgreSQL 10 and later:
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL AND backend_type = 'client backend';
The results of this query will show any connections currently waiting on another connection to release locks on a relation that is needed.
If the query is blocked by another connection, there are some ways to find out just what they are. In PostgreSQL 9.6 and later, the function pg_blocking_pids() allows the input of a process ID that’s being blocked, and it will return an array of process IDs that are responsible for blocking it.
PostgreSQL 9.6 and later:
SELECT * FROM pg_stat_activity WHERE pid = ANY(pg_blocking_pids(<pid of blocked query>));
PostgreSQL 9.5 and earlier:
SELECT blocked_locks.pid          AS blocked_pid,
       blocked_activity.usename   AS blocked_user,
       blocking_locks.pid         AS blocking_pid,
       blocking_activity.usename  AS blocking_user,
       blocked_activity.query     AS blocked_statement,
       blocking_activity.query    AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
     ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
(Available from the PostgreSQL Wiki).
These queries will point to whatever is blocking a specific PID that’s provided. With that, a decision can be made to kill the blocking query or connection, or let it run.
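Once the blocking PID is known, cancelling or (as a last resort) terminating it follows the same pattern as earlier; as a sketch, for a hypothetical blocking PID of 15356:
SELECT pg_cancel_backend(15356);      -- try a graceful cancel of the blocking query first
SELECT pg_terminate_backend(15356);   -- last resort: drops the blocking connection entirely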
Step 2 - If the queries are running, why are they taking so long?
2.1. Is the planner running queries efficiently?
If a query (or set of queries) in question has the status of ‘active’, then it’s actually running. If the whole query isn’t available in pg_stat_activity, fetch it from the developers or the PostgreSQL log and start exploring the query planner.
EXPLAIN SELECT * FROM postgres_stats.table_stats t JOIN hosts h ON (t.host_id = h.host_id) WHERE logged_date >= '2018-02-01' AND logged_date < '2018-02-04' AND t.india_romeo = 569;
 Nested Loop  (cost=0.280..1328182.030 rows=2127135 width=335)
   ->  Index Scan using six on victor_oscar echo  (cost=0.280..8.290 rows=1 width=71)
         Index Cond: (india_romeo = 569)
   ->  Append  (cost=0.000..1306902.390 rows=2127135 width=264)
         ->  Seq Scan on india_echo romeo  (cost=0.000..0.000 rows=1 width=264)
               Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))
         ->  Seq Scan on juliet victor_echo  (cost=0.000..437153.700 rows=711789 width=264)
               Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))
         ->  Seq Scan on india_papa quebec_bravo  (cost=0.000..434936.960 rows=700197 width=264)
               Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))
         ->  Seq Scan on two oscar  (cost=0.000..434811.720 rows=715148 width=264)
               Filter: ((logged_date >= '2018-02-01'::timestamp with time zone) AND (logged_date < '2018-02-04'::timestamp with time zone) AND (india_romeo = 569))
This example shows a query plan for a two table join that also hits a partitioned table. We’re looking for anything that can cause the query to be slow, and in this case the planner is doing several Sequential Scans on partitions, suggesting that they are missing indexes. Adding indexes to these tables for column ‘india_romeo’ will instantly improve this query.
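A hedged sketch of that fix, using the (obfuscated) partition table names from the plan above; substitute the real table names in your own schema:
CREATE INDEX CONCURRENTLY ON india_echo (india_romeo);   -- CONCURRENTLY avoids blocking writes,
CREATE INDEX CONCURRENTLY ON juliet (india_romeo);       -- but cannot run inside a transaction block
CREATE INDEX CONCURRENTLY ON india_papa (india_romeo);
CREATE INDEX CONCURRENTLY ON two (india_romeo);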
Things to look for are sequential scans, nested loops, expensive sorting, etc. Understanding the query planner is crucial to making sure queries are performing the best way possible, official documentation can be read for more information here.
2.2. Are the tables involved bloated?
If the queries are still feeling slow without the query planner pointing at anything obvious, it’s time to check the health of the tables involved. Are they too big? Are they bloated?
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'mytable';
 n_live_tup | n_dead_tup
------------+------------
      15677 |    8275431
(1 row)
Here we see that there are many times more dead rows than live rows, which means to find the correct rows, the engine must sift through data that’s not even relevant to find real data. A vacuum / vacuum full on this table will increase performance significantly.
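A minimal sketch of that cleanup, using the table from the example above:
VACUUM (VERBOSE, ANALYZE) mytable;   -- reclaims dead tuples for reuse and refreshes planner statistics
-- VACUUM FULL mytable;              -- rewrites the table to reclaim disk space, but takes an ACCESS EXCLUSIVE lock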
Step 3 - Check the Logs
If the issue still can’t be found, check the logs for any clues.
FATAL / ERROR messages:
Look for messages that may be causing issues, such as deadlocks or long wait times to gain a lock.
Checkpoints
Hopefully log_checkpoints is set to on, which will write checkpoint information to the logs. There are two types of checkpoints, timed and requested (forced). If checkpoints are being forced, then dirty buffers in memory must be written to disk before processing more queries, which can give a database system an overall feeling of “slowness”. Increasing checkpoint_segments or max_wal_size (depending on the database version) will give the checkpointer more room to work with, as well as help the background writer take some of the writing load.
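A quick way to see how often each type of checkpoint is happening is pg_stat_bgwriter (a hedged sketch; these columns exist through PostgreSQL 16, after which some of them move to pg_stat_checkpointer):
SELECT checkpoints_timed,   -- checkpoints triggered on schedule by checkpoint_timeout
       checkpoints_req,     -- requested (forced) checkpoints; a high value relative to
                            -- checkpoints_timed suggests max_wal_size (or checkpoint_segments
                            -- on older versions) is too small
       buffers_checkpoint,
       stats_reset
FROM pg_stat_bgwriter;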
Step 4 - What’s the Health of the Host System?
If there are no clues in the database itself, perhaps the host itself is overloaded or having issues. Anything from an overloaded I/O channel to disk, memory overflowing to swap, or even a failing drive could be the culprit, and none of these issues would be apparent with anything we looked at before. Assuming the database is running on a *nix based operating system, here are a few things that can help.
4.1. System Load
Using ‘top’, look at the load average for the host. If the number is approaching or exceeding the number of cores on the system, it could simply be too many concurrent connections hitting the database, bringing it to a crawl as it tries to catch up.
load average: 3.43, 5.25, 4.85
4.2. System Memory and SWAP
Using ‘free’, check to see if SWAP has been used at all. Memory overflowing to SWAP in a PostgreSQL database environment is extremely bad for performance, and many DBAs will even eliminate SWAP from database hosts, as an ‘out of memory’ error is preferable to a sluggish system for many.
If SWAP is being used, a reboot of the system will clear it out, and increasing total system memory or re-configuring memory usage for PostgreSQL (such as lowering shared_buffers or work_mem) may be in order.
[postgres@livedb1 ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7986         225        1297          12        6462        7473
Swap:          7987        2048        5939
4.3. Disk Access
PostgreSQL attempts to do a lot of its work in memory and to spread writes to disk out over time to minimize bottlenecks, but on an overloaded system with heavy writing, it’s easy for heavy reads and writes to slow the whole system as it catches up on the demands. Faster disks, more disks, and more I/O channels are some ways to increase the amount of work that can be done.
Tools like ‘iostat’ or ‘iotop’ can help pinpoint if there is a disk bottleneck, and where it may be coming from.
4.4. Check the Logs
If all else fails, or even if not, the logs should always be checked to see if the system is reporting anything that’s not right. We already discussed checking the PostgreSQL logs, but the system logs can give information about issues such as failing disks, failing memory, network problems, etc. Any one of these issues can cause the database to act slow and unpredictable, so a good understanding of what the system looks like when healthy can help find these issues.
Step 5 - Something Still Doesn’t Make Sense?
Even the most seasoned administrators will run into something new that doesn’t make sense. That’s where the global PostgreSQL community can come in to help out. Much like step #0, the more clear information given to the community, the easier they can help out.
5.1. PostgreSQL Mailing Lists
Since PostgreSQL is developed and managed by the open source community, there are thousands of people who talk through the mailing lists to discuss countless topics including features, errors, and performance issues. The mailing lists can be found here, with pgsql-admin and pgsql-performance being the most important for looking for help with performance issues.
5.2. IRC
Freenode hosts several PostgreSQL channels with developers and administrators all over the world, and it’s not hard to find a helpful person to track down where issues may be coming from. More information can be found on the PostgreSQL IRC page.
Finding long running SQL queries in PostgreSQL
Recently, in one of the projects I was working on, I came across a situation where the SQL query times were getting slower and slower each day, and I had to figure out what was happening to the system.
One of the things I checked immediately was whether there were any long-running queries in the system affecting the other queries, and voilà, there it was: the system had a bunch of rogue queries that had been running for over a day or so.
Running the following command gave me queries that were running for more than 5 seconds.
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
state,
query
FROM pg_stat_activity
WHERE state = 'active' AND query NOT ILIKE '%pg_stat_activity%' AND (now() - pg_stat_activity.query_start) > interval '5 seconds'
ORDER BY duration desc;
This would give you a result like:
-[ RECORD 1 ]------------------------------------------------------------
pid | 6473
duration | 00:00:01.011715
state | active
query | ..........
Now that you have the pid of the query, let’s try to get some more information on it.
strace -f -s2048 -p 6473
Running strace on the backend process gives you more information on what the query is doing.
In my case, it was waiting indefinitely for acquiring a lock, which would never succeed. Solution for me was to terminate the query as it had already gone rogue.
SELECT pg_cancel_backend(6473)
The pg_cancel_backend function cancels the backend’s currently running query.