I have a mildly complex query that performs rather poorly:
UPDATE
    web_pages
SET
    state = 'fetching'
WHERE
    web_pages.id = (
        SELECT
            web_pages.id
        FROM
            web_pages
        WHERE
            web_pages.state = 'new'
        AND
            normal_fetch_mode = true
        AND
            web_pages.priority = (
                SELECT
                    min(priority)
                FROM
                    web_pages
                WHERE
                    state = 'new'::dlstate_enum
                AND
                    distance < 1000000
                AND
                    normal_fetch_mode = true
                AND
                    web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
            )
        AND
            web_pages.distance < 1000000
        AND
            web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
        LIMIT 1
    )
AND
    web_pages.state = 'new'
RETURNING
    web_pages.id;
EXPLAIN ANALYZE output:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Update on web_pages  (cost=2.12..10.14 rows=1 width=798) (actual time=2312.127..2312.127 rows=0 loops=1)
  InitPlan 3 (returns $2)
    ->  Limit  (cost=1.21..1.56 rows=1 width=4) (actual time=2312.118..2312.118 rows=0 loops=1)
          InitPlan 2 (returns $1)
            ->  Result  (cost=0.77..0.78 rows=1 width=0) (actual time=2312.109..2312.110 rows=1 loops=1)
                  InitPlan 1 (returns $0)
                    ->  Limit  (cost=0.43..0.77 rows=1 width=4) (actual time=2312.106..2312.106 rows=0 loops=1)
                          ->  Index Scan using ix_web_pages_distance_filtered on web_pages web_pages_1  (cost=0.43..176587.44 rows=509043 width=4) (actual time=2312.103..2312.103 rows=0 loops=1)
                                Index Cond: (priority IS NOT NULL)
                                Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
          ->  Index Scan using ix_web_pages_distance_filtered on web_pages web_pages_2  (cost=0.43..35375.47 rows=101809 width=4) (actual time=2312.116..2312.116 rows=0 loops=1)
                Index Cond: (priority = $1)
                Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
  ->  Index Scan using ix_web_pages_id on web_pages  (cost=0.56..8.58 rows=1 width=798) (actual time=2312.124..2312.124 rows=0 loops=1)
        Index Cond: (id = $2)
        Filter: (state = 'new'::dlstate_enum)
Planning time: 1.712 ms
Execution time: 2313.699 ms
(18 rows)
Table Schema:
Table "public.web_pages"
Column | Type | Modifiers
-------------------+-----------------------------+---------------------------------------------------------------------
id | integer | not null default nextval('web_pages_id_seq'::regclass)
state | dlstate_enum | not null
errno | integer |
url | text | not null
starturl | text | not null
netloc | text | not null
file | integer |
priority | integer | not null
distance | integer | not null
is_text | boolean |
limit_netloc | boolean |
title | citext |
mimetype | text |
type | itemtype_enum |
content | text |
fetchtime | timestamp without time zone |
addtime | timestamp without time zone |
tsv_content | tsvector |
normal_fetch_mode | boolean | default true
ignoreuntiltime | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone
Indexes:
"web_pages_pkey" PRIMARY KEY, btree (id)
"ix_web_pages_url" UNIQUE, btree (url)
"idx_web_pages_title" gin (to_tsvector('english'::regconfig, title::text))
"ix_web_pages_distance" btree (distance)
"ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000 AND normal_fetch_mode = true
"ix_web_pages_id" btree (id)
"ix_web_pages_netloc" btree (netloc)
"ix_web_pages_priority" btree (priority)
"ix_web_pages_state" btree (state)
"ix_web_pages_url_ops" btree (url text_pattern_ops)
"web_pages_state_netloc_idx" btree (state, netloc)
Foreign-key constraints:
"web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id)
Triggers:
update_row_count_trigger BEFORE INSERT OR UPDATE ON web_pages FOR EACH ROW EXECUTE PROCEDURE web_pages_content_update_func()
I’ve experimented with creating compound indexes on multiple columns to try to improve the query performance, without much luck. I have VACUUM ANALYZEd before running the above EXPLAIN.
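For context, the sort of compound index I mean is roughly along these lines (an illustrative sketch with a made-up index name, not the exact DDL I ran):

-- Illustrative sketch only: a compound partial index covering the columns the
-- min(priority) subselect filters on. The ignoreuntiltime comparison against
-- now() isn't immutable, so it can't go in the partial-index predicate; here
-- it's added as a second key column instead.
CREATE INDEX CONCURRENTLY ix_web_pages_priority_ignoreuntil
    ON web_pages (priority, ignoreuntiltime)
    WHERE state = 'new'::dlstate_enum
      AND distance < 1000000
      AND normal_fetch_mode = true;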
The cardinality of the priority column is quite low (about 5 distinct values), while the table itself is fairly large (55,659,673 rows).
Query execution time is rather variable: generally about 2 seconds in the worst case, and around 600 milliseconds in the best case when the entire index is cached in RAM (i.e. when the DB isn’t under other load).
It seems that the bulk of the cost is in the min(priority) subselect, but I haven’t had much luck creating indexes that improve its performance, though that may entirely be operator error:
EXPLAIN ANALYZE
SELECT
    min(priority)
FROM
    web_pages
WHERE
    state = 'new'::dlstate_enum
AND
    distance < 1000000
AND
    normal_fetch_mode = true
AND
    web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result  (cost=0.77..0.78 rows=1 width=0) (actual time=625.380..625.381 rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.43..0.77 rows=1 width=4) (actual time=625.375..625.375 rows=0 loops=1)
          ->  Index Scan using ix_web_pages_distance_filtered on web_pages  (cost=0.43..176587.44 rows=509043 width=4) (actual time=625.373..625.373 rows=0 loops=1)
                Index Cond: (priority IS NOT NULL)
                Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
Planning time: 0.475 ms
Execution time: 625.408 ms
(8 rows)
Are there any easy ways to improve the performance of this query? I’ve thought about maintaining a running count of rows for each priority value in an append-only count table that’s updated by triggers, but that’s complex and a fair bit of effort, and I want to be sure there isn’t a simpler approach before implementing all that.
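For reference, the rough shape of what I’m picturing is something like the following: an untested sketch with hypothetical names that only buckets by (state, priority), ignores the distance / normal_fetch_mode / ignoreuntiltime filters, and collapses the append-only deltas into an upsert for brevity.

-- Hypothetical sketch, untested: one counter row per (state, priority) bucket,
-- maintained by a row-level trigger on web_pages.
CREATE TABLE web_pages_priority_counts (
    state    dlstate_enum NOT NULL,
    priority integer      NOT NULL,
    cnt      bigint       NOT NULL DEFAULT 0,
    PRIMARY KEY (state, priority)
);

CREATE OR REPLACE FUNCTION web_pages_priority_counts_update() RETURNS trigger AS $$
BEGIN
    -- decrement the bucket the old row occupied
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        UPDATE web_pages_priority_counts
           SET cnt = cnt - 1
         WHERE state = OLD.state AND priority = OLD.priority;
    END IF;
    -- increment the bucket the new row occupies
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        INSERT INTO web_pages_priority_counts (state, priority, cnt)
        VALUES (NEW.state, NEW.priority, 1)
        ON CONFLICT (state, priority)
            DO UPDATE SET cnt = web_pages_priority_counts.cnt + 1;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER web_pages_priority_counts_trg
    AFTER INSERT OR UPDATE OF state, priority OR DELETE ON web_pages
    FOR EACH ROW EXECUTE PROCEDURE web_pages_priority_counts_update();

-- The min(priority) subselect would then read the (tiny) counts table instead:
-- SELECT min(priority) FROM web_pages_priority_counts
--  WHERE state = 'new' AND cnt > 0;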