
How can I improve this nested postgresql query?


I have a mildly complex query that performs rather poorly:

UPDATE
    web_pages
SET
    state = 'fetching'
WHERE
    web_pages.id = (
        SELECT
            web_pages.id
        FROM
            web_pages
        WHERE
            web_pages.state = 'new'
        AND
            normal_fetch_mode = true
        AND
            web_pages.priority = (
               SELECT
                    min(priority)
                FROM
                    web_pages
                WHERE
                    state = 'new'::dlstate_enum
                AND
                    distance < 1000000
                AND
                    normal_fetch_mode = true
                AND
                    web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
            )
        AND
            web_pages.distance < 1000000
        AND
            web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
        LIMIT 1
    )
AND
    web_pages.state = 'new'
RETURNING
    web_pages.id;

EXPLAIN ANALYZE:

                                                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=2.12..10.14 rows=1 width=798) (actual time=2312.127..2312.127 rows=0 loops=1)
   InitPlan 3 (returns $2)
     ->  Limit  (cost=1.21..1.56 rows=1 width=4) (actual time=2312.118..2312.118 rows=0 loops=1)
           InitPlan 2 (returns $1)
             ->  Result  (cost=0.77..0.78 rows=1 width=0) (actual time=2312.109..2312.110 rows=1 loops=1)
                   InitPlan 1 (returns $0)
                     ->  Limit  (cost=0.43..0.77 rows=1 width=4) (actual time=2312.106..2312.106 rows=0 loops=1)
                           ->  Index Scan using ix_web_pages_distance_filtered on web_pages web_pages_1  (cost=0.43..176587.44 rows=509043 width=4) (actual time=2312.103..2312.103 rows=0 loops=1)
                                 Index Cond: (priority IS NOT NULL)
                                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
           ->  Index Scan using ix_web_pages_distance_filtered on web_pages web_pages_2  (cost=0.43..35375.47 rows=101809 width=4) (actual time=2312.116..2312.116 rows=0 loops=1)
                 Index Cond: (priority = $1)
                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
   ->  Index Scan using ix_web_pages_id on web_pages  (cost=0.56..8.58 rows=1 width=798) (actual time=2312.124..2312.124 rows=0 loops=1)
         Index Cond: (id = $2)
         Filter: (state = 'new'::dlstate_enum)
 Planning time: 1.712 ms
 Execution time: 2313.699 ms
(18 rows)

Table Schema:

                                               Table "public.web_pages"
      Column       |            Type             |                              Modifiers
-------------------+-----------------------------+---------------------------------------------------------------------
 id                | integer                     | not null default nextval('web_pages_id_seq'::regclass)
 state             | dlstate_enum                | not null
 errno             | integer                     |
 url               | text                        | not null
 starturl          | text                        | not null
 netloc            | text                        | not null
 file              | integer                     |
 priority          | integer                     | not null
 distance          | integer                     | not null
 is_text           | boolean                     |
 limit_netloc      | boolean                     |
 title             | citext                      |
 mimetype          | text                        |
 type              | itemtype_enum               |
 content           | text                        |
 fetchtime         | timestamp without time zone |
 addtime           | timestamp without time zone |
 tsv_content       | tsvector                    |
 normal_fetch_mode | boolean                     | default true
 ignoreuntiltime   | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone
Indexes:
    "web_pages_pkey" PRIMARY KEY, btree (id)
    "ix_web_pages_url" UNIQUE, btree (url)
    "idx_web_pages_title" gin (to_tsvector('english'::regconfig, title::text))
    "ix_web_pages_distance" btree (distance)
    "ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000 AND normal_fetch_mode = true
    "ix_web_pages_id" btree (id)
    "ix_web_pages_netloc" btree (netloc)
    "ix_web_pages_priority" btree (priority)
    "ix_web_pages_state" btree (state)
    "ix_web_pages_url_ops" btree (url text_pattern_ops)
    "web_pages_state_netloc_idx" btree (state, netloc)
Foreign-key constraints:
    "web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id)
Triggers:
    update_row_count_trigger BEFORE INSERT OR UPDATE ON web_pages FOR EACH ROW EXECUTE PROCEDURE web_pages_content_update_func()

I’ve experimented with creating compound indexes on multiple columns to try to improve the query performance, without much luck. I ran VACUUM ANALYZE before capturing the EXPLAIN output above.
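
As an example of the kind of thing I mean (the index name and exact column choice here are just illustrative, not a definitive recipe), the variants were partial compound indexes along these lines:

CREATE INDEX ix_web_pages_priority_ignoretime_filtered
    ON web_pages (priority, ignoreuntiltime)
    WHERE state = 'new'::dlstate_enum
      AND distance < 1000000
      AND normal_fetch_mode = true;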

The cardinality of the priority column is quite low (about 5 distinct values), while the overall table is fairly large (55,659,673 rows).

Query execution time is rather variable: generally about 2 seconds in the worst case and 600 milliseconds in the best case, when the entire index is cached in RAM (i.e. when the DB isn’t under other load).

It seems that the major cost is the min(priority) subselect, but I haven’t had much luck creating indexes that improve its performance, though that may well be operator error:

EXPLAIN ANALYZE                
SELECT
    min(priority)
FROM
    web_pages
WHERE
    state = 'new'::dlstate_enum
AND
    distance < 1000000
AND
    normal_fetch_mode = true
AND
    web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.77..0.78 rows=1 width=0) (actual time=625.380..625.381 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.77 rows=1 width=4) (actual time=625.375..625.375 rows=0 loops=1)
           ->  Index Scan using ix_web_pages_distance_filtered on web_pages  (cost=0.43..176587.44 rows=509043 width=4) (actual time=625.373..625.373 rows=0 loops=1)
                 Index Cond: (priority IS NOT NULL)
                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
 Planning time: 0.475 ms
 Execution time: 625.408 ms
(8 rows)
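
For what it’s worth, judging by the plan above (the Limit + Index Scan InitPlan), the planner already turns min(priority) into an index scan plus LIMIT 1, so I would expect the explicit ORDER BY form below (just a rewording of the same subselect, shown in case it helps anyone reproduce the behaviour) to perform about the same:

SELECT
    priority
FROM
    web_pages
WHERE
    state = 'new'::dlstate_enum
AND
    distance < 1000000
AND
    normal_fetch_mode = true
AND
    ignoreuntiltime < current_timestamp + '5 minutes'::interval
ORDER BY
    priority
LIMIT 1;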

Are there any easy ways to improve the performance of this query? I’ve thought about maintaining a running count of each priority value in an append-only count table that’s updated by triggers, but that’s complex and a fair bit of effort, and I want to be sure there isn’t a simpler approach before implementing all of that.
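
To make that idea concrete, here is a rough sketch of the kind of thing I have in mind (table, function, and trigger names are hypothetical, and this glosses over details such as compacting the delta table and contention between concurrent writers):

-- Hypothetical sketch only: an append-only delta table, maintained by a
-- trigger, that tracks how many 'new' rows exist per priority value.
CREATE TABLE web_pages_new_priority_deltas (
    priority integer NOT NULL,
    delta    integer NOT NULL   -- +1 when a row enters state 'new', -1 when it leaves
);

CREATE OR REPLACE FUNCTION web_pages_new_priority_delta_func() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        IF NEW.state = 'new' THEN
            INSERT INTO web_pages_new_priority_deltas (priority, delta) VALUES (NEW.priority, +1);
        END IF;
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        IF OLD.state = 'new' THEN
            INSERT INTO web_pages_new_priority_deltas (priority, delta) VALUES (OLD.priority, -1);
        END IF;
        IF NEW.state = 'new' THEN
            INSERT INTO web_pages_new_priority_deltas (priority, delta) VALUES (NEW.priority, +1);
        END IF;
        RETURN NEW;
    ELSE  -- DELETE
        IF OLD.state = 'new' THEN
            INSERT INTO web_pages_new_priority_deltas (priority, delta) VALUES (OLD.priority, -1);
        END IF;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER web_pages_new_priority_delta_trigger
    AFTER INSERT OR UPDATE OR DELETE ON web_pages
    FOR EACH ROW EXECUTE PROCEDURE web_pages_new_priority_delta_func();

-- min(priority) over 'new' rows would then come from the (much smaller) delta table:
SELECT
    min(priority)
FROM (
    SELECT priority
    FROM web_pages_new_priority_deltas
    GROUP BY priority
    HAVING sum(delta) > 0
) AS live_priorities;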

