Why are correlated subqueries sometimes faster than joins in Postgres?

This relates to the dataset described in "Postgres is performing sequential scan instead of index scan".

I’ve started work on adapting the import logic to work with a more normalised schema – no surprises here: it’s faster and more compact – but I’ve hit a roadblock updating the existing data: adding and updating the relevant foreign keys is taking an age.

UPDATE pages
SET id_site = sites.id
FROM sites
WHERE sites.url = pages."urlShort"
AND "labelDate" = '2015-01-15';

NB pages."urlShort" and sites.url are text fields; both are indexed but currently have no explicit relationship.
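For reference, a rough sketch of the two tables as I understand them – column types and index names beyond those appearing in the plans below are my assumption, not the exact DDL:

```sql
-- Hypothetical minimal schema matching the queries in this question;
-- the real tables have more columns.
CREATE TABLE sites (
    id  serial PRIMARY KEY,
    url text NOT NULL
);
CREATE INDEX sites_url_idx ON sites (url);          -- assumed name

CREATE TABLE pages (
    id          serial PRIMARY KEY,
    "urlShort"  text,
    "labelDate" date,
    id_site     integer           -- to be populated; no FK constraint yet
);
CREATE INDEX "pages_urlShort_idx" ON pages ("urlShort");   -- assumed name
CREATE INDEX "pages_labelDate_idx" ON pages ("labelDate"); -- name taken from the plans below
```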

There are around 500,000 rows for each date value and updates like this are taking around 2h30. :-(

I looked at what the underlying query might look like:

select * from pages
join sites on
sites.url = pages."urlShort"
where "labelDate" = '2015-01-01'

This takes around 6 minutes to run and has a query plan like this:

"Hash Join  (cost=80226.81..934763.02 rows=493018 width=365)"
"  Hash Cond: ((pages."urlShort")::text = sites.url)"
"  ->  Bitmap Heap Scan on pages  (cost=13549.32..803595.26 rows=493018 width=315)"
"        Recheck Cond: ("labelDate" = '2015-01-01'::date)"
"        ->  Bitmap Index Scan on "pages_labelDate_idx"  (cost=0.00..13426.07 rows=493018 width=0)"
"              Index Cond: ("labelDate" = '2015-01-01'::date)"
"  ->  Hash  (cost=30907.66..30907.66 rows=1606466 width=50)"
"        ->  Seq Scan on sites  (cost=0.00..30907.66 rows=1606466 width=50)"

Based on some help in the past on related subjects I decided to compare this with a similar query that used a correlated subquery instead of a join.

SELECT "urlShort" AS url
FROM pages
WHERE 
"labelDate" = '2015-01-01'
and id_site is NULL
AND EXISTS
(SELECT * FROM sites
     WHERE sites.url = pages."urlShort")

This query only takes about 15s to run and has the following query plan:

"Hash Join  (cost=64524.36..860389.62 rows=423223 width=27)"
"  Hash Cond: ((pages."urlShort")::text = sites.url)"
"  ->  Bitmap Heap Scan on pages  (cost=13535.88..803581.81 rows=423223 width=27)"
"        Recheck Cond: ("labelDate" = '2015-01-01'::date)"
"        Filter: (id_site IS NULL)"
"        ->  Bitmap Index Scan on "pages_labelDate_idx"  (cost=0.00..13430.07 rows=493018 width=0)"
"              Index Cond: ("labelDate" = '2015-01-01'::date)"
"  ->  Hash  (cost=30907.66..30907.66 rows=1606466 width=27)"
"        ->  Seq Scan on sites  (cost=0.00..30907.66 rows=1606466 width=27)"

There are two things I’d like to know:
1) Can I adjust the update to run faster based on the above?
2) What parts of the query plan are telltales for running slowly? Or do you always have to run EXPLAIN ANALYZE to find out?
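For context, this is how I’ve been capturing the plans above. Plain EXPLAIN only shows the planner’s estimates; EXPLAIN ANALYZE actually executes the statement and reports real times and row counts, so for the UPDATE I wrap it in a transaction and roll back:

```sql
-- EXPLAIN shows estimated costs without running the query.
EXPLAIN
SELECT "urlShort" AS url
FROM pages
WHERE "labelDate" = '2015-01-01'
  AND id_site IS NULL
  AND EXISTS (SELECT 1 FROM sites WHERE sites.url = pages."urlShort");

-- EXPLAIN ANALYZE executes the statement, so wrap data-modifying
-- statements in a transaction and roll back to avoid side effects.
BEGIN;
EXPLAIN ANALYZE
UPDATE pages
SET id_site = sites.id
FROM sites
WHERE sites.url = pages."urlShort"
  AND "labelDate" = '2015-01-15';
ROLLBACK;
```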

