
Slow query over 10mil rows with group by on indexed column


I’ve got a slow query which I want to optimize.

SELECT
    pand.bouwjaar AS construction_date,
    count(*)
FROM pand AS pand
WHERE 
    pand.begindatumtijdvakgeldigheid <= 'now'::text::timestamp without time zone 
    AND (pand.einddatumtijdvakgeldigheid IS NULL OR pand.einddatumtijdvakgeldigheid >= 'now'::text::timestamp without time zone) 
    AND pand.aanduidingrecordinactief = false 
    AND pand.geom_valid = true 
    AND pand.pandstatus <> 'Niet gerealiseerd pand'::pandstatus 
    AND pand.pandstatus <> 'Pand gesloopt'::pandstatus

GROUP BY pand.bouwjaar 

At the moment it takes 37999.326 ms, while it's 'just' a GROUP BY and count on an indexed column. I know the query is heavy, since it has to process over 10 million rows, but I hope it can do better than 38 seconds.

This is the output of EXPLAIN (ANALYZE, BUFFERS):

HashAggregate  (cost=1087761.16..1087762.90 rows=174 width=5) (actual time=37998.961..37999.173 rows=855 loops=1)
  Buffers: shared hit=5312 read=606349
  ->  Seq Scan on pand  (cost=0.00..1037569.41 rows=10038351 width=5) (actual time=1.237..31246.011 rows=9935732 loops=1)
        Filter: ((NOT aanduidingrecordinactief) AND geom_valid AND (pandstatus <> 'Niet gerealiseerd pand'::pandstatus) AND (pandstatus <> 'Pand gesloopt'::pandstatus) AND (begindatumtijdvakgeldigheid <= ('now'::cstring)::timestamp without time zone) AND ( (...)
        Rows Removed by Filter: 4261215
        Buffers: shared hit=5312 read=606349
Total runtime: 37999.326 ms
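
A back-of-the-envelope reading of the Buffers line, assuming the default 8 kB block size (the real value can be checked with SHOW block_size): the scan touched 5312 + 606349 = 611661 pages, and almost none of them were found in shared_buffers. The sketch below only converts those page counts into sizes; it does not touch the table.

```sql
-- Confirm the block size; 8192 bytes is the default and is assumed below.
SHOW block_size;

-- Convert the page counts from the plan into human-readable sizes.
SELECT
    pg_size_pretty((5312 + 606349) * 8192::bigint) AS pages_scanned,
    pg_size_pretty(606349 * 8192::bigint)          AS pages_not_in_shared_buffers;
```

Under that assumption the sequential scan pulls roughly 4.6 GiB from outside shared_buffers (the OS cache or disk) on every run, which fits a runtime dominated by reading the table.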

If I'm reading this EXPLAIN output correctly, my biggest problem lies with the HashAggregate and its high cost.

The strange thing is that I already have an index on pand.bouwjaar.
How can I speed up this query? Or should I just live with it? ;)
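
One way to see what the planner thinks of the existing index is to discourage sequential scans for a single transaction and compare the resulting plan. This is a diagnostic sketch only, not a fix: the WHERE clause references columns that are not in pand_idx_bouwjaar, so an index scan would presumably still have to visit the heap for every row.

```sql
BEGIN;

-- Affects only this transaction; the setting is discarded at ROLLBACK.
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT
    pand.bouwjaar AS construction_date,
    count(*)
FROM pand AS pand
WHERE
    pand.begindatumtijdvakgeldigheid <= 'now'::text::timestamp without time zone
    AND (pand.einddatumtijdvakgeldigheid IS NULL
         OR pand.einddatumtijdvakgeldigheid >= 'now'::text::timestamp without time zone)
    AND pand.aanduidingrecordinactief = false
    AND pand.geom_valid = true
    AND pand.pandstatus <> 'Niet gerealiseerd pand'::pandstatus
    AND pand.pandstatus <> 'Pand gesloopt'::pandstatus
GROUP BY pand.bouwjaar;

ROLLBACK;
```

If the forced plan turns out slower than the sequential scan, the planner's original choice was reasonable for this index.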

When I remove the whole WHERE clause, I get the following query plan:

HashAggregate  (cost=824615.21..824617.66 rows=245 width=5) (actual time=23589.184..23589.612 rows=1033 loops=1)
  Buffers: shared hit=5473 read=606188
  ->  Seq Scan on pand  (cost=0.00..753630.47 rows=14196947 width=5) (actual time=2.528..13207.042 rows=14196947 loops=1)
        Buffers: shared hit=5473 read=606188
Total runtime: 23589.862 ms

Still 23.6 seconds :(
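
Without the WHERE clause the query only needs bouwjaar, so in principle pand_idx_bouwjaar could serve an index-only scan (available since PostgreSQL 9.2). That only pays off when the visibility map is current, so a possible check, assuming a manual VACUUM is acceptable on this table, might look like this:

```sql
-- Refresh the visibility map and planner statistics first.
VACUUM (ANALYZE) pand;

-- Then look for "Index Only Scan using pand_idx_bouwjaar" and a low
-- "Heap Fetches" count in the output; neither is guaranteed.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    pand.bouwjaar AS construction_date,
    count(*)
FROM pand AS pand
GROUP BY pand.bouwjaar;
```

Whether the planner actually switches plans depends on its cost estimates, so this is a check rather than a guaranteed improvement.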

Update

As requested, here are the version and table details:

PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit

CREATE TABLE pand
(
  gid serial NOT NULL,
  identificatie numeric(16,0),
  aanduidingrecordinactief boolean,
  aanduidingrecordcorrectie integer,
  officieel boolean,
  inonderzoek boolean,
  begindatumtijdvakgeldigheid timestamp without time zone,
  einddatumtijdvakgeldigheid timestamp without time zone,
  documentnummer character varying(20),
  documentdatum date,
  pandstatus pandstatus,
  bouwjaar numeric(4,0),
  geom_valid boolean,
  geovlak geometry,
  CONSTRAINT pand_pkey PRIMARY KEY (gid),
  CONSTRAINT enforce_dims_geometrie CHECK (st_ndims(geovlak) = 3),
  CONSTRAINT enforce_geotype_geometrie CHECK (geometrytype(geovlak) = 'POLYGON'::text OR geovlak IS NULL),
  CONSTRAINT enforce_srid_geometrie CHECK (st_srid(geovlak) = 28992)
)
WITH (
  OIDS=TRUE
);
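
Since the table carries a geometry column, the heap is probably much larger than the few narrow columns the query reads. A quick sketch for putting numbers on that, using the built-in size functions:

```sql
SELECT
    pg_size_pretty(pg_relation_size('pand'))       AS heap_main_fork,
    pg_size_pretty(pg_table_size('pand'))          AS heap_plus_toast,
    pg_size_pretty(pg_indexes_size('pand'))        AS all_indexes,
    pg_size_pretty(pg_total_relation_size('pand')) AS total;
```

The heap_main_fork figure is the part a sequential scan has to read, so it can be compared against the buffer counts in the plans above.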

Indexes on this table:

CREATE INDEX pand_key
  ON pand
  USING btree
  (identificatie, aanduidingrecordinactief, aanduidingrecordcorrectie, begindatumtijdvakgeldigheid);


CREATE INDEX pand_idx_bouwjaar
  ON pand
  USING btree
  (bouwjaar);

CREATE INDEX pand_geom_idx
  ON pand
  USING gist
  (geovlak);
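
None of these indexes contains every column the filtered query references, so none of them can answer it without visiting the heap. For illustration only, an index that covers all six referenced columns might look like the sketch below. The index name is hypothetical, the sketch is untested on this data, and it assumes pandstatus is an enum (or another type with a default btree operator class).

```sql
-- Hypothetical covering index: bouwjaar first for the GROUP BY, followed by
-- every column used in the WHERE clause so that an index-only scan is possible.
CREATE INDEX pand_idx_bouwjaar_covering
  ON pand
  USING btree
  (bouwjaar,
   begindatumtijdvakgeldigheid,
   einddatumtijdvakgeldigheid,
   aanduidingrecordinactief,
   geom_valid,
   pandstatus);

-- Index-only scans rely on an up-to-date visibility map.
VACUUM (ANALYZE) pand;
```

The trade-off is extra disk space and slower writes, and the planner may still prefer the sequential scan, so the plan would need to be checked again with EXPLAIN (ANALYZE, BUFFERS).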
