Channel: Question and Answer » postgresql

Postgres INNER JOIN same table failing on CartoDB


I have a table with data which looks like this

+----+------+-------+
| cp | year | count |
+----+------+-------+
| 1  | 2000 | 10000 |
| 1  | 2001 | 9000  |
| 1  | 2002 | 9500  |
| 2  | 2000 | 8000  |
| 2  | 2001 | 7500  |
| 2  | 2002 | 7000  |
+----+------+-------+

Every cp is guaranteed to have the same number of years. What I want is the percentage change between the current year and a reference year. So I’m after this:

+----+------+-------+-----------+--------------+
| cp | year | count | since2000 | since2000pct |
+----+------+-------+-----------+--------------+
| 1  | 2000 | 10000 | 0         | 0            |
| 1  | 2001 | 9000  | -1000     | -0.1         |
| 1  | 2002 | 9500  | -500      | -0.05        |
| 2  | 2000 | 20000 | 0         | 0            |
| 2  | 2001 | 16000 | -4000     | -0.2         |
| 2  | 2002 | 21000 | 1000      | 0.05         |
+----+------+-------+-----------+--------------+

It’s been a while since I’ve done much with SQL, but this looks like a pretty straightforward inner join. All I should need is the year-2000 count repeated for every row, and the rest is just math. I got it working on SQL Fiddle:

http://sqlfiddle.com/#!15/81a94/1/0

SELECT
*
FROM traffic_counts as s1
INNER JOIN
(SELECT cp, count AS count2000 FROM traffic_counts WHERE year = 2000) s2
ON s1.cp = s2.cp;

But when I try to run this on CartoDB (PostgreSQL 9.3.4) I get an error saying that cp is ambiguous. (SQL Fiddle has no problem…) I even tried aliasing the table to s3 inside the subselect and fully qualifying the first “cp”, but the error was the same.
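
For what it’s worth, a version that lists the output columns explicitly (a sketch based on the column names above), so that the joined result contains cp only once, would be:

SELECT
  s1.cp,
  s1.year,
  s1.count,
  s1.count - s2.count2000 AS since2000,
  (s1.count - s2.count2000)::numeric / s2.count2000 AS since2000pct
FROM traffic_counts AS s1
INNER JOIN
  (SELECT cp, count AS count2000 FROM traffic_counts WHERE year = 2000) AS s2
  ON s1.cp = s2.cp;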

Can anyone help?


Can't create user in postgres


When I run CREATE USER username WITH PASSWORD 'test' in sudo -u postgres psql and then run \du, only the postgres user appears… I also don’t get any feedback when running commands like ALTER USER, even though I know USER ALTERED (or something similar) should appear.
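
For reference, a minimal sketch of the kind of session I mean (note that psql only executes a statement once it sees the terminating semicolon):

-- run inside: sudo -u postgres psql
CREATE USER username WITH PASSWORD 'test';    -- should print CREATE ROLE
ALTER USER username WITH PASSWORD 'test2';    -- should print ALTER ROLE
-- \du should then list the new role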

Index a Table with Geometry Column


I have a table called Stores with 10k rows which has a column location geometry(Point,4326).

CREATE INDEX "Stores_spatial_gix"
ON "Stores"
USING gist
(location);

When I run this KNN query:

explain analyze select *
 from "Stores"
 order by ST_distance_sphere(location,st_point(-82.373978, 29.633657)) limit 3

I get about 800 ms each time. What am I doing wrong that it takes so long?
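
For comparison, the KNN pattern that can use the GiST index directly (a sketch; the <-> operator orders by planar distance in the geometry’s units, so the ordering can differ slightly from the sphere-based distance):

EXPLAIN ANALYZE
SELECT *
FROM "Stores"
ORDER BY location <-> ST_SetSRID(ST_MakePoint(-82.373978, 29.633657), 4326)
LIMIT 3;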

postgresql – How to limit access to specific database only, and restrict access to system tables


I have a user test which can view data from system tables such as pg_class. The idea is to restrict this user to a specific database only, with no access to system resources. How can one achieve this?
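
A sketch of the usual starting point, i.e. controlling which databases the role may connect to (the database names are hypothetical; fully hiding system catalogs such as pg_class is not really supported, since they are readable by all roles by design):

REVOKE CONNECT ON DATABASE otherdb FROM PUBLIC;  -- hypothetical: a database test should not reach
REVOKE CONNECT ON DATABASE somedb  FROM PUBLIC;
GRANT  CONNECT ON DATABASE somedb  TO test;      -- hypothetical: the only database test may use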

Overlapping line aggregation+trim in PostGIS, while summing values


I’m sorry for the vague title; I didn’t really know how to name this. I am working on a project that visualizes and simulates 17th-century trade. For this project we also use so-called Minard lines to show how certain goods are distributed. To clarify, I made this screenshot:

http://i.imgur.com/nzI7mUv.jpg

However, as you can see, every route is a separate line (Amsterdam <> Jakarta and Rotterdam <> Jakarta, for instance) and they all overlap with each other. This does give a nice Tron effect, but we actually want to merge the lines that overlap (while adding their values) and segmentize those that don’t (while retaining their values).

I’ve been thinking a lot about how to do this in Postgres and PostGIS, but I’m having real trouble with it. Functions like ST_Union aggregate everything. Could anyone point me in the right direction? Here’s a screenshot of the database:

http://i.imgur.com/bcnWbOm.png

Multiple queries in pgrouting (length as result)


I am facing the following problem with two tables:

The first table contains my whole street network with cost, source and target. The other table contains 60 nodes (from the above network). Between all of these 60 nodes I want to calculate the length of the path (e.g. from node 1 to 2, 3, 4 … 60, from node 2 to 1, 3, 4, 5 … 60, and so on).

Is there a way to calculate the length of all paths automatically? Until now I can only calculate one path manually, then save the table and compute the length.

But as you can imagine, I don’t want to calculate 60*60 relations manually.

See my SQL-query so far:

SELECT seq, id1 AS node, id2 AS edge, a.cost, geom_way
FROM pgr_dijkstra(
'SELECT id, source, target, km as cost, km as reverse_cost FROM 
routing where cost is not null',
(SELECT nodes FROM nodelist WHERE id =1),
(SELECT nodes FROM nodelist WHERE id =2), 
false, 
true
) AS a
Join routing b
ON a.id2 = b.id
ORDER BY seq;
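
One direction that might work (a sketch reusing the same pgr_dijkstra signature as above): cross-join the node list with itself and call pgr_dijkstra per pair via LATERAL, summing the per-edge costs to get each path length.

SELECT n1.id AS from_node,
       n2.id AS to_node,
       sum(r.cost) AS path_length
FROM nodelist n1
CROSS JOIN nodelist n2
CROSS JOIN LATERAL pgr_dijkstra(
    'SELECT id, source, target, km AS cost, km AS reverse_cost
     FROM routing WHERE cost IS NOT NULL',
    n1.nodes, n2.nodes,
    false,
    true
) AS r
WHERE n1.id < n2.id              -- skip self-pairs; with an undirected graph each pair is needed once
GROUP BY n1.id, n2.id
ORDER BY n1.id, n2.id;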

What's better here – a single column or multicolumn primary key?


Let’s assume you have a table groups and a table item. Each item belongs to exactly one group. It is an inherent part of that group. An item cannot exist outside of a group and it cannot be moved to another group.

When trying to decide on a primary key for the item table, what should I use?

Should I make up an artificial global serial key like this:

CREATE TABLE items
(
    item serial PRIMARY KEY,
    "group" integer NOT NULL REFERENCES groups ("group")  -- "group" is a reserved word, so it has to be quoted
);

… or should I rather use a composite key and per-group item serial like this:

CREATE TABLE items
(
    "group" integer NOT NULL REFERENCES groups ("group"),
    item integer NOT NULL,

    PRIMARY KEY ("group", item)
);

The reason why I’m leaning more towards the second solution is that the post URL will always show the group and item, so it makes sense to have both of them as the composite primary key. In case of the first solution, the URL contains superfluous information because the group ID can already be deduced from the item ID alone. The URL structure is given, however, and cannot be changed.

The disadvantage of the second solution is that you have to manage a per-group serial (i.e. each item integer should start from 0 for each group).
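
To make that disadvantage concrete, here is a sketch of the naive way to hand out the per-group number (group id 42 is hypothetical; under concurrent inserts this needs an advisory lock, SERIALIZABLE isolation, or a retry loop):

INSERT INTO items ("group", item)
SELECT 42, coalesce(max(item), 0) + 1
FROM items
WHERE "group" = 42;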

What’s better in terms of best practices, normalization and performance? Or is it simply a matter of taste?

Sessionizing Log Data Using SQL window functions: tweaking a query (incrementing session ID and event length)


I’m referring to a very good blog entry that describes how to sessionize log data with SQL window functions. I’m no database pro and I don’t really understand the logic behind this query (you can see the result by following the above link, under “Query 2”):

select
uid,
sum(new_event_boundary) OVER (PARTITION BY uid ORDER BY event_timestamp) as session_id,
event_timestamp,
minutes_since_last_interval,
new_event_boundary
from 
            --Query 1: Define boundary events
            (select
            uid,
            event_timestamp,
            (extract(epoch from event_timestamp) - lag(extract(epoch from event_timestamp)) OVER (PARTITION BY uid ORDER BY event_timestamp))/60 as minutes_since_last_interval,
            case when extract(epoch from event_timestamp) - lag(extract(epoch from event_timestamp)) OVER (PARTITION BY uid ORDER BY event_timestamp) > 30 * 60 then 1 ELSE 0 END as new_event_boundary
            from single_col_timestamp
            ) a;

This query assigns a session id for every uid, delimited by 30 minutes. For every new uid the session id is reset to 0.

  • I want to modify the query so that the session id increments all the way up, regardless of a “new” uid.
  • Furthermore, I want to calculate the length of every event of a session (I know this is not possible for the last event of a session).

I’m sure the modification is not very hard, but I tried the whole day and it didn’t work out. It would be great if you could help me :-)
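
One direction that looks plausible (a sketch wrapping the query above, not tested): dense_rank over (uid, session_id) yields a session id that keeps increasing across uids, and lead() gives each event’s length within its session.

SELECT
  uid,
  event_timestamp,
  minutes_since_last_interval,
  new_event_boundary,
  session_id,
  dense_rank() OVER (ORDER BY uid, session_id) AS global_session_id,   -- increments across uids
  lead(event_timestamp) OVER (PARTITION BY uid, session_id
                              ORDER BY event_timestamp) - event_timestamp AS event_length  -- NULL for the last event of a session
FROM (
    -- the original query from above, unchanged
    select
    uid,
    sum(new_event_boundary) OVER (PARTITION BY uid ORDER BY event_timestamp) as session_id,
    event_timestamp,
    minutes_since_last_interval,
    new_event_boundary
    from
        (select
         uid,
         event_timestamp,
         (extract(epoch from event_timestamp) - lag(extract(epoch from event_timestamp)) OVER (PARTITION BY uid ORDER BY event_timestamp))/60 as minutes_since_last_interval,
         case when extract(epoch from event_timestamp) - lag(extract(epoch from event_timestamp)) OVER (PARTITION BY uid ORDER BY event_timestamp) > 30 * 60 then 1 ELSE 0 END as new_event_boundary
         from single_col_timestamp
        ) a
) s;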


Debug PostgreSQL lock count with check_postgres.pl


We use check_postgres.pl to monitor our database.

We use this to check the count of the locks:

https://bucardo.org/check_postgres/check_postgres.pl.html#locks

We often see more than 150 locks.

The question is: what is going on? We patched the script to output this SQL statement whenever the lock count was exceeded:

select * from pg_stat_activity order by datname

Unfortunately the result is not what I was expecting. Although there are more than 150 locks, pg_stat_activity shows only a few (fewer than 10) queries.

This has happened about twice a day over the last few days, and every time only a few lines were returned by pg_stat_activity.

What is going on here?

How to debug the current DB state if there are too many locks?
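
One thing worth keeping in mind: pg_locks counts individual locks, not sessions, and a single backend routinely holds many locks, so 150+ locks with fewer than 10 active queries is not necessarily a contradiction. A sketch of a query that ties the two views together (column names as of 9.2+; older versions use procpid/current_query):

SELECT a.pid,
       a.datname,
       a.state,
       a.query,
       count(*) AS locks_held
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
GROUP BY a.pid, a.datname, a.state, a.query
ORDER BY locks_held DESC;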

Improve the speed of a BETWEEN search


I have a query which selects some data between two dates:

SELECT * FROM rk.p_reports
WHERE project_id = 1 AND engine_id = 1 AND type_id = 1 
  AND created_at::date BETWEEN now()::date - 14 AND now()::date

I have tried creating various indexes to speed up the query, but without any luck.

First index: project_id, engine_id, type_id, created_at ASC, created_at DESC

Second index: project_id, engine_id, type_id, created_at ASC AND project_id, engine_id, type_id, created_at DESC

Third index: project_id, engine_id, type_id, created_at ASC

And other combinations, but none of the indexes was picked without setting enable_seqscan = off.

At the moment it takes around 3 seconds to return my rows; this partition has around 1.5 million rows. I am using PostgreSQL 9.4, and the DB server has 8 GB RAM and 6 cores.

Is there any way I can reduce this to under a sec?

Seq Scan on p_reports_1  (cost=0.00..57649.04 rows=4322 width=40) (actual time=614.221..696.583 rows=92607 loops=1)
Filter: ((project_id = 1) AND (engine_id = 1) AND (type_id = 1) AND ((created_at)::date <= (now())::date) AND ((created_at)::date >= ((now())::date - 14)))
Rows Removed by Filter: 1063344
Planning time: 0.200 ms
Execution time: 703.493 ms

Update, query modified as @ypercube suggested:

Bitmap Heap Scan on p_reports_1  (cost=2952.28..17478.45 rows=89005 width=40) (actual time=17.485..34.569 rows=87712 loops=1)
Recheck Cond: ((project_id = 1) AND (engine_id = 1) AND (type_id = 1) AND (created_at >= ((now())::date - 14)) AND (created_at <= (now())::date))
Heap Blocks: exact=1180
 ->  Bitmap Index Scan on p_reports_1_asc_created_at  (cost=0.00..2930.03 rows=89005 width=0) (actual time=17.236..17.236 rows=87712 loops=1)
    Index Cond: ((project_id = 1) AND (engine_id = 1) AND (type_id = 1) AND (created_at >= ((now())::date - 14)) AND (created_at <= (now())::date))
Planning time: 0.319 ms
Execution time: 41.569 ms
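
For reference, the usual way to make this kind of filter index-friendly (and what the faster plan above appears to reflect) is to compare the raw timestamp column instead of casting it to date, along these lines (a sketch; adjust the upper bound depending on whether the current day should be included in full):

SELECT *
FROM rk.p_reports
WHERE project_id = 1
  AND engine_id = 1
  AND type_id = 1
  AND created_at >= now()::date - 14
  AND created_at < now()::date + 1;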

Is Drupal 8 with PostgreSQL supported?


Is the new Drupal 8 supported/certified by the Drupal community to run on PostgreSQL?

This is the top hit,

https://www.drupal.org/node/2157455

I don’t really know what that means here on the open-source side, as the majority of support seems to be for MySQL.

How to optimize query with max on Postgres?


I’m trying to make a relatively simple query, but it’s taking much longer than I’d expect. I have an index in place, but it doesn’t seem to be helping much.

Here’s the query. It takes over an hour and a half to execute:

SELECT MAX("transactions"."api_last_change_date") AS max_id 
FROM "transactions" 
WHERE "transactions"."practice_id" = 466;

Here’s the table, with indices:

                                          Table "public.transactions"
        Column        |            Type             |                         Modifiers
----------------------+-----------------------------+------------------------------------------------------------
 id                   | integer                     | not null default nextval('transactions_id_seq'::regclass)
 installation_id      | character varying(255)      |
 data_provider_id           | character varying(255)      | not null
 transaction_date     | timestamp without time zone |
 pms_code             | character varying(255)      |
 pms_code_data_provider_id  | character varying(255)      |
 description          | text                        |
 quantity             | integer                     |
 amount               | integer                     |
 is_payment           | character varying(255)      |
 trans_type           | character varying(255)      |
 client_pms_id        | character varying(255)      |
 patient_pms_id       | character varying(255)      |
 client_data_provider_id    | character varying(255)      |
 patient_data_provider_id   | character varying(255)      | not null
 provider_id          | character varying(255)      |
 provider_name        | character varying(255)      |
 invoice_id           | character varying(255)      |
 transaction_total    | integer                     |
 practice_id          | integer                     | not null
 api_create_date      | timestamp without time zone | default '2015-10-09 05:00:00'::timestamp without time zone
 api_last_change_date | timestamp without time zone | default '2015-10-09 05:00:00'::timestamp without time zone
 api_removed_date     | timestamp without time zone |
Indexes:
    "transactions_pkey" PRIMARY KEY, btree (id)
    "index_transactions_on_api_last_change_date" btree (api_last_change_date)
    "index_transactions_on_patient_data_provider_id" btree (patient_data_provider_id)
    "index_transactions_on_practice_id" btree (practice_id)
    "index_transactions_on_data_provider_id" btree (data_provider_id)
    "transactions_practice_id_api_last_change_date" btree (practice_id, api_last_change_date DESC NULLS LAST)

Here’s the results of explain on the query:

explain SELECT MAX("transactions"."api_last_change_date") AS max_id FROM "transactions" WHERE "transactions"."practice_id" = 466;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=818.01..818.02 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.11..818.01 rows=1 width=8)
           ->  Index Scan Backward using index_transactions_on_api_last_change_date on transactions  (cost=0.11..141374929.80 rows=172851 width=8)
                 Index Cond: (api_last_change_date IS NOT NULL)
                 Filter: (practice_id = 466)
(6 rows)

The transactions table has almost 150 million records. Only about 20000 records have practice_id = 466.

As you can see, there are multiple indices on the table, including one that I thought would work specifically for this query (transactions_practice_id_api_last_change_date), but Postgres is choosing to use a different one (index_transactions_on_api_last_change_date). From my understanding the index I created should work well, and the lookup should be at most O(log n) on a btree keyed on those two columns.

I was trying to get an explain analyze result for this, but each time I’ve tried to run it, I’ve run into connectivity issues before it completes. If I can get it to run successfully, I’ll post the results here.

How would I go about improving the performance of this query?
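
For reference, one rewrite that is often suggested for this pattern (a sketch; it asks for the same value in a form that lines up with the transactions_practice_id_api_last_change_date index, though unlike MAX it returns zero rows instead of NULL when there is no match):

SELECT api_last_change_date AS max_id
FROM transactions
WHERE practice_id = 466
  AND api_last_change_date IS NOT NULL
ORDER BY api_last_change_date DESC NULLS LAST   -- matches how the composite index is declared
LIMIT 1;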

Could not locate a valid checkpoint record postgres


I have a Postgres database on a Windows server and two slaves for reading.

I need to do an incremental backup. What I’ve done:
– created a base backup with pg_basebackup
– saved the WAL files

Restore (what I did):
– deleted (manually) the whole data folder (ok)
– deleted the files in pg_xlog/ of the backup (ok)
– copied the data folder generated by the backup onto the server
– configured recovery.conf (sketched below)
– went to ‘services’ and started the server
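
A minimal recovery.conf for this kind of restore usually looks something like this (a sketch; C:\wal_archive is a hypothetical location for the saved WAL files):

# recovery.conf
restore_command = 'copy "C:\\wal_archive\\%f" "%p"'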

This is the log:

2015-11-11 17:57:18 BRT LOG: database system was interrupted; last known up at 2015-11-09 18:48:18 BRT
2015-11-11 17:57:18 BRT LOG: creating missing WAL directory "pg_xlog/archive_status"
2015-11-11 17:57:19 BRT FATAL: the database system is starting up
2015-11-11 17:57:19 BRT LOG: starting archive recovery
2015-11-11 17:57:19 BRT LOG: invalid primary checkpoint record
2015-11-11 17:57:19 BRT LOG: invalid secondary checkpoint record
2015-11-11 17:57:19 BRT PANIC: could not locate a valid checkpoint record
2015-11-11 17:57:19 BRT LOG: startup process (PID 2060) was terminated by exception 0xC0000409
2015-11-11 17:57:19 BRT HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
2015-11-11 17:57:19 BRT LOG: aborting startup because the startup process failed

Does anyone know how to solve this?

need better db design for project


I have two tables:

  1. notes
  2. users

Notes Table:

id || description ||CreatedBy(which is userid)||......other columns
---------------------------------------

Users table:

id || name || ......other columns
---------------------------------

Now I have a scenario in which each user can create a note and then share it with other users.

To implement this, I will create a new table, UserNotes.

UserNotes Table:

id || userId || NotesID
-----------------------

But suppose I have one note which I want to share with all the other users; then my UserNotes table will have 100 entries. Similarly, if I have 1000 notes and all notes are shared with all users, then my new table will have 1000*1000 entries.

Also, a user who has created a note and shared it with other users can later remove the sharing for some of them. For example, if a user creates a note and initially shares it with 100 users, he can later remove some user associations or add some new users.

Is my approach correct?

Or is there a better approach to achieve the above scenario considering ease of insert, update and select?

I am using PostgreSQL as my RDBMS.

Can you please help?
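
For concreteness, a sketch of the sharing table described above (column names are assumed from the tables shown):

CREATE TABLE user_notes (
    note_id integer NOT NULL REFERENCES notes (id),
    user_id integer NOT NULL REFERENCES users (id),
    PRIMARY KEY (note_id, user_id)          -- one row per (note, user) share
);

-- share a note with a user, or remove the share again (ids are hypothetical)
INSERT INTO user_notes (note_id, user_id) VALUES (1, 7);
DELETE FROM user_notes WHERE note_id = 1 AND user_id = 7;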

SQLAlchemy connection error with Postgres


I am trying to write a basic Python script that uses SQLAlchemy to connect to my Postgres database. In the connection details, if I use “LOCALHOST”, it connects fine. If I reference the server by name, it throws the following error:

OperationalError: (psycopg2.OperationalError) FATAL: no pg_hba.conf entry for host “fe80::19e5:a9a0:bcc2:1c5b%11”, user “postgres”, database “postgres”, SSL off

I checked pg_hba.conf, and here’s what I have:

# IPv4 local connections:
host    all             all             127.0.0.1/32            trust

host    all             all             0.0.0.0/0               md5
# IPv6 local connections:
host    all             all             ::1/128                 trust

host    all             all             0.0.0.0/0               md5

I am able to connect to this Postgres database using other tools, like PgAdmin from a different machine, Tableau, Talend, etc.

UPDATE: Instead of using the name of the machine, I tried its static IPv4 address, which worked. The problem has something to do with the IPv6 connection.
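
For reference, the entries above only cover the IPv6 loopback (::1/128) plus IPv4 ranges; an entry for the link-local IPv6 range that the fe80:: address in the error falls into would look roughly like this (a sketch):

# IPv6 link-local connections:
host    all             all             fe80::/10               md5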


Using an index on a Postgresql integer range causing trouble


Unfortunately I was not able to create a usable example / fiddle. But I am using a Postgresql 9.4 table that looks somewhat like this:

CREATE TABLE table1 (id INT PRIMARY KEY NOT NULL, dtype TEXT,
                     year_range INT4RANGE, geom GEOMETRY);
CREATE INDEX dtype_idx ON table2 USING BTREE (dtype);
CREATE INDEX year_range_idx ON table2 USING BTREE (year_range);
CREATE INDEX table2_gix ON table2 USING GIST (geom);

table1 and table2 are basically identical. The query causing the trouble would be:

SELECT a.id, a.dtype, a.year_range, count(*)
FROM table1 AS a LEFT JOIN table2 AS b
    ON ST_DWithin(a.geom, b.geom, 500)
    AND a.dtype = b.dtype AND a.year_range = b.year_range
GROUP BY a.id, a.dtype, a.geom, a.year_range;

where geom is a PostGIS geometry, dtype a TEXT, and year_range an INT4RANGE. The above query will result in about 200 rows and the count from the LEFT JOIN will be between 0 and 200.

Recently I started using more indexes throughout my database, and suddenly some parts of my program no longer worked properly and memory use went through the roof. The above query typically took about 40 seconds, with PostgreSQL’s RAM use constant at ca. 2 GB.

After adding a BTREE index on the year_range column, however, the query takes 2 minutes and RAM use doubles to 4 GB (which does not happen for other queries).

Can you help me where to look for an issue or how to resolve this? Could it possibly be a bug?

EXPLAIN (ANALYZE, BUFFERS) for the query using CREATE INDEX ON table2 USING BTREE (year_range):

GroupAggregate  (cost=1989.80..1995.50 rows=228 width=54) (actual time=96278.148..96283.820 rows=228 loops=1)
  Group Key: a.id, a.dtype, a.geom, a.year_range
  Buffers: shared hit=16671469 read=6332935
  ->  Sort  (cost=1989.80..1990.37 rows=228 width=54) (actual time=96278.134..96278.896 rows=11281 loops=1)
        Sort Key: a.id, a.dtype, a.geom, a.year_range
        Sort Method: quicksort  Memory: 1971kB
        Buffers: shared hit=16671469 read=6332935
        ->  Nested Loop Left Join  (cost=0.43..1980.87 rows=228 width=54) (actual time=0.097..96243.523 rows=11281 loops=1)
              Buffers: shared hit=16671469 read=6332935
              ->  Seq Scan on table1 a (cost=0.00..9.28 rows=228 width=54) (actual time=0.014..0.344 rows=228 loops=1)
                    Buffers: shared read=7
              ->  Index Scan using year_range_idx on table2 b  (cost=0.43..8.64 rows=1 width=50) (actual time=25.213..422.075 rows=49 loops=228)
                    Index Cond: (a.year_range = year_range)
                    Filter: ((a.dtype = dtype) AND (a.geom && st_expand(geom, 500::double precision)) AND (geom && st_expand(a.geom, 500::double precision)) AND _st_dwithin(a.geom, geom, 500::double precision))
                    Rows Removed by Filter: 253251
                    Buffers: shared hit=16671469 read=6332928
Planning time: 0.343 ms
Execution time: 96684.861 ms

EXPLAIN (ANALYZE, BUFFERS) for the query using CREATE INDEX ON table2 USING GIST (year_range):

GroupAggregate  (cost=1994.38..2000.08 rows=228 width=54) (actual time=114066.421..114071.834 rows=228 loops=1)
  Group Key: a.id, a.dtype, a.geom, a.year_range
  Buffers: shared hit=41789804 read=5608602 written=9067
  ->  Sort  (cost=1994.38..1994.95 rows=228 width=54) (actual time=114066.412..114067.126 rows=11281 loops=1)
        Sort Key: a.id, a.dtype, a.geom, a.year_range
        Sort Method: quicksort  Memory: 1971kB
        Buffers: shared hit=41789804 read=5608602 written=9067
        ->  Nested Loop Left Join  (cost=0.41..1985.45 rows=228 width=54) (actual time=14.395..114033.344 rows=11281 loops=1)
              Buffers: shared hit=41789804 read=5608602 written=9067
              ->  Seq Scan on table1 a (cost=0.00..9.28 rows=228 width=54) (actual time=0.015..0.333 rows=228 loops=1)
                    Buffers: shared read=7
              ->  Index Scan using year_range_idx on table2 b  (cost=0.41..8.66 rows=1 width=50) (actual time=24.222..500.090 rows=49 loops=228)
                    Index Cond: (a.year_range = year_range)
                    Filter: ((a.dtype = dtype) AND (a.geom && st_expand(geom, 500::double precision)) AND (geom && st_expand(a.geom, 500::double precision)) AND _st_dwithin(a.geom, geom, 500::double precision))
                    Rows Removed by Filter: 253251
                    Buffers: shared hit=41789804 read=5608595 written=9067
Planning time: 0.343 ms
Execution time: 114073.068 ms

EXPLAIN (ANALYZE, BUFFERS) for the query using no INDEX on year range, i.e. DROP INDEX year_range_idx (check out the time: 186ms?!):

GroupAggregate  (cost=136824.31..136830.01 rows=228 width=54) (actual time=180.088..186.006 rows=228 loops=1)
  Group Key: a.id, a.dtype, a.geom, a.year_range
  Buffers: shared hit=71091 read=4763
  ->  Sort  (cost=136824.31..136824.88 rows=228 width=54) (actual time=180.074..180.962 rows=11281 loops=1)
        Sort Key: a.id, a.dtype, a.geom, a.year_range
        Sort Method: quicksort  Memory: 1971kB
        Buffers: shared hit=71091 read=4763
        ->  Nested Loop Left Join  (cost=29.15..136815.38 rows=228 width=54) (actual time=0.245..159.873 rows=11281 loops=1)
              Buffers: shared hit=71091 read=4763
              ->  Seq Scan on table1 a (cost=0.00..9.28 rows=228 width=54) (actual time=0.013..0.128 rows=228 loops=1)
                    Buffers: shared hit=1 read=6
              ->  Bitmap Heap Scan on table2 b  (cost=29.15..600.02 rows=1 width=50) (actual time=0.169..0.689 rows=49 loops=228)
                    Recheck Cond: (geom && st_expand(a.geom, 500::double precision))
                    Filter: ((a.dtype = dtype) AND (a.year_range = year_range) AND (a.geom && st_expand(geom, 500::double precision)) AND _st_dwithin(a.geom, geom, 500::double precision))
                    Rows Removed by Filter: 257
                    Heap Blocks: exact=69630
                    Buffers: shared hit=71090 read=4757
                    ->  Bitmap Index Scan on table2_gix  (cost=0.00..29.15 rows=140 width=0) (actual time=0.124..0.124 rows=307 loops=228)
                          Index Cond: (geom && st_expand(a.geom, 500::double precision))
                          Buffers: shared hit=4412 read=1805
Planning time: 0.301 ms
Execution time: 186.366 ms

Migrate Postgres from VM to VM


I have 2 VMs, both identical in structural terms (they came from the same snapshot; nothing was installed, upgraded or removed).

One VM is “clean”, never used.

The other VM was heavily used and has some data stored in its Postgres DB.

Is there a way to migrate the Postgres database from the heavily used VM to the clean VM just by copying the DB files from a specific folder?

For instance, let’s say that Postgres (I’m assuming this part) writes all its files in the directory XYZ. If I copy all files from that XYZ directory to the clean VM, will Postgres work?

I’m not worried about whether that is the correct way to do it or not. This is just a PoC, I need to do it with the least effort possible, and copying files around seems like the best fit here.
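
For what it’s worth, a sketch of the file-copy approach (it assumes the same PostgreSQL major version and architecture on both VMs, the server stopped while copying, and a Debian/Ubuntu-style data directory; the paths and host name are hypothetical):

# on the used VM: stop the server so the data directory is consistent
sudo service postgresql stop

# copy the whole data directory to the clean VM
rsync -a /var/lib/postgresql/9.4/main/ cleanvm:/var/lib/postgresql/9.4/main/

# on the clean VM: fix ownership and start the server
sudo chown -R postgres:postgres /var/lib/postgresql/9.4/main
sudo service postgresql start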

Optimize a query with small LIMIT, predicate on one column and order by another


I’m using Postgres 9.3.4 and I have 4 queries that have very similar inputs but have vastly different response times:

Query #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..585.44 rows=100 width=1041) (actual time=326092.852..507360.199 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14871916.35 rows=2542166 width=1041) (actual time=326092.301..507359.524 rows=100 loops=1)
         Filter: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
         Rows Removed by Filter: 6913925
 Total runtime: 507361.944 ms

Query #2

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;                                            

    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1)
   ->  Sort  (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1)
         Sort Key: external_created_at
         Sort Method: top-N heapsort  Memory: 80kB
         ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1)
               Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
               Filter: (deleted_at IS NULL)
               Rows Removed by Filter: 109
 Total runtime: 2.125 ms

Query #3

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..821.25 rows=100 width=1041) (actual time=19.224..55.599 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=1818959 width=1041) (actual time=19.213..55.529 rows=100 loops=1)
         Filter: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
         Rows Removed by Filter: 414
 Total runtime: 55.683 ms

Query #4

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc
LIMIT 100 OFFSET 0;
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..69055.29 rows=100 width=1041) (actual time=26.094..320.626 rows=100 loops=1)
   ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=21621 width=1041) (actual time=26.093..320.538 rows=100 loops=1)
         Filter: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
         Rows Removed by Filter: 6156
 Total runtime: 320.778 ms

All 4 are the same, apart from looking at posts with different source_ids.

Three of the four end up using the following index:

CREATE INDEX index_posts_on_external_created_at ON posts USING btree (external_created_at DESC)
WHERE (deleted_at IS NULL);

And query #2 uses this index:

CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);

What’s interesting to me is that, of the 3 queries that use the index_posts_on_external_created_at index, two are quite fast, while the other (#1) is insanely slow.

Query #2 has way fewer posts than the other 3 do, so that might explain why it uses the index_posts_on_source_id index instead. However, if I get rid of the index_posts_on_external_created_at index, the other 3 queries are extremely slow when forced to use the index_posts_on_source_id index.

Here’s my definition of the posts table:

CREATE TABLE posts (
    id integer NOT NULL,
    source_id integer,
    message text,
    image text,
    external_id text,
    created_at timestamp without time zone,
    updated_at timestamp without time zone,
    external text,
    like_count integer DEFAULT 0 NOT NULL,
    comment_count integer DEFAULT 0 NOT NULL,
    external_created_at timestamp without time zone,
    deleted_at timestamp without time zone,
    poster_name character varying(255),
    poster_image text,
    poster_url character varying(255),
    poster_id text,
    position integer,
    location character varying(255),
    description text,
    video text,
    rejected_at timestamp without time zone,
    deleted_by character varying(255),
    height integer,
    width integer
);

I’ve tried using CLUSTER posts USING index_posts_on_external_created_at, which physically reorders the table by external_created_at, and this seems to be the only effective method I’ve found. However, I am unable to use it in production because it takes a global lock for several hours while it runs. I’m on Heroku, so I can’t install pg_repack or anything like that.

Why would the #1 query be so slow, and others be really quick? What can I do to mitigate this?

EDIT: Here are my queries without the LIMIT and ORDER

Query #1

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7455044.81..7461163.56 rows=2447503 width=1089) (actual time=94903.143..95110.898 rows=238975 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 81440kB
   ->  Bitmap Heap Scan on posts  (cost=60531.78..1339479.50 rows=2447503 width=1089) (actual time=880.150..90988.460 rows=238975 loops=1)
         Recheck Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
         Rows Removed by Index Recheck: 5484857
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 3108465
         ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..59919.90 rows=3267549 width=0) (actual time=877.904..877.904 rows=3347440 loops=1)
               Index Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
 Total runtime: 95534.724 ms

Query #2

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=36913.72..36935.85 rows=8852 width=1089) (actual time=212.450..212.549 rows=944 loops=1)
   Sort Key: external_created_at
   Sort Method: quicksort  Memory: 557kB
   ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..32094.90 rows=8852 width=1089) (actual time=1.732..209.590 rows=944 loops=1)
         Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 109
 Total runtime: 214.507 ms

Query #3

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=5245032.87..5249894.14 rows=1944508 width=1089) (actual time=131032.952..134342.372 rows=1674072 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 854864kB
   ->  Bitmap Heap Scan on posts  (cost=48110.86..1320005.55 rows=1944508 width=1089) (actual time=605.648..91351.334 rows=1674072 loops=1)
         Recheck Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
         Rows Removed by Index Recheck: 5304550
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 879414
         ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..47624.73 rows=2596024 width=0) (actual time=602.744..602.744 rows=2553486 loops=1)
               Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
 Total runtime: 136176.868 ms

Query #4

EXPLAIN ANALYZE SELECT posts.* FROM posts
WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
AND posts.deleted_at IS NULL
ORDER BY external_created_at desc;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=102648.92..102704.24 rows=22129 width=1089) (actual time=15225.250..15256.931 rows=51408 loops=1)
   Sort Key: external_created_at
   Sort Method: external merge  Disk: 35456kB
   ->  Index Scan using index_posts_on_source_id on posts  (cost=0.45..79869.91 rows=22129 width=1089) (actual time=3.975..14803.320 rows=51408 loops=1)
         Index Cond: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
         Filter: (deleted_at IS NULL)
         Rows Removed by Filter: 54
 Total runtime: 15397.453 ms

Postgres memory settings:

name, setting, unit
'default_statistics_target','100',''
'effective_cache_size','16384','8kB'
'maintenance_work_mem','16384','kB'
'max_connections','100',''
'random_page_cost','4',NULL
'seq_page_cost','1',NULL
'shared_buffers','16384','8kB'
'work_mem','1024','kB'

Database stats:

Total Posts: 20,997,027
Posts where deleted_at is null: 15,665,487
Distinct source_id's: 22,245
Max number of rows per single source_id: 1,543,950
Min number of rows per single source_id: 1
Most source_ids in a single query: 21
Distinct external_created_at: 11,146,151
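
For reference, the kind of index usually suggested for this WHERE source_id IN (…) ORDER BY external_created_at DESC LIMIT n pattern (a sketch; it is not one of the indexes listed above) keeps each source’s non-deleted posts pre-sorted by external_created_at, so only the matching rows have to be read and top-N sorted:

CREATE INDEX index_posts_on_source_id_and_external_created_at
    ON posts (source_id, external_created_at DESC)
    WHERE deleted_at IS NULL;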

Visualising a raster PostgreSQL table in QGIS


I have imported a raster tif image into PostgreSQL with this command:

raster2pgsql -s 32643 -I -M filepath.tif -F -t 100x100 public.databassename > filepath.sql

And imported the output SQL file into the PostgreSQL database by running:

psql -U postgres -d databasename -f filepath.sql

After connecting to this database in QGIS, I am not able to visualise the tif image due to missing geometry content. How can I visualise the tif raster imagery in QGIS?

pg_stat_statements not found even with “shared_preload_libraries = 'pg_stat_statements'”


I have followed the instructions given below:

http://www.postgresql.org/docs/9.3/static/pgstatstatements.html

… to the effect of adding the following lines:

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all

… in postgresql.conf, and then restarting the server, but the pg_stat_statements view is still not visible:

$ cat /usr/share/postgresql/9.3/postgresql.conf | grep -A3 ^shared_preload_libraries
shared_preload_libraries = 'pg_stat_statements' # (change requires restart)
pg_stat_statements.max = 10000
pg_stat_statements.track = all


$ sudo /etc/init.d/postgresql restart
* Restarting PostgreSQL 9.3 database server          [ OK ] 

$ psql -U postgres
psql (9.3.10)
Type "help" for help.

postgres=# select count(*) from pg_stat_activity;
count 
-------
      1
(1 row)

postgres=# select count(*) from pg_stat_statements;
ERROR:  relation "pg_stat_statements" does not exist

update

After executing:

sudo apt-get install postgresql-contrib-9.3

And then doing:

$ psql -U postgres
psql (9.4.5, server 9.3.10)
Type "help" for help.

postgres=# create extension pg_stat_statements;
CREATE EXTENSION
postgres=# \dx
                                     List of installed extensions
         Name        | Version |   Schema   |                        Description                        
--------------------+---------+------------+-----------------------------------------------------------
 pg_stat_statements | 1.1     | public     | track execution statistics of all SQL statements executed
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
 (2 rows)

 postgres=# \quit
$ sudo /etc/init.d/postgresql restart
 * Restarting PostgreSQL 9.3 database server  [OK]

… I now get:

postgres=# select * from pg_stat_statements ;
ERROR:  pg_stat_statements must be loaded via shared_preload_libraries

system details

I am running on Ubuntu 14.04.03 LTS. PostgreSQL was installed with apt-get install.

log trace of PostgreSQL during a restart

While doing a sudo /etc/init.d/postgresql restart I get the following log trace:

$ tail -f /var/log/postgresql/postgresql-9.3-main.log
2015-12-21 11:11:31 EET [23790-2] LOG:  received fast shutdown request
2015-12-21 11:11:31 EET [23790-3] LOG:  aborting any active transactions
2015-12-21 11:11:31 EET [7231-1] esavo-user@RegTAP FATAL:  terminating connection due to administrator command
2015-12-21 11:11:31 EET [23903-5] postgres@postgres FATAL:  terminating connection due to administrator command
2015-12-21 11:11:31 EET [23822-7] esavo-user@RegTAP FATAL:  terminating connection due to administrator command
2015-12-21 11:11:31 EET [23795-2] LOG:  autovacuum launcher shutting down
2015-12-21 11:11:31 EET [23815-1] esavo-user@RegTAP FATAL:  terminating connection due to administrator command
2015-12-21 11:11:31 EET [23792-1] LOG:  shutting down
2015-12-21 11:11:31 EET [23792-2] LOG:  database system is shut down
2015-12-21 11:11:32 EET [16886-1] LOG:  database system was shut down  at 2015-12-21 11:11:31 EET
2015-12-21 11:11:32 EET [16886-2] LOG:  MultiXact member wraparound  protections are now enabled
2015-12-21 11:11:32 EET [16885-1] LOG:  database system is ready to accept connections
2015-12-21 11:11:32 EET [16890-1] LOG:  autovacuum launcher started
2015-12-21 11:11:33 EET [16892-1] [unknown]@[unknown] LOG:  incomplete startup packet
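
One thing worth double-checking (a sketch; on Debian/Ubuntu packages the live configuration is normally /etc/postgresql/9.3/main/postgresql.conf rather than anything under /usr/share/postgresql):

$ psql -U postgres -c "SHOW config_file;"
$ psql -U postgres -c "SHOW shared_preload_libraries;"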