I have to randomly match about 20 million rows across two tables in PostgreSQL 9.3.5, without using any row twice. Both tables have the following structure:
CREATE TABLE de_sim_points_end
(
  -- Inherited from table de_sim_points: id integer NOT NULL DEFAULT nextval('de_sim_points_id_seq'::regclass),
  -- Inherited from table de_sim_points: parent_geometry character varying(12),
  -- Inherited from table de_sim_points: used boolean DEFAULT false,
  -- Inherited from table de_sim_points: geom geometry(Point,900913),
  CONSTRAINT de_sim_points_end_pkey PRIMARY KEY (id)
)
INHERITS (de_sim_points)
WITH (
  FILLFACTOR=50,
  OIDS=FALSE
);

ALTER TABLE de_sim_points_end
  OWNER TO benjamin;

CREATE INDEX de_sim_points_end_geom_idx
  ON de_sim_points_end
  USING gist
  (geom)
  WITH (FILLFACTOR=100);
ALTER TABLE de_sim_points_end CLUSTER ON de_sim_points_end_geom_idx;

CREATE INDEX de_sim_points_end_parent_relation_idx
  ON de_sim_points_end
  USING btree
  (parent_geometry COLLATE pg_catalog."default")
  WITH (FILLFACTOR=100);

CREATE INDEX de_sim_points_end_used_idx
  ON de_sim_points_end
  USING btree
  (used)
  WITH (FILLFACTOR=50);
The SQL query and Python code used to match the rows can be found at https://github.com/boerngen-schmidt/commuter-simulation/blob/master/code/builder/process_point_mass_matcher.py (I will not post them here due to their length). Basically, what I do is the following (a simplified sketch of the query follows the list):
- SELECT rows from the start point table in random order
- JOIN rows from the destination point table, also in random order
- Iterate over the result and mark the matched points in the start and destination tables as used
- Save the result in another table (done by another thread)
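A simplified sketch of what I mean (not the exact query from the repository; de_sim_points_start is the start point table, and the batch size of 1000 is just an example):

-- Pair a random batch of unused start points with a random batch of
-- unused destination points by row number.
WITH s AS (
    SELECT id, row_number() OVER () AS rn
    FROM (
        SELECT id
        FROM de_sim_points_start
        WHERE NOT used
        ORDER BY random()
        LIMIT 1000
    ) AS sub
), e AS (
    SELECT id, row_number() OVER () AS rn
    FROM (
        SELECT id
        FROM de_sim_points_end
        WHERE NOT used
        ORDER BY random()
        LIMIT 1000
    ) AS sub
)
SELECT s.id AS start_id, e.id AS end_id
FROM s
JOIN e USING (rn);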
When I was using multiple processes with SELECT … FOR UPDATE, I ran into deadlocks (currently I use just one process, so the FOR UPDATE clause is omitted). So far I have read a bit about SERIALIZABLE and other locking mechanisms, but I don't fully grasp them yet.
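Roughly, each worker process ran something along these lines before I dropped the FOR UPDATE (simplified sketch, not the exact statement from the repository; the start table name and batch size are placeholders):

BEGIN;
-- Each worker tries to grab and lock its own random batch of unused
-- start points; with several workers locking rows in different random
-- orders, this is where the deadlocks occurred.
SELECT id
FROM de_sim_points_start
WHERE NOT used
ORDER BY random()
LIMIT 1000
FOR UPDATE;
-- ... same for de_sim_points_end, then UPDATE ... SET used = true ...
COMMIT;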
Is it possible to make the matching compatible with multiprocessing? What would the statements look like?