I’m trying to write a query to detect possibly-invalid data in a PostgreSQL table. We have a table of city names like this:
# `city_names`
| id | name   | language | dialect | city_id |
|----|--------|----------|---------|---------|
| 01 | London | A        | A1      | 1       |
| 02 | London | A        | A2      | 1       |
| 03 | London | B        | B1      | 2       |
| 04 | London | B        | B2      | 3       |
In our domain:
- It’s fine that rows 01 and 02 both map “London” to city 1; the dialects don’t happen to differ
- It’s fine that row 03 maps “London” to city 2; in that language, the name may refer to a different city
- It’s suspicious that row 04 maps “London” to city 3, because we already have a mapping to city 2 in the same language
I want to write a query that selects only rows 03 and 04 so that a human can decide whether one of them points to the wrong city.
I can solve this problem procedurally, but I’m having trouble doing it in SQL. For example, if I `GROUP BY` `language` and `name`, I lose the `city_id` values from the individual rows.
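To make the intended logic concrete, here is a minimal Python sketch of the procedural version. It is only an illustration: the in-memory `rows` list stands in for the table, and the function name `suspicious_rows` is made up for this example.

```python
from collections import defaultdict

# Sample data copied from the table above: (id, name, language, dialect, city_id).
# In practice these rows would come from the database; this list is a stand-in.
rows = [
    ("01", "London", "A", "A1", 1),
    ("02", "London", "A", "A2", 1),
    ("03", "London", "B", "B1", 2),
    ("04", "London", "B", "B2", 3),
]


def suspicious_rows(rows):
    """Return rows whose (name, language) pair maps to more than one city_id."""
    # First pass: collect the distinct city_ids seen for each (name, language).
    city_ids = defaultdict(set)
    for _id, name, language, _dialect, city_id in rows:
        city_ids[(name, language)].add(city_id)

    # Second pass: keep the full rows belonging to any ambiguous pair.
    return [r for r in rows if len(city_ids[(r[1], r[2])]) > 1]


print([r[0] for r in suspicious_rows(rows)])  # ['03', '04']
```

The second pass is the part I can’t express in SQL: it returns the original rows, `city_id` included, rather than one collapsed row per group.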
Basically my goal is: “If there’s more than one city_id for the same name and language, list those city_ids.”
How can I do this?