I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo, rgeo-activerecord and activerecord-postgis-adapter, but the Rails stuff is rather unimportant.
The table structure:
measurement
 - int id
 - int anchor_id
 - Point groundtruth
 - data (not important for the query)
Example data:
id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 2         | POINT(1 4)  | ...
4  | 3         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 1         | POINT(4 3)  | ...
10 | 5         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...
This table is some kind of manually created view for faster lookups (with millions of rows). Else we'd have to join 8 tables and it would get even slower. But that's not part of the problem.
Simple version:
Parameters:
- Point p
- int d
What the query should do:
1. The query looks for all groundtruth Points which have a distance < d from Point p
SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d
2. Now we have a list of groundtruth points with their anchor_ids. As you can see in the table above, it is possible to have multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3 and groundtruth=POINT(1 4).
3. Next I'd like to eliminate the identical tuples, by choosing one of them randomly(!). Why not simply take the first? Because the data column is different.
Choosing a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1
My problem with all of this is: I can imagine a solution using SQL LOOPs and lot's of subqueries, but there's for sure a solution using GROUP BY or some other methods which will make it faster.
Full version:
Basically the same as above with one difference: The input parameters change:
- lot's of Points p1...p312456345
- still one d
If the simple query is working, this could be done using a LOOP in SQL. But maybe there is a better (and faster) solution, because the database is really huge!
Solution
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
    *
FROM measurement m, ps
WHERE EXISTS (
    SELECT 1
    FROM ps
    WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();
Thanks to Erwin Brandstetter!
 
    