Related, preceding question:
Select a random entry from a group after grouping by a value (not column)?
My current query looks like this:
WITH
  points AS (
    SELECT unnest(array_of_points) AS p
  ),
 gtps AS (
   SELECT DISTINCT ON(points.p)
     points.p, m.groundtruth
   FROM measurement m, points
   WHERE st_distance(m.groundtruth, points.p) < distance
   ORDER BY points.p, RANDOM()
 )
SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
  m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()
Semantics:
- there are two input values: - Line 4: an array of Points array_of_points
- Line 12: a double precision number: distance
 
- Line 4: an array of Points 
- First paragraph (lines 1-6): - Create a table from the points array for use in...
 
- Second paragraph (lines 8-14): - For each point inside the pointstable: get a random(!)groundtruthpoint from themeasurementtable, that has a distance <distance
- Save those tuples inside the gtpstable
 
- For each point inside the 
- Third paragraph (lines 16-19): - For each groundtruthvalue inside thegtpstable: get allanchor_idvalues and...
- If an anchor_idvalue is not unique: Then choose a random one
 
- For each 
- Output: - id,- anchor_id,- groundtruth,- p(input value from the- array_of_points)
Example table:
id | anchor_id | groundtruth | data
-----------------------------------
1  | 1         | POINT(1 4)  | ...
2  | 3         | POINT(1 4)  | ...
3  | 8         | POINT(1 4)  | ...
4  | 6         | POINT(1 4)  | ...
-----------------------------------
5  | 2         | POINT(3 2)  | ...
6  | 4         | POINT(3 2)  | ...
-----------------------------------
7  | 1         | POINT(4 3)  | ...
8  | 1         | POINT(4 3)  | ...
9  | 6         | POINT(4 3)  | ...
10 | 7         | POINT(4 3)  | ...
11 | 3         | POINT(4 3)  | ...
-----------------------------------
12 | 1         | POINT(6 2)  | ...
13 | 5         | POINT(6 2)  | ...
Example result:
id  | anchor_id | groundtruth | p
-----------------------------------------
1   | 1         | POINT(1 4)  | POINT(1 0)
2   | 3         | POINT(1 4)  | POINT(1 0)
4   | 6         | POINT(1 4)  | POINT(1 0)
3   | 8         | POINT(1 4)  | POINT(1 0)
5   | 2         | POINT(3 2)  | POINT(2 2)
6   | 4         | POINT(3 2)  | POINT(2 2)
1   | 1         | POINT(1 4)  | POINT(4 8)
2   | 3         | POINT(1 4)  | POINT(4 8)
4   | 6         | POINT(1 4)  | POINT(4 8)
3   | 8         | POINT(1 4)  | POINT(4 8)
12  | 1         | POINT(6 2)  | POINT(7 3)
13  | 5         | POINT(6 2)  | POINT(7 3)
1   | 1         | POINT(4 3)  | POINT(9 1)
11  | 3         | POINT(4 3)  | POINT(9 1)
9   | 6         | POINT(4 3)  | POINT(9 1)
10  | 7         | POINT(4 3)  | POINT(9 1)
As you can see:
- Each input value can have multiple equal groundtruthvalues.
- If an input value has multiple groundtruthvalues, those must all be equal.
- Each groundtruth-inputPoint-tuple is connected with every possilbe anchor_idfor that groundtruth.
- Two different input values can have the same corresponding groundtruthvalue.
- Two distinct groundtruth-inputPoint-tuples can have the same anchor_id
- Two indentical groundtruth-inputPoint-tuples must have different anchor_ids
Benchmarks (for two input values):
- Lines 1-6: 16ms
- Lines 8-14: 48ms
- Lines 16-19: 600ms
EXPLAIN VERBOSE:
Unique  (cost=11119.32..11348.33 rows=18 width=72)
  Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
  CTE points
    ->  Result  (cost=0.00..0.01 rows=1 width=0)
          Output: unnest('{0101000000EE7C3F355EF24F4019390B7BDA011940:01010000003480B74082FA44402CD49AE61D173C40}'::geometry[])
  CTE gtps
    ->  Unique  (cost=7659.95..7698.12 rows=1 width=160)
          Output: points.p, m.groundtruth, (random())
          ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160)
                Output: points.p, m.groundtruth, (random())
                Sort Key: points.p, (random())
                ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160)
                      Output: points.p, m.groundtruth, random()
                      Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                      ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32)
                            Output: points.p
                      ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=132)
                            Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
  ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72)
        Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, (random())
        Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())
        ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72)
              Output: m.id, m.anchor_id, gtps.groundtruth, gtps.p, random()
              ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64)
                    Output: gtps.p, gtps.groundtruth
              ->  Seq Scan on public.measurement m  (cost=0.00..535.01 rows=22901 width=8)
                    Output: m.id, m.anchor_id, m.tag_node_id, m.experiment_id, m.run_id, m.anchor_node_id, m.groundtruth, m.distance, m.distance_error, m.distance_truth, m."timestamp"
EXPLAIN ANALYZE:
Unique  (cost=11119.32..11348.33 rows=18 width=72) (actual time=548.991..657.992 rows=36 loops=1)
  CTE points
    ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.004..0.011 rows=2 loops=1)
  CTE gtps
    ->  Unique  (cost=7659.95..7698.12 rows=1 width=160) (actual time=133.416..146.745 rows=2 loops=1)
          ->  Sort  (cost=7659.95..7679.04 rows=7634 width=160) (actual time=133.415..142.255 rows=15683 loops=1)
                Sort Key: points.p, (random())
                Sort Method: external merge  Disk: 1248kB
                ->  Nested Loop  (cost=0.00..6565.63 rows=7634 width=160) (actual time=0.045..46.670 rows=15683 loops=1)
                      Join Filter: (st_distance(m.groundtruth, points.p) < m.distance)
                      ->  CTE Scan on points  (cost=0.00..0.02 rows=1 width=32) (actual time=0.007..0.020 rows=2 loops=1)
                      ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=132) (actual time=0.013..3.902 rows=22901 loops=2)
  ->  Sort  (cost=3421.18..3478.43 rows=22901 width=72) (actual time=548.989..631.323 rows=45802 loops=1)
        Sort Key: gtps.p, gtps.groundtruth, m.anchor_id, (random())"
        Sort Method: external merge  Disk: 4008kB
        ->  Nested Loop  (cost=0.00..821.29 rows=22901 width=72) (actual time=133.449..166.294 rows=45802 loops=1)
              ->  CTE Scan on gtps  (cost=0.00..0.02 rows=1 width=64) (actual time=133.420..146.753 rows=2 loops=1)
              ->  Seq Scan on measurement m  (cost=0.00..535.01 rows=22901 width=8) (actual time=0.014..4.409 rows=22901 loops=2)
Total runtime: 834.626 ms
When running live this should run with about 100-1000 input values. So for now it would take 35 to 350 seconds which is far to much.
I already tried to remove the RANDOM() functions. This decreases the runtime (for 2 input values) from about 670ms to about 530ms. So this isn't the main impact at the moment.
It's also possible to run 2 or 3 separate queries and do some parts in software (it's running on a Ruby on Rails server) if that's easier/faster. For example the random selection?!
Work in progress:
SELECT
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY groundtruth, ps.p
With this query it is extremely fast (15ms), but there's missing a lot:
- I just need a random row for each ps.p
- The two arrays belong to each other. Means: the order of the items inside is important!
- Those two arrays need to get filtered (randomly):
 For eachanchor_idin the array that appears more than once: keep a random one and delete all other. This also means to remove the correspondingidfrom theid-array for every deletedanchor_id
It would also be nice if anchor_id and id could be stored inside an array of tuples. For example: {[4,1],[6,3],[4,2],[8,5],[4,4]} (constraints: every tuple is unique, every id (== 2nd value in the example) is unique, anchor_ids are not unique). This example displays the query without the filters that still must be applied. With the filters applied, it would look like this {[6,3],[4,4],[8,5]}.
Work in progress II:
SELECT DISTINCT ON (ps.p)
  m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
  measurement m
JOIN
  (SELECT unnest(point_array) AS p) AS ps
  ON ST_DWithin(ps.p, m.groundtruth, distance)
GROUP BY ps.p, m.groundtruth
ORDER BY ps.p, RANDOM()
This now give pretty nice results and is still very fast: 16ms
There's just one thing left to do:
- ARRAY_AGG(m.anchor_id)is already randomized, but:
- it contains lots of duplicate entries, so:
- I'd like to use something like DISTINCT on it, but:
- it has to be synchronized with ARRAY_AGG(m.id). This means:
 If the DISTINCT command keeps the indices 1, 4 and 7 of theanchor_idarray, then it has also to keep indices 1, 4 and 7 of theidarray (and of course delete all others)
 
     
    