I've looked at a lot of questions trying to figure this out and finally got a query that does what I want. But I'm wondering if there's a better way of doing this? Essentially, I have a bunch of tables that get joined together due to a belongs_to relationship and I want the number of rows being returned to only be as many as the largest number of rows in one of these belongs_to tables.
Since that's kind of awkwardly worded, here's an example. Say I have 2 recipes and each have steps and some nutrition. Recipe 1 has 3 steps and 3 nutritions, and recipe 2 has 2 steps and 4 nutritions. For Recipe 1 at most 3 rows should be returned, and for recipe 2 at most 4 rows should be returned. Here's a fiddle with the data: http://sqlfiddle.com/#!9/bcce59/2
In case the fiddle doesn't work for some reason here's the table schema:
CREATE TABLE recipe
    (`id` int PRIMARY KEY, `title` varchar(64))
;
CREATE TABLE step
    (`rid` int, `instruction` varchar(64),
    FOREIGN KEY(rid) REFERENCES recipe(id) )
;
CREATE TABLE nutrition
    (`rid` int, `name` varchar(64), `amount` int,
    FOREIGN KEY(rid) REFERENCES recipe(id) )
;
And here's some sample data:
INSERT INTO recipe
    (`id`, `title`)
VALUES
    (1, 'Cookies'),
    (2, 'Bananas')
;
INSERt INTO step
    (`rid`, `instruction`)
VALUES
    (1, 'Unwrap'),
    (1, 'Dip in milk'),
    (1, 'Eat'),
    (2, 'Peal'),
    (2, 'Eat')
;
INSERT INTO nutrition
    (`rid`, `name`, `amount`)
VALUES
    (1, 'calories', 120),
    (1, 'sugar', 300),
    (1, 'fat', 50),
    (2, 'calories', 50),
    (2, 'sugar', 50),
    (2, 'fat', 20),
    (2, 'carb', 30)
;
Now, I thought I might be able to do this with a group by at first. But things like
SELECT id, title, instruction, name, amount FROM
recipe 
LEFT JOIN step ON recipe.id = step.rid
LEFT JOIN nutrition on recipe.id = nutrition.rid
GROUP BY id, instruction, name, amount;
Will return 17 rows since it's a product and the number of unique pairings of the group by columns is 9 for recipe 1 and 8 for recipe 2. So that's out. After a lot of searching amongst the tags and pouring over MySQL documentation and a cookbook book I have, I came up with the following query which does do the job:
SELECT id, title, instruction, name, amount FROM 
(
    SELECT 
        id, 
        title, 
        instruction, 
        name, 
        amount
    FROM recipe 
        LEFT JOIN step ON recipe.id = step.rid
        LEFT JOIN nutrition on recipe.id = nutrition.rid
) data
INNER JOIN 
(
    SELECT 
        s.rid,
        CASE
            WHEN
                GREATEST(numSteps, numNutrition) = numSteps
            THEN instruction
            WHEN
                GREATEST(numSteps, numNutrition) = numNutrition
            THEN name
        END as row
    FROM 
    (
        SELECT
            rid,
            instruction
        FROM step GROUP BY rid, instruction 
    ) s
    LEFT JOIN
        (
            SELECT
                rid,
                name
            FROM nutrition GROUP BY rid, name
        ) n
    ON s.rid = n.rid
    LEFT JOIN
        (
            SELECT rid, COUNT(*) as numNutrition 
            FROM nutrition GROUP BY rid
        ) nSum
    ON n.rid = nSum.rid
    LEFT JOIN
        (
            SELECT rid, COUNT(*) as numSteps 
            FROM step GROUP BY rid
        ) sSum
    ON s.rid = sSum.rid
    GROUP by rid, row
) biggest 
ON data.id = biggest.rid 
GROUP BY data.id, biggest.row
;
However, generalizing my baby example of 2 belongs_to tables to my actual database which has more than 20 tables to join makes me concerned. My real data has between 15k - 90k rows per 'recipe' when using the naive joining method, so I'm concerned about both the performance of the query and that I might just be missing something very basic and simple that will help solve this problem. I don't really want to write a stored procedure to do this, though I do wonder if a view table would make sense? My question is
- Is there a way to write the above query in a better/more performant way?
- Would it make sense to construct a view table or something like that to effectively cache the results of a possibly long and painful query?
Apologies for the odd question title, I'm not sure how to succinctly phrase what I'm doing with this query.
I realized my fiddle isn't producing the right data, so here's an edit to make it clear what the final result set of a query should be:
+----+---------+-------------+----------+--------+
| id | title   | instruction | name     | amount |
+----+---------+-------------+----------+--------+
|  1 | Cookies | Unwrap      | calories |    120 |
|  1 | Cookies | Dip in milk | sugar    |    300 |
|  1 | Cookies | Eat         | fat      |     50 |
|  2 | Bananas | Peel        | calories |     50 |
|  2 | Bananas | Peel        | sugar    |     50 |
|  2 | Bananas | Eat         | fat      |     20 |
|  2 | Bananas | Eat         | carb     |     30 |
+----+---------+-------------+----------+--------+
7 rows in set (0.00 sec)
Something like this, with each instruction/nutrition value appearing at least once in the result set. With duplicates allowed for those columns which do not have the largest number of rows compared to the other belongs_to tables.
