Connect by query

Question

I'm storing hierarchical data in a table. When a resource is accessed by its hierarchical path (grantParent/parent/resource), I need to locate the resource using a CONNECT BY query.

Note: SQL commands are exported from EnterpriseDB, but it should work in Oracle too.

Table structure:

CREATE TABLE resource_hierarchy
(
  resource_id character varying(100) NOT NULL,
  resource_type integer NOT NULL,
  resource_name character varying(100),
  parent_id character varying(100)
)
WITH (
  OIDS=FALSE
);

Data:

INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('36d27991', 3, 'areaName',    'a616f392');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('a616f392', 3, 'townName',    'fcc1ebb7');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('fcc1ebb7', 2, 'stateName',   '8369cc88');
INSERT INTO "resource_hierarchy" (resource_id,resource_type,resource_name,parent_id) VALUES ('8369cc88', 5, 'countryName', null);

Now, when I receive a path like

countryName/stateName/townName/areaName

I'm executing a query like,

select LEVEL,* from resource_hierarchy
WHERE resource_name = (
            CASE LEVEL 
                WHEN 1 THEN 'areaName'
                WHEN 2 THEN 'townName'
                WHEN 3 THEN 'stateName'
                WHEN 4 THEN 'countryName'
                ELSE ''
            END
         )
 connect by prior parent_id = resource_id
 start with resource_name = 'areaName';

My expected results are:

LEVEL   resource_id resource_type   resource_name   parent_id
-------------------------------------------------------------
1       36d27991    3               areaName        a616f392
2       a616f392    3               townName        fcc1ebb7
3       fcc1ebb7    2               stateName       8369cc88
4       8369cc88    5               countryName     <null>

This query works fine, but I'm not sure if it would run faster, when my table is big like hundreds of thousands of entries.

Can you optimize this query for my requirement?

Edited:

EXPLAIN for the above query: I've defined two indices - one on resource_id (primary key) and another on parent_id

Sort  (cost=66.85..66.86 rows=1 width=694)
  Sort Key: connectby_cte.siblingssortcol
  CTE prior
    ->  Recursive Union  (cost=0.00..65.83 rows=31 width=151)
      ->  WindowAgg  (cost=0.00..3.12 rows=1 width=83)
        ->  Seq Scan on resource_hierarchy  (cost=0.00..3.11 rows=1 width=83)
              Filter: ((resource_name)::text = 'areaName'::text)
      ->  WindowAgg  (cost=0.33..6.21 rows=3 width=151)
        ->  Hash Join  (cost=0.33..6.15 rows=3 width=151)
              Hash Cond: ((resource_hierarchy_1.resource_id)::text = (prior.parent_id)::text)
              Join Filter: connectby_cyclecheck(prior.recursionpath, (resource_hierarchy_1.parent_id)::text)
              ->  Seq Scan on resource_hierarchy resource_hierarchy_1  (cost=0.00..2.89 rows=89 width=83)
              ->  Hash  (cost=0.20..0.20 rows=10 width=286)
                ->  WorkTable Scan on prior  (cost=0.00..0.20 rows=10 width=286)
  ->  CTE Scan on prior connectby_cte  (cost=0.00..1.01 rows=1 width=694)
    Filter: ((resource_name)::text = CASE level WHEN 1 THEN 'areaName'::text WHEN 2 THEN 'townName'::text WHEN 3 THEN 'stateName'::text WHEN 4 THEN 'countryName'::text ELSE ''::text END)

What if you get a path like countryName/stateName/townName ? — I_am_Batman, Apr 14 '16 at 16:00
It should still be supported. Results would contain last three entries (excluding areaName) — Karthik Murugan, Apr 14 '16 at 16:28
Please execute and check. Does it work in your query? Output is as expected? Just checking if we are on same page..I replaced resource_name='townName', got nothing. — I_am_Batman, Apr 14 '16 at 17:07
Also remove this line " WHEN 1 THEN 'areaName' " and decrement subsequent WHEN values by 1. For example, WHEN 1 THEN 'townName' WHEN 2 THEN 'stateName' and so on... — Karthik Murugan, Apr 14 '16 at 17:13
Hi Karthik, You said "this query works fine" which means you tested it. Did you test it with an actual input in the "path" format you posted? I ask because - at least in the query you posted - I don't see where you separated just the areaName from the longer string. If done wrong, that part of your query may take an unnecessarily long time, especially if you need to rig it so you can accept either a four-part path or a three-part path (down to townName) as discussed in comments. — , Apr 17 '16 at 11:29
@Paparazzi - I think I know what Paparazzi meant because I have the same question. Is it possible that areaName is not unique? If it is not unique, do you want to return all the possibilities? This may return lots of data (for example if you start with townName = 'Springfield' in the United States). — , Apr 17 '16 at 11:33
@mathguy : If configured properly, below query does exactly that. Resource_name should be 'Springfield', whose id can be tagged to corresponding state name. And yes, it returns all hierarchies, based on levels. Karthik: This is where I got stuck too — I_am_Batman, Apr 17 '16 at 11:52
No doubt, I didn't mean this should cause difficulties writing a query - just pointing out that the specification needs to be clarified. Usually you would be given a unique id, but the OP seems to want to start from actual names. Just making sure we are all on the same page and EXPECT that sometimes the result may be many-fold. — , Apr 17 '16 at 11:58
Hi, the names in the path are not unique, but the combination is unique. The name is unique under the parent. i.e, under country, the state names are unique and under a state, the city names are unique. — Karthik Murugan, Apr 17 '16 at 16:13
I'm generating this query dynamically from my Java code. By splitting the URL by slash, I can get the URL parts and then generate the query like I mentioned — Karthik Murugan, Apr 17 '16 at 16:17

ThinkJet · Accepted Answer · 2016-04-19T12:20:26.630

Disclaimer: My primary experience belongs to Oracle DBMS, so pay attention to details if applying solution to Postgres.

Where clause applied after full hierarchy already built, therefore in original query database engine started retrieving data with specified resource_name at any level and building a full tree for each found record. Filtering occurs only on the next step.
Documentation:

Oracle selects the root row(s) of the hierarchy—those rows that satisfy the START WITH condition.

Oracle selects the child rows of each root row. Each child row must satisfy the condition of the CONNECT BY condition with respect to one of the root rows.

Oracle selects successive generations of child rows. Oracle first selects the children of the rows returned in step 2, and then the children of those children, and so on. Oracle always selects children by evaluating the CONNECT BY condition with respect to a current parent row.

If the query contains a WHERE clause without a join, then Oracle eliminates all rows from the hierarchy that do not satisfy the condition of the WHERE clause. Oracle evaluates this condition for each row individually, rather than removing all the children of a row that does not satisfy the condition.

To optimize this situation query must be changed as follows(hierarchy reversed to more natural top-down order):

select 
  level, rh.* 
from 
  resource_hierarchy rh
start with 
  (resource_name = 'countryName')
  and 
  (parent_id is null) -- roots only
connect by 
  prior resource_id = parent_id
  and          
  -- at each step get only required records
  resource_name = (
    case level 
      when 1 then 'countryName'
      when 2 then 'stateName'
      when 3 then 'townName'
      when 4 then 'areaName'
      else null
    end
  )

Same query may be writed on the basis of CTE syntax (Oracle recursive subquery factoring).
Following is a variant for PostgreSQL CTE, corrected according to @Karthik_Murugan suggestion:

with RECURSIVE hierarchy_query(lvl, resource_id) as (
    select
      1               lvl, 
      rh.resource_id  resource_id
    from
      resource_hierarchy rh
    where
     (resource_name = 'countryName') and (parent_id is null) 

  union all

    select
      hq.lvl+1        lvl,
      rh.resource_id  resource_id
    from
      hierarchy_query    hq,
      resource_hierarchy rh
    where
      rh.parent_id = hq.resource_id
      and
      -- at each step get only required records
      resource_name = (
        case (hq.lvl + 1)
          when 2 then 'stateName'
          when 3 then 'townName'
          when 4 then 'areaName'
          else null
        end
      )
)
select
  hq.lvl, rh.*
from
  hierarchy_query    hq,
  resource_hierarchy rh
where
  rh.resource_id = hq.resource_id
order by
  hq.lvl

It's only half of the work because we need to help database engine to locate records by creating appropriate indexes.
Query above contains two search actions:
1. Locate records to start with;
2. Choose records on each next level.

For the first action, we need to index resource_name field and, possible, parent_id field.
For the second action fields parent_id and resource_name must be indexed.

create index X_RESOURCE_HIERARCHY_ROOT on RESOURCE_HIERARCHY (resource_name);
create index X_RESOURCE_HIERARCHY_TREE on RESOURCE_HIERARCHY (parent_id, resource_name);

Maybe creating only X_RESOURCE_HIERARCHY_TREE index is enough. It depends on characteristics of data stored in a table.

P.S. String for each level can be constructed from full path by using substr and instr functions like in this example for Oracle:

with prm as (
  select 
    '/countryName/stateName/townName/areaName/' location_path 
  from dual
)
select 
  substr(location_path,
    instr(location_path,'/',1,level)+1,
    instr(location_path,'/',1,level+1)-instr(location_path,'/',1,level)-1
  )          
from prm connect by level < 7

Very well explained. Only issue I have is that EnterpriseDB doesn't support multiple conditions in "CONNECT BY" clause. This is the reason I used bottom-up approach. — Karthik Murugan, Apr 19 '16 at 09:33
Is this restriction also applies to recursive queries built with CTE ([described here](http://www.postgresql.org/docs/current/static/queries-with.html))? I updated an answer with example of this syntax. — ThinkJet, Apr 19 '16 at 10:26
I ended up with a similar CTE based query after seeing our original response. Yes, its supported in EDB but slightly different syntax. Let me post it as a separate answer. You're amazing!! Thanks for your time! — Karthik Murugan, Apr 19 '16 at 10:29
Ok, your CTE query works in EDB after I added the keyword "recursive" after "WITH". So the query should start as, "WITH RECURSIVE hierarchy_query...". Can you edit your query? — Karthik Murugan, Apr 19 '16 at 10:33
And do you recommend same INDEXes for this CTE query as well? — Karthik Murugan, Apr 19 '16 at 10:34
1) Ok, `RECURSIVE` keyword added to a query. 2) Yes, indexes must be same for both variants, because database engine must perform same selections: get a first row(s) based on `resource_name` and get rows for each step based on `parent_id` and `resource_name`. — ThinkJet, Apr 19 '16 at 12:26

I_am_Batman · Answer 2 · 2016-04-18T10:39:38.747

1

select 
     LEVEL, 
     resource_id, 
     resource_type, 
     resource_name, 
     parent_id 
from   
     resource_hierarchy 
connect by prior parent_id = resource_id 
start with UPPER(resource_name)= UPPER(:resource_name);

Using this approach, you would not have to use the CASE statements. Just mentioning the resource Name would fetch the parent hierarchies.

edited Apr 18 '16 at 10:39

answered Apr 14 '16 at 17:19

I_am_Batman

895
9
21

I should have mentioned that resource_name entries are not unique. We need to locate the full hierarchy based on a URL like path. For e.g. countryName/stateName/townName/areaName. We dont need the hierarchy of any "areaName", but the one under townName, which is under stateName and which is under countryName – Karthik Murugan Apr 14 '16 at 17:30
We would be maintaining the mapping between parent_id and resource_id right? – I_am_Batman Apr 14 '16 at 17:33
exactly!! Entries are mapped by resource_id and parent_id. – Karthik Murugan Apr 14 '16 at 17:38
As I mentioned, the resource_name is not unique, but the combination is. This query will not give me the expected result, but many more results with identical "resource_name" in their leaf nodes – Karthik Murugan Apr 18 '16 at 07:36
Ok, let me try once again. – I_am_Batman Apr 18 '16 at 08:02
**EDIT :** Does this result set give you any hope that I am in right direction? select * from(select LEVEL, resource_id, resource_type, resource_name, parent_id ,SYS_CONNECT_BY_PATH(resource_name, '/') "Path", '/' || REVERSE(LTRIM(SYS_CONNECT_BY_PATH(REVERSE(resource_name), '/'), '/')) AS reversed_path from resource_hierarchy connect by prior parent_id = resource_id )where reversed_path = '/stateName/townName'; – I_am_Batman Apr 18 '16 at 08:06
There's a bug. will fix. – I_am_Batman Apr 18 '16 at 09:35

Karthik Murugan · Answer 3 · 2016-04-22T11:19:30.783

A slightly different query to what @ThinkJet came up with. This works in EDB and gives expected results.

WITH RECURSIVE rh (resource_id, resource_name, parent_id, level) AS 
(   
    SELECT resource_id, resource_name, parent_id, 1 as level FROM resource_hierarchy
    where resource_name = 'countryName' AND parent_id IS NULL
    UNION ALL
    SELECT cur.resource_id, cur.resource_name, cur.parent_id, level+1 FROM resource_hierarchy cur, rh prev WHERE cur.parent_id = prev.resource_id AND 
        cur.resource_name = (
                    CASE level 
                    WHEN 3 THEN 'areaName'
                    WHEN 2 THEN 'townName'
                    WHEN 1 THEN 'stateName'
                    END
                 )
)
SELECT * FROM rh

Edit: This query may match even partial matches, but we can always make sure that number of records = number of URL elements. Also if the URL has just one element (like /countryName), remove the UNION part from above query to get the expected result.

Connect by query

3 Answers3