It's not decode that's the problem, it's usually fetch on 8086.
Starting two separate instruction-decode operations probably is more expensive than just fetching more microcode for one loop instruction.  I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.
Equally or more importantly, 8086 is often bottlenecked by memory access, including code-fetch.  (8088 almost always is, breathing through a straw with it's 8-bit bus, unlike 8086's 16-bit bus).
dec cx is 1 byte, jnz rel8 is 2 bytes.
So 3 bytes total, vs. 2 for loop rel8.
8086 performance can be approximated by counting memory accesses and multiply by four, since its 6-byte instruction prefetch buffer allows it to overlap code-fetch with decode and execution of other instructions.  (Except for very slow instructions like mul that would let the buffer fill up after at most three 2-byte fetches.)
See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.
https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are only execution, when fetch isn't a bottleneck.  (i.e. just decoding from the prefetch buffer.)
Decode / execute timings, not including fetch:
DEC     Decrement
    operand     bytes   8088    186     286     386     486     Pentium
    r8           2       3       3       2       2       1       1   UV
    r16          1       3       3       2       2       1       1   UV
    r32          1       3       3       2       2       1       1   UV
    mem       2+d(0,2)  23+EA   15       7       6       3       3   UV
Jcc     Jump on condition code
    operand     bytes   8088    186     286     386     486     Pentium
    near8        2      4/16    4/13    3/7+m   3/7+m   1/3     1    PV
    near16       3       -       -       -      3/7+m   1/3     1    PV
LOOP    Loop control with CX counter
      operand   bytes   8088    186     286     386     486     Pentium
      short      2      5/17    5/15    4/8+m   11+m    6/7     5/6  NP
So even ignoring code-fetch differences:
- dec+ taken- jnztakes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
- taken looptakes 17 cycles to decode / exec on 8086 / 8088.
(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction.  IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)
8088/8086 are not pipelined except for the code-prefetch buffer.  Finishing execution of one instruction and starting decode / exec of the next take it some time; even the cheapest instructions (like mov reg,reg / shift / rotate / stc/std / etc.) take 2 cycles.  Bizarrely more than nop (3 cycles).