You are testing with way too small an input; while a dictionary comprehension doesn't have as much of a performance advantage against a for loop when compared to a list comprehension, for realistic problem sizes it can and does beat for loops, especially when targeting a global name.
Your input consists of just 3 key-value pairs. Testing with 1000 elements instead, we see that the timings are very close instead:
>>> import timeit
>>> from random import choice, randint; from string import ascii_lowercase as letters
>>> looped = '''\
... b = {}
... for i,j in a.items():
...     b[j]=i
... '''
>>> dictcomp = '''b = {v: k for k, v in a.items()}'''
>>> def rs(): return ''.join([choice(letters) for _ in range(randint(3, 15))])
...
>>> a = {rs(): rs() for _ in range(1000)}
>>> len(a)
1000
>>> count, total = timeit.Timer(looped, 'from __main__ import a').autorange()
>>> (total / count) * 1000000   # microseconds per run
66.62004760000855
>>> count, total = timeit.Timer(dictcomp, 'from __main__ import a').autorange()
>>> (total / count) * 1000000   # microseconds per run
64.5464928005822
The difference is there, the dict comp is faster but only just at this scale. With 100 times as many key-value pairs the difference is a bit bigger:
>>> a = {rs(): rs() for _ in range(100000)}
>>> len(a)
98476
>>> count, total = timeit.Timer(looped, 'from __main__ import a').autorange()
>>> total / count * 1000  # milliseconds, different scale!
15.48140200029593
>>> count, total = timeit.Timer(dictcomp, 'from __main__ import a').autorange()
>>> total / count * 1000  # milliseconds, different scale!
13.674790799996117
which is not that big a difference when you consider both processed nearly 100k key-value pairs. Still, the for loop is clearly slower.
So why the speed difference with 3 elements? Because a comprehension (dictionary, set, list comprehensions or a generator expression) is under the hood implemented as a new function, and calling that function has a base cost the plain loop doesn't have to pay.
Here's the disassembly for the bytecode for both alternatives; note the MAKE_FUNCTION and CALL_FUNCTION opcodes in the top-level bytecode for the dict comprehension, there is a separate section for what that function then does, and there are actually very few differences in between the two approaches here:
>>> import dis
>>> dis.dis(looped)
  1           0 BUILD_MAP                0
              2 STORE_NAME               0 (b)
  2           4 SETUP_LOOP              28 (to 34)
              6 LOAD_NAME                1 (a)
              8 LOAD_METHOD              2 (items)
             10 CALL_METHOD              0
             12 GET_ITER
        >>   14 FOR_ITER                16 (to 32)
             16 UNPACK_SEQUENCE          2
             18 STORE_NAME               3 (i)
             20 STORE_NAME               4 (j)
  3          22 LOAD_NAME                3 (i)
             24 LOAD_NAME                0 (b)
             26 LOAD_NAME                4 (j)
             28 STORE_SUBSCR
             30 JUMP_ABSOLUTE           14
        >>   32 POP_BLOCK
        >>   34 LOAD_CONST               0 (None)
             36 RETURN_VALUE
>>> dis.dis(dictcomp)
  1           0 LOAD_CONST               0 (<code object <dictcomp> at 0x11d6ade40, file "<dis>", line 1>)
              2 LOAD_CONST               1 ('<dictcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_NAME                0 (a)
              8 LOAD_METHOD              1 (items)
             10 CALL_METHOD              0
             12 GET_ITER
             14 CALL_FUNCTION            1
             16 STORE_NAME               2 (b)
             18 LOAD_CONST               2 (None)
             20 RETURN_VALUE
Disassembly of <code object <dictcomp> at 0x11d6ade40, file "<dis>", line 1>:
  1           0 BUILD_MAP                0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                14 (to 20)
              6 UNPACK_SEQUENCE          2
              8 STORE_FAST               1 (k)
             10 STORE_FAST               2 (v)
             12 LOAD_FAST                1 (k)
             14 LOAD_FAST                2 (v)
             16 MAP_ADD                  2
             18 JUMP_ABSOLUTE            4
        >>   20 RETURN_VALUE
The material differences: the looped code uses LOAD_NAME for b each iteration, and STORE_SUBSCR to store the key-value pair in dict loaded. The dictionary comprehension uses MAP_ADD to achieve the same thing as STORE_SUBSCR but doesn't have to load that b name each time. 
But with 3 iterations only, the MAKE_FUNCTION / CALL_FUNCTION combo the dict comprehension has to execute is the real drag on the performance:
>>> make_and_call = '(lambda i: None)(None)'
>>> dis.dis(make_and_call)
  1           0 LOAD_CONST               0 (<code object <lambda> at 0x11d6ab270, file "<dis>", line 1>)
              2 LOAD_CONST               1 ('<lambda>')
              4 MAKE_FUNCTION            0
              6 LOAD_CONST               2 (None)
              8 CALL_FUNCTION            1
             10 RETURN_VALUE
Disassembly of <code object <lambda> at 0x11d6ab270, file "<dis>", line 1>:
  1           0 LOAD_CONST               0 (None)
              2 RETURN_VALUE
>>> count, total = timeit.Timer(make_and_call).autorange()
>>> total / count * 1000000
0.12945385499915574
More than 0.1 μs to create a function object with one argument, and call it (with an extra LOAD_CONST for the None value we pass in)! And that's just about the difference between the looped and comprehension timings for 3 key-value pairs.
You can liken this to being surprised that a man with a shovel can dig a small hole faster than a backhoe can. The backhoe can certainly dig fast, but a man with a shovel can get started faster if you need to get the backhoe started and moved into position first!
Beyond a few key-value pairs (digging a bigger hole), the function create and call cost fades away into nothingness. At this point the dict comprehension and the explicit loop basically do the same thing:
- take the next key-value pair, pop those on the stack
- call the dict.__setitem__hook via a bytecode operation with the top two items on the stack (eitherSTORE_SUBSCRorMAP_ADD. This doesn't count as a 'function call' as it's all internally handled in the interpreter loop.
This is different from a list comprehension, where the plain loop version would have to use list.append(), involving an attribute lookup, and a function call each loop iteration. The list comprehension speed advantage comes from this difference; see Python list comprehension expensive
What a dict comprehension does add, is that the target dictionary name only needs to be looked up once, when binding b to the the final dictionary object. If the target dictionary is a global instead of a local variable, the comprehension wins, hands down:
>>> a = {rs(): rs() for _ in range(1000)}
>>> len(a)
1000
>>> namespace = {}
>>> count, total = timeit.Timer(looped, 'from __main__ import a; global b', globals=namespace).autorange()
>>> (total / count) * 1000000
76.72348440100905
>>> count, total = timeit.Timer(dictcomp, 'from __main__ import a; global b', globals=namespace).autorange()
>>> (total / count) * 1000000
64.72114819916897
>>> len(namespace['b'])
1000
So just use a dict comprehension. The difference with < 30 elements to process is too small to care about, and the moment you are generating a global or have more items, the dict comprehension wins out anyway.