This is the optimal solution; an AND would require at least two more instructions, possibly having to stall while the mask value is loaded. So it is worse in a couple of ways.
00000000 <swap>:
   0:   e1a03420    lsr r3, r0, #8
   4:   e1830400    orr r0, r3, r0, lsl #8
   8:   e1a00800    lsl r0, r0, #16
   c:   e1a00820    lsr r0, r0, #16
  10:   e12fff1e    bx  lr
00000000 <swap>:
   0:   ba40        rev16   r0, r0
   2:   b280        uxth    r0, r0
   4:   4770        bx  lr
The latter is armv7; it is shorter because armv7 added instructions to support exactly this kind of work.
Fixed-length RISC instructions by definition have a problem with constants.  MIPS chose one way to deal with it, ARM chose another.  Constants are a problem on CISC as well, just a different problem.  It is not difficult to create something that takes advantage of ARM's barrel shifter and shows a disadvantage of the MIPS solution, and vice versa.
The solution actually has a bit of elegance to it.
Part of this as well is the overall design of the target.
unsigned short fun ( unsigned short x )
{
    return(x+1);
}
0000000000000010 <fun>:
  10:   8d 47 01                lea    0x1(%rdi),%eax
  13:   c3                      retq   
gcc chooses not to return the 16-bit value you asked for; it returns 32 bits, so strictly it doesn't implement the function as written.  But that is okay so long as the consumer of the result applies the mask when it uses the value, or, on this architecture, reads ax instead of eax, for example.
unsigned short fun ( unsigned short x )
{
    return(x+1);
}
unsigned int fun2 ( unsigned short x )
{
    return(fun(x));
}
0000000000000010 <fun>:
  10:   8d 47 01                lea    0x1(%rdi),%eax
  13:   c3                      retq   
0000000000000020 <fun2>:
  20:   8d 47 01                lea    0x1(%rdi),%eax
  23:   0f b7 c0                movzwl %ax,%eax
  26:   c3                      retq   
A compiler design choice (likely based on the architecture), not an implementation bug.
Note that for a sufficiently sized project, it is easy to find missed optimization opportunities.  There is no reason to expect an optimizer to be perfect (it isn't and can't be).  It just needs to be more efficient, on average, than a human doing the job by hand for a project of that size.
This is why it is commonly said that for performance tuning you don't pre-optimize or jump straight to asm.  You use the high-level language and the compiler, profile your way to the actual performance problems, then hand-code those.  Why hand-code them?  Because we know we can at times outperform the compiler, which implies the compiler output can be improved upon.
This isn't a missed optimization opportunity; it is instead a very elegant solution for this instruction set.  Masking a byte is simpler:
unsigned char fun ( unsigned char x )
{
    return((x<<4)|(x>>4));
}
00000000 <fun>:
   0:   e1a03220    lsr r3, r0, #4
   4:   e1830200    orr r0, r3, r0, lsl #4
   8:   e20000ff    and r0, r0, #255    ; 0xff
   c:   e12fff1e    bx  lr
00000000 <fun>:
   0:   e1a03220    lsr r3, r0, #4
   4:   e1830200    orr r0, r3, r0, lsl #4
   8:   e6ef0070    uxtb    r0, r0
   c:   e12fff1e    bx  lr
The latter is armv7; with armv7 they recognized and solved these issues.  You can't expect the programmer to always use natural-sized variables; some feel the need to use less optimally sized ones, and sometimes you still have to mask down to a certain size.