First of all, your biggest problem is that the value you're looking for in b is NOT the same as a.  x86 is little-endian, so a memcpy from a to b (or any other byte-at-a-time copy without byte-swapping) would actually produce:
a   db 12h, 34h, 56h, 78h,   9Ah, 0,0,0  ; added padding 
b   dd 78563412h,            0000009Ah
Your b  dd 12345678h, 9A000000h  has the first dword endian-swapped, and the 5th byte of a as the MSB of the 2nd dword in b, not the LSB.
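To check the byte order without an assembler, here is a minimal C sketch of the same copy.  The helper names load_le32 and copy_and_read are made up for illustration; load_le32 assembles a dword from bytes the way a little-endian CPU like x86 loads it, so the result doesn't depend on the machine you compile on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes the way a little-endian CPU such as
   x86 loads a dword, independent of the host's own byte order. */
uint32_t load_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Byte-at-a-time copy of a (padded to 8 bytes) into b, then read b
   back as two dwords. */
void copy_and_read(uint32_t out[2])
{
    static const uint8_t a[8] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0, 0, 0};
    uint8_t b[8];

    memcpy(b, a, sizeof b);      /* plain copy, no byte-swapping */
    out[0] = load_le32(b);       /* first byte of a is the LSB   */
    out[1] = load_le32(b + 4);   /* 9Ah ends up as the LSB here  */
}
```

The first dword comes out as 78563412h and the second as 0000009Ah, matching the dd line above.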
Copying 5 bytes from a to b leaves the last 3 bytes of b uninitialized.  (On Unix, .bss space is zero-initialized.  I assume this happens for dup(?) space in MASM/TASM, but if not, whatever garbage was there before will still be there.)
If you copy 8 bytes from a to b, the three bytes after the 9A will be read from the start of b if a and b end up in the same section (rather than b going into the bss).  Perhaps this is why you used an org directive to separate them in your answer.
If you don't have any special reason to want to copy a dword all at once, then in 8086 code you should just use rep movsw, or normal mov instructions, like:
mov   ax, word ptr [a]      ; If your addresses are static, might as well just use
mov   dx, word ptr [a+2]    ; absolute addressing, esp. in 16-bit code where an address is only 2 bytes
mov   word ptr [b], ax      ; word ptr overrides the db / dd sizes MASM/TASM attach to a and b
mov   word ptr [b+2], dx
Note that your loop with si and di only increments them by 1, but you load/store two bytes.  Unaligned overlapping loads/stores work, but you're doing redundant work.
For your case, you have 5 bytes to copy.  You could use rep movsb with cx=5.  8086 of course doesn't support movsd or movsq, and rep startup overhead makes it inefficient for small copies.
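As a rough model of what a non-redundant loop looks like, here is a C sketch (not 8086 code) that moves one word per iteration and advances by 2, then finishes with a single byte move when the length is odd.  The function name copy_words is hypothetical; the index i stands in for si/di.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes in 2-byte steps, the way a loop
   that does a word load/store and then adds 2 to si and di would,
   finishing with one byte move if n is odd. */
void copy_words(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
    for (; i + 2 <= n; i += 2) {
        uint16_t w;
        memcpy(&w, src + i, sizeof w);   /* one word load  */
        memcpy(dst + i, &w, sizeof w);   /* one word store */
    }
    if (i < n)
        dst[i] = src[i];                 /* odd trailing byte */
}
```

For the 5-byte case this does two word moves and one byte move, and it never touches the last 3 bytes of an 8-byte destination.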
If you do care about doing both loads at once, e.g. from a dword that an interrupt handler can modify:
On a single-core CPU, we don't have to worry about memory being modified by other concurrent threads.  However, an interrupt (maybe triggering a context-switch to another thread) can arrive between any two instructions, though not in the middle of a single instruction.  (This is the big difference between single-core and multi-core atomicity: on a multi-core system, another core can access memory between the separate memory accesses made by one instruction.)
So, if you're loading a dword that can be modified asynchronously (e.g. by an interrupt handler), and you want to load both halves of it at once, you need to get both halves with a single instruction.
Do not use this if you're just writing normal single-threaded programs without interrupt handlers.
One way is with Sep Roland's les trick (see his answer), but that leaves ES temporarily set to something weird, which might be a problem depending on your interrupt handler.
Another way uses the x87 FPU.  It's not guaranteed to exist on an 8086 system (there it's a separate 8087 coprocessor), but when present it can copy in 32- or 64-bit chunks, e.g.
fild   dword ptr [a]    ; load 32bits as an integer
fistp  dword ptr [b]    ; store as the same integer
; also works with qword ptr
; or store to the stack and then load into dx:ax with two mov instructions
; your own stack memory is private, so you don't need atomic ops there
x87's internal 80-bit FP format can exactly represent every 64-bit integer, so this works on any possible bit-pattern.  (fld/fstp wouldn't, because fld requires a valid IEEE double-precision floating point representation, unlike fild.)
Even on 8086, it will be atomic with respect to interrupts.  With respect to other cores or bus masters, fild dword is atomic for aligned loads on 486 and later hardware.
gcc actually uses this to implement C++11 std::atomic<uint64_t> loads/stores in 32-bit mode (since the ISA guarantees that naturally-aligned loads/stores of 64-bit and smaller values are atomic, on P5 and later).
gcc used to bounce std::atomic<double> values around with fild/fstp when SSE2 wasn't available, but that was fixed after I reported it.  (I noticed the issue while answering Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs)
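If you're writing C rather than asm, you'd request that kind of 64-bit atomic load/store through the language and let the compiler pick the instructions.  A minimal sketch using C11 `<stdatomic.h>` (the C counterpart of std::atomic) follows.  The names shared_value, publish, snapshot and roundtrip_demo are hypothetical, and relaxed ordering is an assumption that only fits when you need the value to be untorn and nothing more.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared variable, e.g. updated by one context and read
   by another. */
_Atomic uint64_t shared_value;

/* Store all 64 bits as one indivisible operation. */
void publish(uint64_t v)
{
    atomic_store_explicit(&shared_value, v, memory_order_relaxed);
}

/* Load all 64 bits as one indivisible operation (no torn halves). */
uint64_t snapshot(void)
{
    return atomic_load_explicit(&shared_value, memory_order_relaxed);
}

/* Usage sketch: a value written with publish() comes back whole. */
void roundtrip_demo(void)
{
    publish(0x123456789ABCDEF0ull);
    assert(snapshot() == 0x123456789ABCDEF0ull);
}
```

Which instructions this compiles to depends on the compiler, target and flags, so check the generated asm if you care.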
See Agner Fog's Optimizing Assembly guide for other useful tricks.  (And also the x86 tag wiki).