Use movzx to load narrow data on modern CPUs.  (Or movsx if it's useful to have it sign-extended instead of zero-extended, but movzx is sometimes faster and never slower.)
movzx is only slow on the ancient P5 (original Pentium) microarchitecture, not anything made this century.  Pentium-branded CPUs based on recent microarchitectures, like Pentium G3258 (Haswell, 20th anniversary edition of the original Pentium), are totally different beasts, and perform like the equivalent i3 but without AVX, BMI1/2, or hyperthreading.
Don't tune modern code based on P5 guidelines / numbers.  However, Knight's Corner (Xeon Phi) is based on a modified P54C microarchitecture, so perhaps it has slow movzx as well.  Neither Agner Fog nor InstLatX64 has per-instruction throughput / latency numbers for KNC.
Using a 16-bit operand size instruction doesn't switch the whole pipeline over to 16-bit mode or cause a big perf hit.  See Agner Fog's microarch pdf to learn exactly what is and isn't slow on various x86 CPU microarchitectures (including ones as old as Intel P5 (original Pentium) which you seem to be talking about for some reason).
Writing a 16-bit register and then reading the full 32/64-bit register is slow on some CPUs (a partial-register stall when merging, on Intel P6-family).  On others, writing a 16-bit register merges into the old value, so there's a false dependency on the old value of the full register every time you write it, even if you never read the full register.  See Agner Fog's microarch guide for which CPU does what.  (Note that Haswell/Skylake only rename AH separately; Sandybridge, like Core2/Nehalem, also renames AL / AX separately from RAX, but merges without stalling.)
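For example, a minimal sketch of the problem pattern (using the same src1 label as the code below):
mov   ax, word [src1]    ; writes only AX: a false dependency on the old EAX/RAX
                         ; on CPUs that don't rename AX separately
mov   edx, eax           ; reads the full EAX: partial-register stall (merge) on P6-family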
Unless you specifically care about in-order P5 (or possibly Knight's Corner Xeon Phi, based on the same core, but IDK if movzx is slow there, too), USE THIS:
movzx   eax, word [src1]        ; as efficient as a 32-bit MOV load on most CPUs
cmp      ax, word [src2]
The operand-size prefix for cmp decodes efficiently on all modern CPUs.  Reading a 16-bit register after writing the full register is always fine, and the 16-bit load for the other operand is also fine.
The operand-size prefix isn't length-changing because there's no imm16 / imm32.  e.g. cmp word [src2], 0x7F is fine (it can use a sign-extended imm8), but
cmp word [src2], 0x80 needs an imm16 and will LCP-stall on some Intel CPUs.  (Without the operand-size prefix, the same opcode would have an imm32, i.e. the rest of the instruction would be a different length).  Instead, use mov eax, 0x80 / cmp word [src2], ax.
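For example, as code (each of the first two cmp lines is a stand-alone alternative, not a sequence):
cmp  word [src2], 0x7F   ; fine: fits in a sign-extended imm8, no length change from the 66 prefix
cmp  word [src2], 0x80   ; imm16: LCP stall in the decoders on some Intel CPUs
mov  eax, 0x80           ; workaround: put the immediate in a register first
cmp  word [src2], ax     ; 16-bit compare with no immediate, so no LCP stall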
The address-size prefix can be length-changing in 32-bit mode (disp32 vs. disp16), but we don't want to use 16-bit addressing modes to access 16-bit data.  We're still using [ebx+1234] (or rbx), not [bx+1234].
On modern x86 (Intel P6 / SnB-family / Atom / Silvermont, and AMD since at least K7, i.e. anything made in this century, newer than the actual P5 Pentium), movzx loads are very efficient.
On many CPUs, the load ports directly support movzx (and sometimes also movsx), so it runs as just a load uop, not as a load + ALU.
Data from Agner Fog's instruction-set tables:  Note they may not cover every corner case, e.g. mov-load numbers might only be for 32 / 64-bit loads.  Also note that Agner Fog's load latency numbers are not load-use latency from L1D cache; they only make sense as part of the store/reload (store-forwarding) latency, but relative numbers will tell us how many cycles movzx adds on top of mov (often no extra cycles).
(Update: https://uops.info/ has better test results that actually reflect load-use latency, and they're automated so typos and clerical errors in updating the spreadsheets aren't a problem.  But uops.info only goes back to Conroe (first-gen Core 2) for Intel, and only Zen for AMD.)
- P5 Pentium (in-order execution): movzx-load is a 3-cycle instruction (plus a decode bottleneck from the 0F prefix), vs. mov-loads having single-cycle throughput.  (They still have latency, though.)

- Intel:
  - PPro / Pentium II / III: movzx / movsx run on just a load port, same throughput as plain mov.
  - Core2 / Nehalem: same, including 64-bit movsxd, except on Core 2 where a movsxd r64, m32 load costs a load + ALU uop, and the two don't micro-fuse.
  - Sandybridge-family (SnB through Skylake and later): movzx / movsx loads are single-uop (just a load port), and perform identically to mov loads.
  - Pentium 4 (NetBurst): movzx runs on the load port only, same perf as mov.  movsx is load + ALU, and takes 1 extra cycle.
  - Atom (in-order): Agner's table is unclear about whether memory-source movzx / movsx need an ALU, but they're definitely fast.  The latency number is only for reg,reg.
  - Silvermont: same as Atom: fast, but unclear about which ports are needed.
  - KNL (based on Silvermont): Agner lists movzx / movsx with a memory source as using IP0 (ALU), but latency is the same as mov r,m so there's no penalty.  (Execution-unit pressure is not a problem because KNL's decoders can barely keep its 2 ALUs fed anyway.)

- AMD:
  - Bobcat: movzx / movsx loads are 1 per clock, 5 cycle latency.  mov-load is 4c latency.
  - Jaguar: movzx / movsx loads are 1 per clock, 4 cycle latency.  mov loads are 1 per clock, 3c latency for 32/64-bit, or 4c for mov r8/r16, m (but still only an AGU port, not an ALU merge like Haswell/Skylake do).
  - K7/K8/K10: movzx / movsx loads have 2-per-clock throughput, latency 1 cycle higher than a mov load.  They use an AGU and an ALU.
  - Bulldozer-family: same as K10, but movsx-load has 5 cycle latency.  movzx-load has 4 cycle latency, mov-load has 3 cycle latency.  So in theory it might be lower latency to mov cx, word [mem] and then movsx eax, cx (1 cycle), if the false dependency from a 16-bit mov load doesn't require an extra ALU merge or create a loop-carried dependency for your loop.  (See the sketch after this list.)
  - Ryzen: movzx / movsx loads run in the load port only, same latency as mov loads.

- VIA:
  - VIA Nano 2000/3000: movzx runs on the load port only, same latency as mov loads.  movsx is LD + ALU, with 1c extra latency.
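A minimal (untested) sketch of that Bulldozer-family idea, assuming the false dependency on the old value of RCX is harmless:
mov    cx, word [mem]    ; 3c mov-load, but merges into the old value of RCX
movsx  eax, cx           ; 1c reg,reg sign extension: 4c total vs. 5c for movsx eax, word [mem]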
 
When I say "perform identically", I mean not counting any partial-register penalties or cache-line splits from a wider load.  e.g. a movzx eax, word [rsi] avoids a merging penalty vs mov ax, word [rsi] on Skylake, but I'll still say that mov performs identically to movzx.  (I guess I mean that mov eax, dword [rsi] without any cache-line splits is as fast as movzx eax, word [rsi].)
xor-zeroing the full register before writing a 16-bit register avoids a later partial-register merging stall on Intel P6-family, as well as breaking false dependencies.
If you want to run well on P5 as well, this might be somewhat better there, while not being much worse on any modern CPU.  The exception is PPro through PIII, where xor-zeroing isn't dependency-breaking, even though it is still recognized as a zeroing idiom that makes EAX equivalent to AX (so there's no partial-register stall when reading EAX after writing AL or AX).
;; Probably not a good idea, maybe not faster on anything.
;mov  eax, 0             ; some code tuned for PIII used *both* this and xor-zeroing.
xor   eax, eax           ; *not* dep-breaking on early P6 (up to PIII)
mov    ax, word [src1]
cmp    ax, word [src2]
; safe to read EAX without partial-reg stalls
The operand-size prefix isn't ideal for P5, so you could consider using a 32-bit load if you're sure it doesn't fault, cross a cache-line boundary, or cause a store-forwarding failure from a recent 16-bit store.
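A minimal sketch of that, assuming the 2 bytes past src1 are safe to read (mapped, no cache-line split, and not written by a recent narrow store):
mov   eax, dword [src1]  ; plain 32-bit load: no 0F escape byte, no operand-size prefix
cmp   ax, word [src2]    ; the high garbage bytes in EAX are simply ignored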
Actually, I think a 16-bit mov load might be slower on Pentium than the movzx/cmp 2 instruction sequence.  There really doesn't seem to be a good option for working with 16-bit data as efficiently as 32-bit!  (Other than packed MMX stuff, of course).
See Agner Fog's guide for the Pentium details, but the operand-size prefix takes an extra 2 cycles to decode on P1 (original P5) and PMMX, so this sequence may actually be worse than a movzx load.  On P1 (but not PMMX), the 0F escape byte (used by movzx) also counts as a prefix, taking an extra cycle to decode.
Apparently movzx isn't pairable anyway.  Multi-cycle movzx will hide the decode latency of cmp ax, [src2], so movzx / cmp is probably still the best choice.  Or schedule instructions so the movzx is done earlier and the cmp can maybe pair with something.  Anyway, the scheduling rules are quite complicated for P1/PMMX.
I timed this loop on Core2 (Conroe) to prove that xor-zeroing avoids partial register stalls for 16-bit registers as well as low-8 (like for setcc al):
mov     ebp, 100000000
ALIGN 32
.loop:
%rep 4
    xor   eax, eax
;    mov   eax, 1234    ; just break dep on the old value, not a zeroing idiom
    mov   ax, cx        ; write AX
    mov   edx, eax      ; read EAX
%endrep
    dec   ebp           ; Core2 can't fuse dec / jcc even in 32-bit mode
    jg   .loop          ; but SnB does
perf stat -r4 ./testloop output for this, in a static binary that makes a sys_exit system call after the loop:
 ;; Core2 (Conroe) with   XOR eax, eax
       469,277,071      cycles                    #    2.396 GHz
     1,400,878,601      instructions              #    2.98  insns per cycle
       100,156,594      branches                  #  511.462 M/sec
             9,624      branch-misses             #    0.01% of all branches
       0.196930345 seconds time elapsed                                          ( +-  0.23% )
2.98 instructions per cycle makes sense: 3 ALU ports, all instructions are ALU, and there's no macro-fusion, so each is 1 uop.  So we're running at 3/4 of the front-end capacity.  The loop has 3*4 + 2 instructions / uops.
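(That's 14 instructions per iteration: 469M cycles / 100M iterations ≈ 4.7 cycles per iteration, and 14 / 4.7 ≈ 2.98 IPC.)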
Things are very different on Core2 with the xor-zeroing commented and using the mov eax, imm32 instead:
 ;; Core2 (Conroe) with   MOV eax, 1234
 1,553,478,677      cycles                    #    2.392 GHz
 1,401,444,906      instructions              #    0.90  insns per cycle
   100,263,580      branches                  #  154.364 M/sec
        15,769      branch-misses             #    0.02% of all branches
   0.653634874 seconds time elapsed                                          ( +-  0.19% )
0.9 IPC (down from 3) is consistent with the front-end stalling for 2 to 3 cycles to insert a merging uop on every mov   edx, eax.
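(Roughly (1,553M - 469M) extra cycles / (4 merges per iteration × 100M iterations) ≈ 2.7 extra cycles per merge.)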
Skylake runs both loops identically, because mov eax,imm32 is still dependency-breaking.  (Like most instructions with a write-only destination, but beware of false dependencies from popcnt and lzcnt/tzcnt).
Actually, the uops_executed.thread perf counter does show a difference: on SnB-family, xor-zeroing doesn't take an execution unit because it's handled in the issue/rename stage.  (mov    edx,eax is also eliminated at rename, so the uop count is actually quite low).  The cycle count is the same to within less than 1% either way.
 ;;; Skylake (i7-6700k) with xor-zeroing
 Performance counter stats for './testloop' (4 runs):
         84.257964      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.21% )
                 0      context-switches          #    0.006 K/sec                    ( +- 57.74% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 3      page-faults               #    0.036 K/sec                  
       328,337,097      cycles                    #    3.897 GHz                      ( +-  0.21% )
       100,034,686      branches                  # 1187.243 M/sec                    ( +-  0.00% )
     1,400,195,109      instructions              #    4.26  insn per cycle           ( +-  0.00% )  ## dec/jg fuses into 1 uop
     1,300,325,848      uops_issued_any           # 15432.676 M/sec                   ( +-  0.00% )    ###   fused-domain
       500,323,306      uops_executed_thread      # 5937.994 M/sec                    ( +-  0.00% )    ### unfused-domain
                 0      lsd_uops                  #    0.000 K/sec                  
       0.084390201 seconds time elapsed                                          ( +-  0.22% )
lsd.uops is zero because the loop buffer is disabled by a microcode update.  This bottlenecks on the front-end: uops (fused-domain) / clock = 3.960 (out of 4).  That last .04 might be partly OS overhead (interrupts and so on), because this is only counting user-space uops.