If the only CPU has the memory bus locked, no other device can read or modify memory contents during that time, not even via DMA. (Same deal with multiple CPUs on a shared bus with no cache.) Therefore no other memory operation at all can happen between the load and the store of, say, a `lock add [di], ax`, making it atomic with respect to any possible observer. (Other than a logic analyzer connected to the bus, which doesn't count.)
Semi-related: Can num++ be atomic for 'int num'? describes how the `lock` prefix works on modern CPUs for cacheable memory, providing RMW atomicity without a bus lock, just by hanging on to the cache line for the duration of the operation.
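For concreteness, here's a minimal C11 sketch of my own (assuming GCC/Clang targeting x86-64) of an atomic RMW that compiles to a single cache-locked instruction:

```c
#include <stdatomic.h>

atomic_int num;

void inc(void)
{
    /* On x86-64 this compiles to a single `lock add dword ptr [num], 1`:
       the load + add + store happen as one atomic RMW, implemented by
       holding the (aligned) cache line for the duration, not by
       locking the whole bus. */
    atomic_fetch_add_explicit(&num, 1, memory_order_relaxed);
}
```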
We call this a "cache lock": all modern CPUs work this way for aligned locked operations, only falling back to an expensive bus lock for something like `xchg [mem], ax` that spans a boundary between two cache lines (a "split lock"). That hurts memory throughput on all cores, and is so expensive that modern CPUs have a way to make split locks always fault (without affecting other unaligned loads/stores), as well as performance counters for them.
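(A sketch of my own, assuming 64-byte cache lines: C's `_Atomic` objects are naturally aligned, so a split lock can only happen if you defeat that, e.g. with a packed struct or raw `__atomic` builtins on an under-aligned pointer. Over-aligning makes the guarantee explicit:)

```c
#include <stdatomic.h>
#include <stdalign.h>

/* Natural alignment (4 bytes for an atomic_int) already guarantees that
   a locked RMW on it never crosses a 64-byte cache-line boundary.
   alignas(64) goes further and gives the hot counter a line to itself. */
alignas(64) atomic_int hits;
```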
Fun fact: `xchg [mem], reg` has implicit `lock` semantics on 386 and newer. (Which is unfortunate, because the implicit lock makes it too slow to use as just a plain load+store when you're running low on registers.) It didn't on 286 and earlier, unless you used `lock xchg`. This is possibly related to the fact that there were SMP 386 systems (with a primitive sequentially-consistent memory model); the modern x86 memory model applies to 486 and later SMP systems.
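You can see the implicit lock in action from C: a plain C11 exchange compiles to a bare `xchg` with no explicit prefix (again a sketch of mine, assuming GCC/Clang on x86-64):

```c
#include <stdatomic.h>

int take_flag(atomic_int *flag)
{
    /* Compiles to something like `xchg eax, [rdi]`: no lock prefix in
       the asm because xchg-with-memory is implicitly locked on 386 and
       later. It's a full atomic RMW and a full memory barrier. */
    return atomic_exchange(flag, 1);
}
```

That full-barrier property is also why compilers sometimes use `xchg` to implement a seq_cst store instead of `mov` + `mfence`.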