This is a silly missed optimization by ICC.  It's not specific to AVX512; it still happens with default/generic arch settings.
lea    ecx, DWORD PTR [16+rax] is computing i+16 as part of the unroll, with truncation to 32-bit (32-bit operand-size) and zero-extension to 64-bit (implicit in x86-64 when writing a 32-bit register).  This explicitly implements the semantics of unsigned wrap-around at the type width.
gcc and clang have no problem proving that unsigned i won't wrap, so they can optimize away the zero-extension from 32-bit unsigned to 64-bit pointer width for use in an addressing mode, because the loop upper bound is known¹.
Recall that unsigned wrap-around is well-defined in C and C++, but signed-overflow is undefined behaviour.  That means that signed variables can be promoted to pointer width, and that the compiler doesn't have to redo sign-extension to pointer width every time they're used as an array index.  (a[i] is equivalent to *(a+i), and the rules for adding integers to pointers mean that sign-extension is necessary for narrow values where the upper bits of the register might not match.)
Signed-overflow UB is why ICC is able to optimize properly for a signed counter even though it fails to use range info.  See also http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html (about undefined behaviour).  Notice that it's using add rax, 64 and cmp with 64-bit operand-size (RAX instead of EAX).
I made your code into a MCVE to test with other compilers.  __assume_aligned is ICC-only, so I used the GNU C __builtin_assume_aligned.
#define COUNTER_TYPE unsigned
double sum(const double *a) {
    a = __builtin_assume_aligned(a, 64);
    double s = 0.0;
    for ( COUNTER_TYPE i = 0; i < 1024*1024; i++ )
        s += a[i];
    return s;
}
clang compiles your function like this (Godbolt compiler explorer):
# clang 7.0 -O3
sum:                                    # @sum
    xorpd   xmm0, xmm0
    xor     eax, eax
    xorpd   xmm1, xmm1
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
    addpd   xmm0, xmmword ptr [rdi + 8*rax]
    addpd   xmm1, xmmword ptr [rdi + 8*rax + 16]
    addpd   xmm0, xmmword ptr [rdi + 8*rax + 32]
    addpd   xmm1, xmmword ptr [rdi + 8*rax + 48]
    addpd   xmm0, xmmword ptr [rdi + 8*rax + 64]
    addpd   xmm1, xmmword ptr [rdi + 8*rax + 80]
    addpd   xmm0, xmmword ptr [rdi + 8*rax + 96]
    addpd   xmm1, xmmword ptr [rdi + 8*rax + 112]
    add     rax, 16                                  # 64-bit loop counter
    cmp     rax, 1048576
    jne     .LBB0_1
    addpd   xmm1, xmm0
    movapd  xmm0, xmm1         # horizontal sum
    movhlps xmm0, xmm1              # xmm0 = xmm1[1],xmm0[1]
    addpd   xmm0, xmm1
    ret
I didn't enable AVX; that doesn't change the loop structure.  Note that clang only uses 2 vector accumulators, so it will bottleneck on FP add latency on most recent CPUs, if data is hot in L1d cache.  Skylake can keep up to 8 addpd in flight at once (2 per clock throughput with 4 cycle latency).  So ICC does a much better job for cases where (some of) the data is hot in L2 or especially L1d cache.
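The latency argument is visible at the source level too: multiple accumulators create independent dependency chains.  A scalar sketch (function name hypothetical; ICC's vectorized code does the same thing with more vector registers):

```c
#include <stddef.h>

// Four independent accumulators: the four FP adds per iteration don't
// depend on each other, so they can overlap in the pipeline instead of
// serializing on one accumulator's add latency.
double sum4(const double *a, size_t n) {   // assumes n is a multiple of 4
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);          // combine the chains at the end
}
```

(Note this changes the order of FP additions, so rounding can differ from a strictly serial sum; that's why compilers only do it with -ffast-math or similar, or with explicit source-level unrolling like this.)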
It's strange that clang didn't use a pointer-increment, if it's going to add/cmp anyway.  It would only take a couple extra instructions ahead of the loop, and would simplify the addressing modes allowing micro-fusion of the load even on Sandybridge.  (But it's not AVX, so Haswell and later can keep the load micro-fused.  Micro fusion and addressing modes).  GCC does that, but doesn't unroll at all, which is GCC's default without profile-guided optimization.
Anyway, ICC's AVX512 code will un-laminate into separate load and add uops in the issue/rename stage (or before being added to the IDQ, I'm not sure).  So it's pretty silly that it doesn't use a pointer increment to save front-end bandwidth, consume less ROB space for a larger out-of-order window, and be more hyperthreading-friendly.
Footnote 1:
(And even if it wasn't, an infinite loop with no side effects like a volatile or atomic access is undefined behaviour, so even with i <= n with a runtime-variable n, the compiler would be allowed to assume the loop wasn't infinite and thus i didn't wrap.  Is while(1); undefined behavior in C?)
In practice gcc and clang don't take advantage of this, and make a loop that actually is potentially infinite, and don't auto-vectorize because of that possible weirdness.  So avoid i <= n with runtime variable n, especially for unsigned compares.  Use i < n instead.
If unrolling in the source, i += 2 can have a similar effect: the compiler can't easily prove that i won't step past n and wrap instead of ever making the condition false.
So writing the end-pointer and pointer-increment explicitly in the source is often good, because that's often what's optimal in the asm anyway.