When I build this example function:
unsigned summer( unsigned accum, int * ptr, unsigned N )
{
    for(unsigned i = 0; i < N; ++i )
    {
        accum += *ptr++;
    }
    return accum;
}
with Compiler Explorer's ARM gcc 8.5 (linux) and CFLAGS="-O3 -Wall -Wextra -mcpu=cortex-m4 -falign-loops=4", at first I don't see any evidence of loop alignment:
summer(unsigned int, int*, unsigned int):
    cbz     r2, .L9
    push    {r4}
    movs    r3, #0
.L3:
    ldr     r4, [r1], #4
    adds    r3, r3, #1
    cmp     r2, r3
    add     r0, r0, r4
    bne     .L3
    pop     {r4}
    bx      lr
.L9:
    bx      lr
After unchecking "Filter->Directives" I see a lot more. Here is just the function, with unrelated directives removed by hand:
summer(unsigned int, int*, unsigned int):
    cbz     r2, .L9
    push    {r4}
    movs    r3, #0
.LVL1:
    .p2align 2    @ my note: align the next instruction to 2^2 = 4 bytes (.p2align takes the power-of-two exponent)
.L3:
    ldr     r4, [r1], #4
    adds    r3, r3, #1
    cmp     r2, r3
    add     r0, r0, r4
    bne     .L3
    pop     {r4}
    bx      lr
.L9:
    bx      lr
But we don't really see the effect of .p2align yet. After re-enabling "Filter->Directives" and also checking "Output->Compile to binary object", we see the extra NOP that -falign-loops=4 inserts:
summer(unsigned int, int*, unsigned int):
    cbz     r2, 18 <summer(unsigned int, int*, unsigned int)+0x18>
    push    {r4}
    movs    r3, #0
    nop
    ldr.w   r4, [r1], #4
    adds    r3, #1
    cmp     r2, r3
    add     r0, r4
    bne.n   8 <summer(unsigned int, int*, unsigned int)+0x8>
    pop     {r4}
    bx      lr
    bx      lr
    nop
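To sanity-check where that NOP comes from, here is a minimal sketch of the padding arithmetic that `.p2align` asks the assembler to do. The helper names `p2align_padding` and `p2align_next` are made up for illustration. It assumes that `cbz`, `push` and `movs` each use a 16-bit Thumb encoding, which would put the loop's first instruction at offset 6 without padding. That is consistent with the `bne.n` target of `+0x8` in the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of padding bytes needed to move `offset`
 * up to the next multiple of 2^log2_align, i.e. what `.p2align log2_align`
 * requests at that point in the section. */
static size_t p2align_padding(size_t offset, unsigned log2_align)
{
    assert(log2_align < sizeof(size_t) * 8);   /* keep the shift well-defined */
    size_t align = (size_t)1 << log2_align;    /* .p2align 2 -> 4 bytes */
    /* Distance to the next boundary; 0 if already aligned. */
    return (align - (offset & (align - 1))) & (align - 1);
}

/* Hypothetical helper: offset of the first instruction after the padding. */
static size_t p2align_next(size_t offset, unsigned log2_align)
{
    return offset + p2align_padding(offset, log2_align);
}
```

Under those assumptions, `p2align_padding(6, 2)` is 2 bytes (one 16-bit `nop`) and `p2align_next(6, 2)` is 8, matching the loop start in the disassembly. If the loop label had already landed on a 4-byte boundary, the padding would be 0 and no `nop` would appear.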
Now that we see what it is, could we improve it? Perhaps some cores would prefer that we combine "movs r3, #0" and "nop" into a single 32-bit instruction, "movs.w r3, #0". As it stands, the NOP costs us only once per function call, whereas a misaligned 32-bit instruction at the top of the loop would incur its penalty on every iteration.