The key bottleneck is the load-use latency of cache misses while striding through tex[], as part of the dependency chain from old i to new i.  Lookups from the small rand_array[] will hit in cache and just add about 5 cycles of L1d load-use latency to the loop-carried dependency chain, making it even more unlikely for anything other than latency through i to be relevant to the overall speed of the loop.  (Jérôme Richard's answer on your previous question discusses that latency dep chain and the random stride increment.)
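For reference, the loop under discussion looks roughly like this (my reconstruction from the asm shown later in this answer, so treat it as a sketch; SLen is 10 in the code that generated that asm):

```c
#include <stdio.h>
#include <string.h>

// Sketch of the timed loop (the No.1 version, with the "useless" if(memcmp)).
// tex[] is a huge char array; rand_array[] is small enough to stay hot in L1d.
static int timed_loop(const char *tex, const char *str, const int *rand_array)
{
    int i = 0;
    while (i < 300000000) {
        if (memcmp(tex + i, str, 10) == 0)   // essentially never true
            puts("match");
        i += rand_array[tex[i]];   // dep chain: i -> tex[i] -> rand_array[] -> new i
    }
    return i;
}
```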
memcmp(tex+i, str, SLen) might be acting as prefetching if tex+i is near the end of a cache line.  If you compiled with -O0, you're calling glibc memcmp, so the SSE 16-byte or AVX2 32-byte load done in glibc memcmp crosses a cache-line boundary and pulls in the next line.  (IIRC, glibc memcmp checks for and avoids page crossing, but not cache-line splits, because those are relatively cheap on modern CPUs.  Single-step into it to see; into the 2nd call, to avoid lazy dynamic linking, or compile with -fno-plt.)
Possibly touching some cache lines that would have been skipped by the random stride helps HW prefetch be more aggressive and detect it as a sequential access pattern.  A 32-byte AVX load is half a cache line, so it's fairly likely to cross into the next one.
So a more-predictable sequence of cache-line requests seen by the L2 cache is a plausible explanation for the memcmp version being faster.  (Intel CPUs put the main prefetchers in the L2.)
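To put a number on "fairly likely" (this arithmetic is mine, not from the original analysis): an S-byte load at a uniformly-distributed offset within a 64-byte line crosses into the next line for S-1 of the 64 possible start offsets.

```c
#include <stdio.h>

int main(void)
{
    // A load of S bytes starting at line offset o (0..63) touches the next
    // 64-byte line iff o + S > 64, which is true for S-1 of the 64 offsets.
    for (int size = 8; size <= 32; size *= 2)
        printf("%2d-byte load: %2d/64 offsets cross a line boundary (%4.1f%%)\n",
               size, size - 1, 100.0 * (size - 1) / 64);
    return 0;
}
```

That's about 48% for glibc's 32-byte AVX2 loads and 23% for 16-byte SSE loads, vs. only about 11% for the 8-byte loads the inlined version uses (discussed below).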
If you compile with optimization, that 10-byte memcmp inlines and in the inner loop only touches 8 bytes.  (If they match, then it jumps to another block where it checks the last 2.  But it's exceedingly unlikely that ever happens).
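In C terms, the inlined version is roughly equivalent to this (a sketch of the strategy visible in the -O3 asm below, not GCC's literal output):

```c
#include <stdint.h>
#include <string.h>

// What GCC's inlined 10-byte memcmp boils down to: one 8-byte compare in the
// hot path, with the trailing 2 bytes only checked on a (rare) match.
static int first10_match(const char *p, const char *q)
{
    uint64_t a, b;
    memcpy(&a, p, 8);            // unaligned 8-byte loads (a single mov each)
    memcpy(&b, q, 8);
    if (a != b)
        return 0;                // hot path: taken essentially every iteration
    uint16_t c, d;               // cold block: bytes 8..9
    memcpy(&c, p + 8, 2);
    memcpy(&d, q + 8, 2);
    return c == d;
}
```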
That narrower load might be why -O0 is faster than -O3 on my system, with GCC 11.1 on i7-6700k Skylake with DDR4-2666 DRAM, on Arch GNU/Linux.  (Your question didn't specify which compiler and options your numbers were from, or which hardware.)
$ gcc -O3 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out 
No.1: 0.041710 s
No.2: 0.063072 s
No.1: 0.040457 s
No.2: 0.061166 s
 Performance counter stats for './a.out' (2 runs):
          1,843.71 msec task-clock                #    0.999 CPUs utilized            ( +-  0.14% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
            73,300      page-faults               #   39.757 K/sec                    ( +-  0.00% )
     6,607,078,798      cycles                    #    3.584 GHz                      ( +-  0.06% )
    18,614,303,386      instructions              #    2.82  insn per cycle           ( +-  0.00% )
    19,407,689,229      uops_issued.any           #   10.526 G/sec                    ( +-  0.02% )
    21,970,261,576      uops_executed.thread      #   11.916 G/sec                    ( +-  0.02% )
     5,408,090,087      mem_load_retired.l1_hit   #    2.933 G/sec                    ( +-  0.00% )
         3,861,687      mem_load_retired.l1_miss  #    2.095 M/sec                    ( +-  2.11% )
           1.84537 +- 0.00135 seconds time elapsed  ( +-  0.07% )
$ grep . /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
  ... same on all 8 logical cores
The perf counter data is basically meaningless; it's for the whole run time, not the timed regions (which are about 0.1 sec out of 1.8 sec), so the majority of the time spent here was in glibc random(3) which uses "a nonlinear additive feedback random number generator employing a default table of size 31 long integers".  Also in page-faults on the big malloc region.
It's only interesting in terms of the delta between the two builds, but the loops outside the timed regions contribute so many extra instructions and uops that it's still not as informative as one would like.
vs. gcc -O0: No.1 is faster, and No.2 is, as expected, a bit slower, since -O0 puts a store/reload of i into the loop-carried dependency chain.
$ gcc -O0 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out 
No.1: 0.028402 s
No.2: 0.076405 s
No.1: 0.028079 s
No.2: 0.069492 s
 Performance counter stats for './a.out' (2 runs):
          1,979.57 msec task-clock                #    0.999 CPUs utilized            ( +-  0.04% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
            66,656      page-faults               #   33.672 K/sec                    ( +-  0.00% )
     7,252,728,414      cycles                    #    3.664 GHz                      ( +-  0.02% )
    20,507,166,672      instructions              #    2.83  insn per cycle           ( +-  0.01% )
    22,268,130,378      uops_issued.any           #   11.249 G/sec                    ( +-  0.00% )
    25,117,638,171      uops_executed.thread      #   12.688 G/sec                    ( +-  0.00% )
     6,640,523,801      mem_load_retired.l1_hit   #    3.355 G/sec                    ( +-  0.01% )
         3,350,518      mem_load_retired.l1_miss  #    1.693 M/sec                    ( +-  1.39% )
         1.9810591 +- 0.0000934 seconds time elapsed  ( +-  0.00% )
Remember this profiling is for the total run time, not just the timed regions, so the 2.83 IPC isn't the IPC of the loops being measured.
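For context, the No.1/No.2 times come from wrapping just the loop in a timer, presumably something like this hypothetical clock_gettime() harness (the question's actual timing code isn't shown here):

```c
#include <stdio.h>
#include <time.h>

// Hypothetical timing harness: only the loop sits inside the timed region,
// unlike perf stat --all-user, which also counts the init loops, random(),
// and the page faults on the big allocation.
int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... the loop being measured goes here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("No.1: %f s\n", secs);
    return 0;
}
```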
Out-of-order exec hiding the cost of independent work
The throughput cost of running that macro-fused cmp/je uop doesn't add to any bottleneck, thanks to out-of-order exec; neither would a whole call memcmp@plt plus setting up the args.  The bottleneck is latency, not the front-end or the load ports in the back-end, and the OoO exec window is deep enough to hide the memcmp work.  (Yes, it takes a lot of reading to wrap your head around modern CPUs.)
The loop-carried dependency through i goes i -> tex[i] -> rand_array[tex[i]] -> new i for the next iteration, monotonically increasing it.  tex[] is too big to fit in cache, and HW prefetch won't keep up with those strides.
So DRAM bandwidth might be the limiting factor, or even DRAM latency if HW prefetch doesn't detect and lock on to the somewhat-sequential access pattern to consecutive or every-other cache line.
The actual branch predicts perfectly, so it can execute (and confirm the prediction) when the load result arrives, in parallel with the stuff waiting for the sign-extending byte load that started at the same place.
> and the execution times are exactly the same. ... it actually becomes a lot faster.
Huh?  Your times don't show them being exactly the same.  And yes, it does become measurably faster with that memcmp.
> In theory, it should be slower.
Only if your theory is too simplistic to model modern CPUs.  Not slowing it down is easily explainable because there is spare out-of-order exec throughput to do that independent work while waiting for the load latency.
Also possibly related for -O0 performance: the store/reload of i that -O0 adds to the loop-carried dependency chain (visible in the asm below).
For reference, here are the with-memcmp inner loops from GCC on Godbolt, matching what I benchmarked on my desktop.
# gcc11.1 -O0 -fPIE
.L10:                                           # do{
        mov     eax, DWORD PTR -16[rbp]
        movsx   rdx, eax
        mov     rax, QWORD PTR -32[rbp]
        lea     rcx, [rdx+rax]
        mov     rax, QWORD PTR -40[rbp]
        mov     edx, 10
        mov     rsi, rax
        mov     rdi, rcx
        call    memcmp@PLT
        test    eax, eax
        jne     .L9                             # normally taken jump over puts
        lea     rax, .LC0[rip]
        mov     rdi, rax
        call    puts@PLT
.L9:
        mov     eax, DWORD PTR -16[rbp]
        movsx   rdx, eax
        mov     rax, QWORD PTR -32[rbp]
        add     rax, rdx                       # strange not using an addressing mode, but ok
        movzx   eax, BYTE PTR [rax]
        movsx   eax, al                        # GCC -O0 is dumb,
        cdqe                                   # really dumb.
        mov     eax, DWORD PTR -576[rbp+rax*4]   # look up in rand_array[]
        add     DWORD PTR -16[rbp], eax          # i += ...
.L8:
        cmp     DWORD PTR -16[rbp], 299999999
        jle     .L10                             # }while(i<300000000)
vs. the -O3 code:
# RBP holds  char *tex at this point
.L10:                                        # do{
        movsx   r12, r13d                     # sign-extend i
        mov     rax, QWORD PTR [rbx]          # str[0..7] gets reloaded because alias analysis and the possible call to puts defeated the optimizer.  Avoiding malloc for it may have helped.
        add     r12, rbp                      # i+tex to avoid indexed addressing modes later?  Probably not optimal
        cmp     QWORD PTR [r12], rax          # first 8 bytes of the memcmp
        je      .L18                          # code at .L18 checks next 2 and maybe does puts, before jumping back
.L5:
        movsx   rax, BYTE PTR [r12]           # sign-extending byte load to pointer width, of tex[i]
        add     r13d, DWORD PTR [rsp+rax*4]   # i += look up in rand_array[]
        cmp     r13d, 299999999
        jle     .L10                          # }while(i < 300000000)
Since the je .L18 is never taken, this is 7 fused-domain uops per iteration, so Skylake's 4-wide front-end can comfortably issue it at under 2 cycles per iteration (7 / 4 = 1.75).
Even with L1d cache hits, the loop-carried dependency through i (R13D) is approximately:

- movsx r12, r13d: 1 cycle
- add r12, rbp: 1 cycle
- movsx rax, BYTE PTR [r12]: ~5 cycles of L1d load-use latency
- add r13d, DWORD PTR [rsp+rax*4]: ~5 cycles for the load plus 1 for the add

So a total of about 13 cycles of latency best case with L1d hits, leaving tons of free "slots" in the front-end and idle back-end execution units, such that even a call to the actual glibc memcmp would be no big deal.
(Of course the -O0 code is noisier, so some of those free slots in the pipeline are already used up, but the dep chain is even longer because -O0 code keeps i in memory: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?)
CPU frequency in memory-bottlenecked loops, especially on Skylake
The init loops provide plenty of time for the CPU to get to max turbo before the timed region, and to pre-fault the allocated pages by touching them first.  (See Idiomatic way of performance evaluation? for more on warm-up effects.)
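A minimal sketch of that kind of pre-faulting init (hypothetical helper; the question's actual init fills tex[] with random data, which is where most of the perf-counted time above went):

```c
#include <stdlib.h>

// Hypothetical init helper: writing every 4 KiB page pre-faults the whole
// allocation so the timed region doesn't pay for page faults, and the time
// this takes also lets the CPU ramp up to max turbo first.
static char *alloc_and_prefault(size_t bytes)
{
    char *buf = malloc(bytes);
    if (!buf)
        exit(1);
    for (size_t k = 0; k < bytes; k += 4096)
        buf[k] = 1;
    return buf;
}
```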
There might also be an effect of keeping CPU frequency higher if there's more work, less stalling, during those waits for incoming data from cache miss loads.  See Slowing down CPU Frequency by imposing memory stress.
In the above results, I tested with EPP = balance_performance.
I also tested with performance, and -O0 (libc memcmp) is still faster than -O3 (inlined memcmp), but all 4 loops did speed up some (with/without memcmp, and optimized or not).
| build / EPP | No.1 if(memcmp) | No.2 plain |
|---|---|---|
| -O0 / balance_performance | 0.028079 s | 0.069492 s |
| -O0 / performance | 0.026898 s | 0.053805 s |
| -O3 / balance_performance | 0.040457 s | 0.061166 s |
| -O3 / performance | 0.035621 s | 0.047475 s |
The faster-at--O0 effect is still present and very significant even with EPP=performance, so we know it's not just a matter of clock speed.  From previous experiments, performance keeps it pegged at max turbo even in memory-bound conditions where other settings would clock down from 4.2 to 2.7 GHz.  So it's very likely that calling memcmp is helping trigger better prefetching to reduce average cache-miss latency.
I unfortunately didn't use a fixed seed for the PRNG so there may be some variation due to the randomness being good vs. bad for the prefetcher, in terms of patterns of whole cache lines accessed.  And I just took one specific run for each (the 2nd one in the pair started by perf stat -r 2, so it should hopefully be even less affected by system fluctuation.)
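A fixed seed at the top of main() would have removed that source of run-to-run variation (hypothetical addition, not in what I benchmarked):

```c
#include <stdlib.h>

int main(void)
{
    srandom(12345);  // fixed seed: random() produces the same stride
                     // sequence every run, for reproducible comparisons
    /* ... rest of the benchmark unchanged ... */
    return 0;
}
```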
Looks like performance made a bigger difference to the No.2 loop (which has less going on in it), and also to the -O3 versions (again, less going on), which is consistent with Skylake lowering clock speed when a core is mostly just bottlenecked on memory rather than running many other instructions, under EPP settings other than performance.