perf stat has some named "metrics" that it knows how to calculate from other events. According to perf list on my system, those include L3_Cache_Access_BW and L3_Cache_Fill_BW:
- L3_Cache_Access_BW
    [Average per-core data access bandwidth to the L3 cache [GB / sec]]
- L3_Cache_Fill_BW
    [Average per-core data fill bandwidth to the L3 cache [GB / sec]]
This is from my system with a Skylake (i7-6700k). Other CPUs (especially from other vendors or architectures) might compute these metrics from different events, or might not support them at all.
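If you want to check whether your CPU / perf build exposes these metrics, grepping the perf list output for the description text works; the exact layout varies between perf versions, but something like this should find them:

$ perf list 2>/dev/null | grep -B1 'bandwidth to the L3'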
I tried it out on a simplistic Sieve of Eratosthenes (using a bool array, not a bitmap) from a recent codereview question, since I had a benchmarkable version of it (with a repeat loop) lying around. It measured 52 GB/s of total bandwidth (read + write, I think).
At one byte per bool, the n=4000000 problem size I used touches 4 MB in total, which is larger than the 256 KiB L2 but smaller than the 8 MiB L3 on this CPU.
$ echo 4000000 |
  taskset -c 3 perf stat --all-user -M L3_Cache_Access_BW -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions ./sieve-buggy
 Performance counter stats for './sieve-buggy':
     7,711,201,973      offcore_requests.all_requests #  816.916 M/sec                  
                                                  #    52.27 L3_Cache_Access_BW     
     9,441,504,472 ns   duration_time             #    1.000 G/sec                  
          9,439.41 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
             1,020      page-faults               #  108.058 /sec                   
    38,736,147,765      cycles                    #    4.104 GHz                    
    53,699,139,784      instructions              #    1.39  insn per cycle         
       9.441504472 seconds time elapsed
       9.432262000 seconds user
       0.000000000 seconds sys
Or with just -M L3_Cache_Access_BW and no -e events, it only shows offcore_requests.all_requests (#    54.52 L3_Cache_Access_BW) and duration_time. So the -M option overrides the default event set and doesn't count cycles, instructions, and so on.
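That is, something like this (same pipe and taskset as above):

$ echo 4000000 | taskset -c 3 perf stat --all-user -M L3_Cache_Access_BW ./sieve-buggy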
I think it's just counting all off-core requests made by this core, assuming (correctly) that each one involves a 64-byte transfer. A request is counted whether it hits or misses in L3 cache, so getting mostly L3 hits obviously enables higher bandwidth than if the uncore bottlenecked on the DRAM controllers instead.
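As a sanity check on that interpretation (the 64 bytes per request factor is my assumption, not something the output states), the raw counts above reproduce the reported number: bytes divided by nanoseconds is numerically GB/s, and duration_time is already in ns.

$ echo 'scale=2; 7711201973 * 64 / 9441504472' | bc    # requests * 64 B, divided by ns of wall-clock time
52.27

That lands right on the 52.27 L3_Cache_Access_BW figure (whether perf divides by duration_time or task-clock internally, they're nearly identical in this single-threaded run).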