35

I've got a filesystem which has a couple million files and I'd like to see a distribution of file sizes recursively in a particular directory. I feel like this is totally doable with some bash/awk fu, but could use a hand. Basically I'd like something like the following:

1KB: 4123
2KB: 1920
4KB: 112
...
4MB: 238
8MB: 328
16MB: 29138
Count: 320403345

I feel like this shouldn't be too bad given a loop and some conditional log2 filesize foo, but I can't quite seem to get there.

Related Question: How can I find files that are bigger/smaller than x bytes?

notpeter
  • 1,207

8 Answers

44

This seems to work pretty well:

find . -type f -print0 | xargs -0 ls -l | awk '{size[int(log($5)/log(2))]++}END{for (i in size) printf("%10d %3d\n", 2^i, size[i])}' | sort -n

Its output looks like this:

         0   1
         8   3
        16   2
        32   2
        64   6
       128   9
       256   9
       512   6
      1024   8
      2048   7
      4096  38
      8192  16
     16384  12
     32768   7
     65536   3
    131072   3
    262144   3
    524288   6
   2097152   2
   4194304   1
  33554432   1
 134217728   4
The number on the left is the lower limit of a range from that value to twice that value, and the number on the right is the number of files in that range.
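
A minimal variant of the same bucketing, assuming GNU find (whose -printf can emit raw byte sizes directly), which also skips empty files so that log(0) never comes up:

# Read sizes straight from GNU find and bucket by floor(log2(size)) as above.
find . -type f -printf '%s\n' |
  awk '$1 > 0 { size[int(log($1)/log(2))]++ }
       END    { for (i in size) printf("%10d %10d\n", 2^i, size[i]) }' |
  sort -n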

garyjohn
  • 36,494
31

Based on garyjohn's answer, here is a one-liner which also formats the output to be human-readable:

find . -type f -print0 | xargs -0 ls -l | awk '{ n=int(log($5)/log(2)); if (n<10) { n=10; } size[n]++ } END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' | sort -n | awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } } { a[1]=$1; a[2]=0; human(a); printf("%3d%s: %6d\n", a[1],substr("kMGTPEZY",a[2]+1,1),$2) }'

Here is the expanded version of it:

find . -type f -print0                                                  \
 | xargs -0 ls -l                                                       \
 | awk '{ n=int(log($5)/log(2));                                        \
          if (n<10) n=10;                                               \
          size[n]++ }                                                   \
      END { for (i in size) printf("%d %d\n", 2^i, size[i]) }'          \
 | sort -n                                                              \
 | awk 'function human(x) { x[1]/=1024;                                 \
                            if (x[1]>=1024) { x[2]++;                   \
                                              human(x) } }              \
        { a[1]=$1;                                                      \
          a[2]=0;                                                       \
          human(a);                                                     \
          printf("%3d%s: %6d\n", a[1],substr("kMGTPEZY",a[2]+1,1),$2) }'

In the first awk I set a minimum bucket so that all files smaller than 1 kB are collected in one place. In the second awk, the function human(x) is defined to produce a human-readable size. This part is based on one of the answers here: https://unix.stackexchange.com/questions/44040/a-standard-tool-to-convert-a-byte-count-into-human-kib-mib-etc-like-du-ls1
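
To see what the human() helper does on its own, a quick throwaway test (3145728 bytes, i.e. 3 MiB, is just an arbitrary input):

awk 'function human(x) { x[1]/=1024; if (x[1]>=1024) { x[2]++; human(x) } }
     BEGIN { a[1]=3145728; a[2]=0; human(a);
             printf("%d%s\n", a[1], substr("kMGTPEZY", a[2]+1, 1)) }'

which prints 3M.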

The sample output of the full pipeline looks like:

  1k:    335
  2k:     16
 32k:      5
128k:     22
  1M:     54
  2M:     11
  4M:     13
  8M:      3
dzsuz87
  • 411
4

Try this:

find . -type f -exec ls -lh {} \; | 
 gawk '{match($5,/([0-9.]+)([A-Z]+)/,k); if(!k[2]){print "1K"} \
        else{printf "%.0f%s\n",k[1],k[2]}}' | 
sort | uniq -c | sort -hk 2 

OUTPUT:

 38 1K
 14 2K
  1 30K
  2 62K
 12 2M
  2 3M
  1 31M
  1 46M
  1 56M
  1 75M
  1 143M
  1 191M
  1 246M
  1 7G

EXPLANATION:

  • find . -type f -exec ls -lh {} \; : simple enough, find the files under the current directory and run ls -lh on each of them. (With a couple of million files, -exec ls -lh {} + would batch many files per ls invocation and be noticeably faster.)

  • match($5,/([0-9.]+)([A-Z]+)/,k); : this extracts the human-readable size from ls's fifth field and saves the captured groups (the number and its unit) into the array k; see the standalone example after this list.

  • if(!k[2]){print "1K"} : if k[2] is undefined, the size had no unit suffix, so the file is smaller than 1K. Since I am imagining you don't care about such tiny sizes, the script prints 1K for all of them.

  • else{printf "%.0f%s\n",k[1],k[2]} : if the file is 1K or larger, round its size to the closest integer and print it along with its unit (K, M, or G).

  • sort | uniq -c : count the occurrences of each line (file size) printed.

  • sort -hk 2 : sort on the second field, understanding human-readable suffixes, so that 7G sorts after 8M.
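
A standalone illustration of gawk's three-argument match() filling a capture array (the "3.2M" input is just an assumed example):

$ echo "3.2M" | gawk '{ match($1, /([0-9.]+)([A-Z]+)/, k); print k[1], k[2] }'
3.2 M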

terdon
  • 54,564
2

Building on a couple of the earlier answers, here is a (longer) script with more detailed output.

$ cat ~/bin/fszdist
#!/bin/bash

AWK_SCRIPT='
# prettyfier
function psize(s) {
    kilo = 0; mega = 0; giga = 0;
    if (s >= 1024) { s /= 1024; kilo = 1; }
    if (s >= 1024) { s /= 1024; mega = 1; }
    if (s >= 1024) { s /= 1024; giga = 1; }
    suffix = (giga == 1) ? "g" : ((mega == 1) ? "m" : ((kilo == 1) ? "k" : ""));
    return int(100*s + 0.5)/100 suffix
}

{ i = ($5 == 0) ? -1 : int(log($5)/log(2)); ++sz[i]; bytes[i] += $5 }

END {
    acc = 0; bs_acc = 0; n = length(sz);
    printf("%11s %10s %10s %10s %10s\n", "[n, 2*n)", "count", "acc_count", "bytes", "acc_bytes");
    for (i = -1; i < n; ++i) {
        acc += sz[i]; bs_acc += bytes[i];
        p = (i == -1) ? 0 : 2^i;
        printf("%11s %10d %10d %10s %10s\n", psize(p), sz[i], acc, psize(bytes[i]), psize(bs_acc));
    }
}
'

find . -type f -print0 |
xargs -0 ls -l |
awk "$AWK_SCRIPT"

Running it on a directory cloned from a GitHub repository:

$ fszdist
   [n, 2*n)      count  acc_count      bytes  acc_bytes
          0        329        329          0          0
          1         29        358         29         29
          2         53        411        121        150
          4          4        415         24        174
          8         41        456        489        663
         16         62        518      1.39k      2.04k
         32        829       1347     38.33k     40.37k
         64        716       2063     69.21k    109.58k
        128       2013       4076    394.55k    504.13k
        256       3761       7837      1.37m      1.86m
        512       5758      13595      4.15m      6.02m
         1k      10728      24323     15.93m     21.95m
         2k      15539      39862     42.84m     64.78m
         4k      10321      50183     59.33m    124.11m
         8k       8694      58877     90.85m    214.96m
        16k       3585      62462     75.22m    290.18m
        32k       3741      66203    154.93m    445.11m
        64k        571      66774     47.59m     492.7m
       128k        454      67228     79.71m    572.41m
       256k        235      67463     88.38m    660.79m
       512k        109      67572     77.87m    738.65m
         1m         58      67630     88.43m    827.08m
         2m         68      67698    179.28m   1006.36m
         4m         44      67742    245.79m      1.22g
         8m          9      67751    102.13m      1.32g
        16m         14      67765     287.1m       1.6g
        32m          8      67773    392.46m      1.99g
        64m         11      67784      1.02g         3g
       128m         11      67795      1.83g      4.83g
       256m          2      67797     583.2m       5.4g
       512m          3      67800      1.72g      7.13g
         1g          0      67800          0      7.13g
2

Based on dzsuz87's answer, which in turn was based on garyjohn's answer (credit to both!), I thought it would be good to add an actual histogram to the output, to make the results easier to visualize when there is a large range of file sizes. Still effectively a one-liner, my new code looks like:

find . -type f -print0 | xargs -0 ls -l | awk '
    {
            n=int(log($5)/log(2));
            if (n<10) { n=10; }
            size[n]++
    }
    END {
            for (i in size)
                    printf("%d %d\n", 2^i, size[i])
    }' | sort -n | awk -v COLS=$( tput cols ) -v PAD=13 -v MAX=0 '
    function human(x) {
            x[1]/=1024;
            if (x[1]>=1024) { x[2]++; human(x) }
    }
    {
            if($2>MAX)MAX=$2;
            a[$1]=$2
    }
    END {
            PROCINFO["sorted_in"] = "@ind_num_asc";
            for (i in a){
                    h[1]=i;
                    h[2]=0;
                    human(h);
                    bar=sprintf("%-*s", a[i]/MAX*(COLS-PAD), "");
                    gsub(" ", "-", bar);
                    printf("%3d%s: %6d %s\n", h[1], \
                         substr("kMGTEPYZ",h[2]+1,1), a[i], bar)
            }
    }'

Sample output:

  1k:    505 --------
  2k:     45
  4k:   4609 --------------------------------------------------------------------------
  8k:   2177 ----------------------------------
 16k:    325 -----
 32k:    642 ----------
 64k:   2262 ------------------------------------
128k:   2547 ----------------------------------------
256k:    977 ---------------
512k:    434 ------
  1M:    550 --------
  2M:   1076 -----------------
  4M:   2028 --------------------------------
  8M:   2362 -------------------------------------
 16M:   1814 -----------------------------
 32M:    989 ---------------
 64M:    366 -----
128M:     86 -
256M:     16
512M:      1

Note that this requires gawk rather than plain awk (PROCINFO["sorted_in"] is a gawk extension), and it uses tput to read your terminal width so the histogram can be scaled to fit.
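
If this is run where there is no terminal (from cron, say, or with the output redirected by a wrapper script), tput cols can fail; one guard, assuming a fallback width of 80 columns, is to compute the width up front:

COLS=$( tput cols 2>/dev/null || echo 80 )   # assume 80 columns if tput cannot tell

and pass -v COLS="$COLS" to the second awk instead of -v COLS=$( tput cols ).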

1

I stumbled upon this question because I also wanted to view a distribution of my file sizes. In my case, however, I had no need for power-of-2 buckets. I used a different bash command to view the file size distribution:

ls -URs1Q --block-size=M | cut -d\" -f1 | tr -d ' ' | sort -n | uniq -c

Explaining the options:

  • U: do not sort files, which makes it quicker

  • R: recursive, in case you want to include nested directories

  • s: print the size of each file

  • 1: print each entry on a single line, to avoid columns

  • Q: quote the file name, so we can use the quote as a delimiter

  • --block-size=M: scale sizes to MB

  • cut -d\" -f1: cut at the first quote and keep the first field, i.e. the size

  • tr -d ' ': delete all space characters

  • sort -n: sort the values numerically

  • uniq -c: collapse repeated values, prefixing each with its count

This will show a result like:

     28 0M
 228602 1M
   1393 2M
    238 3M
    107 4M
     82 5M
     41 6M
     32 7M
     33 8M
     24 9M
     24 10M
     15 11M
     20 12M
     15 13M
     14 14M
     19 15M
      8 16M
     13 17M
      6 18M
      7 19M
      4 20M
      6 21M
      2 22M
      1 23M
      4 24M
      4 25M
      4 27M
      1 29M
      2 30M
      1 2239M

Because of the recursion, ls prints a "total" line for each directory; you can omit these with:

ls -URs1Q --block-size=M | cut -d\" -f1 | tr -d ' ' | sort -n | uniq -c | grep -v total

This does not answer the OP's question exactly (there are no power-of-two buckets), but it may help others looking for a similar solution.

Marja
  • 111
0

Here's the code if you'd like a distribution by line count instead of raw byte size (the $2 != "total" filter drops the per-batch total lines that wc prints when xargs hands it multiple files):

find . -type f -print0 | xargs -0 wc -l | awk '$2 != "total" {size[int(log($1)/log(2))]++}END{for (i in size) printf("%10d %3d\n", 2^i, size[i])}' | sort -n
Danon
  • 101
0

I had to do something similar, but on about a petabyte of data, and since running a single find would take ages (a week by my estimate), I ran find in parallel on subdirectories. This may be useless in some cases, but it worked for me; here is the script I used for the task: file size distribution stats
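
A minimal sketch of that idea, assuming GNU find and GNU xargs (for -printf, -P and -I), with 8 parallel jobs and a split at the top-level subdirectories as arbitrary choices:

# One bucketing job per top-level subdirectory, 8 at a time; each job prints
# partial "bucket count" pairs, which are merged at the end. Files that sit
# directly in . itself are not counted.
find . -mindepth 1 -maxdepth 1 -type d -print0 |
  xargs -0 -P 8 -I{} sh -c '
    find "$1" -type f -printf "%s\n" |
      awk "\$1 > 0 { b[int(log(\$1)/log(2))]++ } END { for (i in b) print i, b[i] }"
  ' sh {} |
  awk '{ size[$1] += $2 } END { for (i in size) printf("%10d %10d\n", 2^i, size[i]) }' |
  sort -n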

mestia
  • 292