I know du -sh in common Linux filesystems. But how do I do that with HDFS?
 
    
12 Answers
Prior to 0.20.203, and officially deprecated in 2.6.0:
hadoop fs -dus [directory]
Since 0.20.203 and 1.0.4, and still compatible through 2.6.0:
hdfs dfs -du [-s] [-h] URI [URI …]
You can also run hadoop fs -help for more info and specifics.
 
    
- -du -s (-dus is deprecated) – Carlos Rendon Jan 03 '13 at 22:11
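If you are not sure which of the two forms your release supports, the built-in help for the du sub-command lists the flags it accepts. A quick sketch, assuming hadoop and hdfs are on your PATH:
hadoop fs -help du
hdfs dfs -help du    # equivalent entry point on newer releases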
hadoop fs -du -s -h /path/to/dir displays a directory's size in readable form.
 
    
- For newer versions of hdfs, `hdfs dfs -du -s -h /path/to/dir` is more appropriate. – Adelson Araújo Nov 05 '19 at 18:42
Extending Matt D's and other answers, the command as of Apache Hadoop 3.0.0 is
hadoop fs -du [-s] [-h] [-v] [-x] URI [URI ...]
It displays sizes of files and directories contained in the given directory or the length of a file in case it's just a file.
Options:
- The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, the calculation is done by going 1-level deep from the given path.
- The -h option will format file sizes in a human-readable fashion (e.g. 64.0m instead of 67108864).
- The -v option will display the names of columns as a header line.
- The -x option will exclude snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path.
du returns three columns with the following format:
 +-------------------------------------------------------------------+ 
 | size  |  disk_space_consumed_with_all_replicas  |  full_path_name | 
 +-------------------------------------------------------------------+ 
Example command:
hadoop fs -du /user/hadoop/dir1 \
    /user/hadoop/file1 \
    hdfs://nn.example.com/user/hadoop/dir1 
Exit Code: Returns 0 on success and -1 on error.
 
    
- +1 for the information about the results! I didn't understand why I was getting two results (size and disk_space) instead of one. Thanks! – Ric S Mar 02 '21 at 13:26
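Because the first column is the logical size and the second is the disk space consumed across all replicas, you can pick out just one of them with a small awk pipeline. A sketch, reusing the /user/hadoop/dir1 path from the example above and assuming your release prints all three columns as described:
hadoop fs -du -s /user/hadoop/dir1 | awk '{ print $1 }'    # logical size in bytes
hadoop fs -du -s /user/hadoop/dir1 | awk '{ print $2 }'    # bytes consumed including replication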
With this you will get the size in GB:
hdfs dfs -du PATHTODIRECTORY | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
 
    
- hdfs dfs -du PATHTODIRECTORY | awk '/^[0-9]+/ { print int($1/(1024**3) " [GB]\t" $2 }' - Please update your command. Two closing bracket after 1024**3. It should be only 1 – gubs Sep 14 '18 at 14:40
When trying to calculate the total of a particular group of files within a directory, the -s option does not work (in Hadoop 2.7.1). For example:
Directory structure:
some_dir
├abc.txt    
├count1.txt 
├count2.txt 
└def.txt    
Assume each file is 1 KB in size. You can summarize the entire directory with:
hdfs dfs -du -s some_dir
4096 some_dir
However, if I want the sum of all files containing "count", the command falls short.
hdfs dfs -du -s some_dir/count*
1024 some_dir/count1.txt
1024 some_dir/count2.txt
To get around this I usually pass the output through awk.
hdfs dfs -du some_dir/count* | awk '{ total+=$1 } END { print total }'
2048 
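The same awk-style pipeline generalizes to other aggregations. For example, a sketch (using the hypothetical some_dir from above) that lists entries largest first by sorting numerically on the size column:
hdfs dfs -du some_dir | sort -rn | head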
 
    
The easiest way to get the folder size in a human-readable format is
hdfs dfs -du -h /folderpath
where -s can be added to get the total sum.
 
    
To get the size of a directory, hdfs dfs -du -s -h /$yourDirectoryName can be used. hdfs dfsadmin -report can be used to see a quick cluster-level storage report.
 
    
- The -s did the trick; otherwise it gave me a full list of files which I then had to tally up. – Hein du Plessis Jul 01 '21 at 19:53
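The dfsadmin report mentioned above can get long on a big cluster. A sketch that keeps only the cluster-wide summary at the top (the exact layout varies between releases, and you may need to run it as the HDFS superuser):
hdfs dfsadmin -report | head -n 10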
% of used space on the Hadoop cluster:
sudo -u hdfs hadoop fs -df
Capacity under a specific folder:
sudo -u hdfs hadoop fs -du -h /user
 
    
- I got an error with "hdfs", the way it worked for me was: `hadoop fs -du -h /user` (I didn't need to use `sudo`) – diens Jan 04 '19 at 15:23
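-df also accepts -h, so the cluster-level figures can be printed in human-readable units as well. A sketch (as the comment above notes, sudo -u hdfs is only needed when your own user lacks access):
sudo -u hdfs hadoop fs -df -h /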
hdfs dfs -count <dir>
Info from the man page:
-count [-q] [-h] [-v] [-t [<storage type>]] [-u] <path> ... :
  Count the number of directories, files and bytes under the paths
  that match the specified file pattern.  The output columns are:
  DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  or, with the -q option:
  QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA
        DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
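For example, a sketch against a hypothetical /user/hadoop path; -q adds the quota columns, and -h (on releases that support it for -count) prints the byte counts in human-readable form:
hdfs dfs -count -q -h /user/hadoop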
In case someone needs it the pythonic way :)
- Install the hdfs python package: pip install hdfs
- Code:
  from hdfs import InsecureClient
  # NameNode web port; Hadoop 3.x typically uses 9870 instead of 50070
  client = InsecureClient('http://hdfs_ip_or_nameservice:50070', user='hdfs')
  folder_info = client.content("/tmp/my/hdfs/path")
  # prints the folder/directory size in bytes
  print(folder_info['length'])
 
    
The command should be hadoop fs -du -s -h /dirPath
- -du [-s] [-h] ... : Show the amount of space, in bytes, used by the files that match the specified file pattern.
- -s : Rather than showing the size of each individual file that matches the pattern, shows the total (summary) size.
- -h : Formats the sizes of files in a human-readable fashion rather than a number of bytes (e.g. MB/GB/TB).
- Note that, even without the -s option, this only shows size summaries one level deep into a directory.
- The output is in the form: size name(full path)
 
    