
Background

I ran out of space on /home/data and need to transfer /home/data/repo to /home/data2.

/home/data/repo contains 1M dirs, each of which contain 11 dirs and 10 files. It totals 2TB.

/home/data is on ext3 with dir_index enabled. /home/data2 is on ext4. Running CentOS 6.4.

I assume the approaches below are slow because repo/ has 1 million dirs directly underneath it.


Attempt 1: mv is fast but gets interrupted

I would have been done if this had finished:

/home/data> mv repo ../data2

But it was interrupted after 1.5TB was transferred. It was writing at about 1GB/min.

Attempt 2: rsync crawls after 8 hours of building file list

/home/data> rsync --ignore-existing -rv repo ../data2

It took several hours to build the 'incremental file list', and then it transferred at 100MB/min.

I cancel it to try a faster approach.

Attempt 3a: mv complains

Testing it on a subdirectory:

/home/data/repo> mv -f foobar ../../data2/repo/
mv: inter-device move failed: '(foobar)' to '../../data2/repo/foobar'; unable to remove target: Is a directory

I'm not sure what this error is about, but maybe cp can bail me out...

Attempt 3b: cp gets nowhere after 8 hours

/home/data> cp -nr repo ../data2

It reads the disk for 8 hours and I decide to cancel it and go back to rsync.

Attempt 4: rsync crawls after 8 hours of building file list

/home/data> rsync --ignore-existing --remove-source-files -rv repo ../data2

I used --remove-source-files thinking it might make it faster if I start cleanup now.

It takes at least 6 hours to build the file list, and then it transfers at 100-200MB/min.

But the server was burdened overnight and my connection closed.

Attempt 5: THERE'S ONLY 300GB LEFT TO MOVE, WHY IS THIS SO PAINFUL

/home/data> rsync --ignore-existing --remove-source-files -rvW repo ../data2

Interrupted again. The -W almost seemed to make "sending incremental file list" faster, which to my understanding shouldn't make sense. Regardless, the transfer is horribly slow and I'm giving up on this one.

Attempt 6: tar

/home/data> nohup tar cf - . |(cd ../data2; tar xvfk -)

Basically attempting to re-copy everything but ignoring existing files. It has to wade through 1.7TB of existing files, but at least it's reading at 1.2GB/min.

So far, this is the only command which gives instant gratification.

Update: interrupted again, somehow, even with nohup...
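
My guess after the fact: nohup only covered the first tar in that pipeline, so the extracting tar in the subshell still got SIGHUP when the session dropped, and the writer then died of SIGPIPE. Something like this (untested; the log file name is made up) should have protected both sides by putting the whole pipeline under one nohup'd shell:

/home/data> nohup sh -c 'tar cf - . | (cd ../data2 && tar xvfk -)' > tar-copy.log 2>&1 &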

Attempt 7: harakiri

Still debating this one

Attempt 8: scripted 'merge' with mv

The destination dir had about 120k empty dirs, so I ran:

/home/data2/repo> find . -type d -empty -exec rmdir {} \;

Ruby script:

SRC  = "/home/data/repo"
DEST = "/home/data2/repo"

# List the top-level entries on each side and diff them to find what's missing
`ls #{SRC}  --color=never > lst1.tmp`
`ls #{DEST} --color=never > lst2.tmp`
`diff lst1.tmp lst2.tmp | grep '<' > /home/data/missing.tmp`

t = `cat /home/data/missing.tmp | wc -l`.to_i
puts "Todo: #{t}"

# `mv` each missing directory into place
File.open('/home/data/missing.tmp').each do |line|
  dir = line.strip.gsub('< ', '')
  puts `mv #{SRC}/#{dir} #{DEST}/`
end

DONE.

Tim
  • 353

4 Answers

6

Ever heard of splitting large tasks into smaller tasks?

/home/data/repo contains 1M dirs, each of which contain 11 dirs and 10 files. It totals 2TB.

rsync -a /source/1/ /destination/1/
rsync -a /source/2/ /destination/2/
rsync -a /source/3/ /destination/3/
rsync -a /source/4/ /destination/4/
rsync -a /source/5/ /destination/5/
rsync -a /source/6/ /destination/6/
rsync -a /source/7/ /destination/7/
rsync -a /source/8/ /destination/8/
rsync -a /source/9/ /destination/9/
rsync -a /source/10/ /destination/10/
rsync -a /source/11/ /destination/11/

(...)

Coffee break time.
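
A sketch of the same idea that loops over the real top-level directory names instead of hard-coded numbers (untested; paths taken from the question). One rsync per directory keeps each file list tiny, at the cost of a million process launches:

cd /home/data/repo
find . -mindepth 1 -maxdepth 1 -type d -printf '%f\n' |
while IFS= read -r d; do
    rsync -a "/home/data/repo/$d/" "/home/data2/repo/$d/"
done

Batching several directories per rsync call would cut the per-process overhead further.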

4

This is what is happening:

  • Initially rsync will build the list of files.
  • Building this list is really slow, due to an initial sorting of the file list.
  • This can be avoided by using ls -f -1, either combining it with xargs to build the set of files rsync will use, or redirecting the output to a file containing the file list.
  • Passing this list to rsync instead of the folder will make rsync start working immediately (see the sketch after this list).
  • This trick of ls -f -1 over folders with millions of files is perfectly described in this article: http://unixetc.co.uk/2012/05/20/large-directory-causes-ls-to-hang/
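
A sketch of that approach with the question's paths (the list file name is arbitrary; note that --files-from disables the recursion normally implied by -a, so -r must be given explicitly):

ls -f -1 /home/data/repo | grep -vE '^\.{1,2}$' > /tmp/repo-list.txt    # unsorted listing; filter out . and ..
rsync -a -r --files-from=/tmp/repo-list.txt /home/data/repo/ /home/data2/repo/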
Dave
  • 25,513
maki
  • 141
1

Even if rsync is slow (why is it slow? maybe -z will help) it sounds like you've gotten a lot of it moved over, so you could just keep trying:

If you used --remove-source-files, you could then follow up by removing empty directories. --remove-source-files will remove all the files, but will leave the directories there.

Just make sure you DO NOT use --remove-source-files with --delete to do multiple passes.

Also, for increased speed, you can use --inplace.

If you're getting kicked out because you're trying to do this remotely on a server, go ahead and run this inside a 'screen' session. At least that way you can let it run.
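
Roughly, putting those pieces together (untested; the session name and paths are illustrative):

screen -S move-repo                 # named session that survives a dropped connection
# inside the screen session:
rsync -a --ignore-existing --remove-source-files --inplace /home/data/repo/ /home/data2/repo/
find /home/data/repo -depth -type d -empty -delete    # then clear out the emptied source dirs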

Angelo
  • 111
0

Could this not have been done using rsync with the --inc-recursive switch along with cron?

Even on a gigabit connection, it would take several hours to move 2 TB without any overhead. Rsync, mv or cp will all add varying amounts of overhead to the I/O, particularly if checksums or other validation is being done.

At least with the --inc-recursive switch, the transfer can start while the list of files is still being built.

I've been taught that --inplace improves speed and reduces space required on the destination, but at a slight reduction in file integrity -- I'd be interested to hear if this is not the case.

If a cron job was then created with whatever rsync settings are appropriate (and whatever is required to mount remote volumes), it could be set to run for a max of 5:58h (using --stop-after=358) and cron could start it every 6h. This way, if it randomly stopped, it would be started again automatically. --remove-source-files could be used with rsync, and find could be used first to delete empty source directories (perhaps decreasing the rsync run time to 5:50h in order to allow find to traverse all the directories).
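
A rough sketch of that setup (hypothetical script name and cron entry; note that --stop-after needs rsync 3.2.0 or newer, which is newer than what CentOS 6.4 ships, and incremental recursion is already the default when both ends run rsync 3.0+):

#!/bin/sh
# /usr/local/sbin/move-repo.sh, run from /etc/cron.d/move-repo every 6 hours:
#   0 */6 * * * root /usr/local/sbin/move-repo.sh
# Drop source dirs emptied by earlier runs, then resume the copy,
# stopping after 350 minutes so the next cron run doesn't overlap.
find /home/data/repo -depth -type d -empty -delete
exec rsync -a --ignore-existing --remove-source-files \
     --stop-after=350 /home/data/repo/ /home/data2/repo/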

I recognize that rsync was slower (as per the OP), but it seems to me that this would carry a lower risk of file corruption.

(full disclosure - I'm still learning, so if I'm way off base, please try to be gentle when you let me know...)

Doc C
  • 1