
I have over 10000 files in a folder. I was using an Rscript to preprocess the files. It displayed an error:

Error in read.table(wd, comment.char ="#", header=T, sep='\t'): empty beginning of file

When I opened the file in a text editor, it appeared empty, but its size was around 4 MB. Next, I opened the file in Notepad++ and saw that its content was NULL NULL NULL ... NULL

File example

I want to move files of this kind from the folder to another folder. How can I accomplish this?

Destroy666
svp

1 Answer


Testing a single file

grep in the following command will return exit status 0 if some_file contains at least one null character:

<some_file tr -dc '\0' | tr '\0' '\n' | grep -q ''
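
To see the exit status in action, the pipeline can be wrapped in a small function and run against two throwaway files. This is only a sketch: the function name has_nul and the temporary file names are made up for illustration, and GNU tr and grep are assumed.

```shell
# has_nul FILE: exit status 0 if FILE contains at least one null character.
# The name "has_nul" is hypothetical; it only wraps the pipeline shown above.
has_nul() {
   <"$1" tr -dc '\0' | tr '\0' '\n' | grep -q ''
}

tmp=$(mktemp -d)
printf 'a\tb\n1\t2\n' >"$tmp/text.tsv"        # ordinary text, no null characters
printf '\000\000\000\000' >"$tmp/nulls.tsv"   # four null characters, nothing else

for f in "$tmp/text.tsv" "$tmp/nulls.tsv"; do
   if has_nul "$f"; then
      echo "${f##*/}: contains a null character"
   else
      echo "${f##*/}: no null character"
   fi
done
rm -r "$tmp"
```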

Unless the shell option pipefail is set, the exit status of grep becomes the exit status of the whole pipeline, once both trs exit. pipefail is unset by default, and you want it this way (see what may happen otherwise).

I wrote "once both trs exit" because after grep exits, the second tr needs to write something in order to get SIGPIPE; then the first tr needs to write something in order to get SIGPIPE; only then is the pipeline considered terminated. The first tr may keep reading even though grep has exited early and the outcome is already known. If some_file is a special file generating a never-ending stream of bytes (similar to e.g. /dev/urandom) and there are not enough null bytes in the stream, then the pipeline will never exit. For a regular file, the worst-case scenario is that the first tr exits only after reading the whole file, so with a regular file the trs will certainly exit eventually.
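
The effect of pipefail can be observed with a stream that is much larger than a pipe buffer. In the sketch below, grep exits after the first line, the trs are terminated while they still have data to write, and with pipefail their non-zero status replaces the 0 from grep. The function names and the 16 MiB size are arbitrary choices made for this illustration; GNU head, tr and grep and a bash in PATH are assumed.

```shell
# Both functions run the same pipeline in a fresh bash on 16 MiB of null
# characters taken from /dev/zero; only the pipefail setting differs.
# The function names are hypothetical.
status_without_pipefail() {
   bash -c 'head -c 16777216 /dev/zero | tr -dc "\0" | tr "\0" "\n" | grep -q ""; echo "$?"'
}
status_with_pipefail() {
   bash -c 'set -o pipefail
            head -c 16777216 /dev/zero | tr -dc "\0" | tr "\0" "\n" | grep -q ""; echo "$?"'
}

# Without pipefail the reported status is that of grep alone.
echo "pipefail unset: $(status_without_pipefail)"
# With pipefail it is the status of a tr that was terminated early
# (typically 141, i.e. 128 + SIGPIPE).
echo "pipefail set:   $(status_with_pipefail)"
```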

This answer of mine explains a trick you can use to speed things up. In your case the trick will leave tr(s) in the background. Since you're going to test many files, piling up trs is not a good idea.

In practice it's often enough to test the very beginning of a file. The following command will read up to 2 KiB of some_file and analyze only this part:

head -c 2048 some_file | tr -dc '\0' | tr '\0' '\n' | grep -q ''
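
The price of reading only the beginning is that a null character further into the file goes unnoticed. A minimal sketch of that limitation, with hypothetical helper names and a made-up test file:

```shell
# Hypothetical helpers: whole-file test versus "first 2 KiB only" test.
has_nul() { <"$1" tr -dc '\0' | tr '\0' '\n' | grep -q ''; }
has_nul_head() { head -c 2048 "$1" | tr -dc '\0' | tr '\0' '\n' | grep -q ''; }

tmp=$(mktemp -d)
# 4096 bytes of the letter x first, then a single null character.
{ head -c 4096 /dev/zero | tr '\0' 'x'; printf '\000'; } >"$tmp/late_nul"

has_nul "$tmp/late_nul" && echo "whole file: null character found"
has_nul_head "$tmp/late_nul" || echo "first 2 KiB: no null character found"
rm -r "$tmp"
```

For files that consist of null characters from the very first byte, like the one in the question, the shorter test is sufficient.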

Alternatively, you can use the command file; for a big file it won't read the whole file either. The following command returns exit status 0 if file --mime-type does not print text/whatever:

! file --brief --mime-type some_file | grep -q 'text/'

I expect the two commands to agree in the vast majority of cases; there may be files for which they differ, though.
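
One case where I would expect the two tests to disagree is an empty file: it contains no null character, but file (at least the version I assume Ubuntu ships) reports inode/x-empty rather than text/whatever. The sketch below runs both tests on three made-up files; the helper names nul_test and mime_test are hypothetical.

```shell
# Hypothetical helper names; each wraps one of the two tests from above.
nul_test()  { head -c 2048 "$1" | tr -dc '\0' | tr '\0' '\n' | grep -q ''; }
mime_test() { ! file --brief --mime-type "$1" | grep -q 'text/'; }

tmp=$(mktemp -d)
printf '\000\000\000\000' >"$tmp/nulls"   # only null characters
printf 'a\tb\n' >"$tmp/text"              # plain text
: >"$tmp/empty"                           # zero bytes

for f in "$tmp/nulls" "$tmp/text" "$tmp/empty"; do
   nul_test "$f"  && n=match || n=no-match
   mime_test "$f" && m=match || m=no-match
   echo "${f##*/}: nul_test=$n mime_test=$m ($(file --brief --mime-type "$f"))"
done
rm -r "$tmp"
```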


Testing many files (and moving accordingly)

This snippet will loop over files in the current working directory, test regular files and move them accordingly:

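#!/bin/bash
# Move regular files whose first 2 KiB contain a null character.
(
# With nullglob, a glob without matches expands to nothing.
shopt -s nullglob
for f in ./*; do
   [ -f "$f" ] \
   && ! [ -L "$f" ] \
   && head -c 2048 "$f" | tr -dc '\0' | tr '\0' '\n' | grep -q '' \
   && mv -v "$f" /target/directory/
done
)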
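
With over 10000 files it may be worth previewing what would be moved before moving anything. A dry-run sketch, assuming the same test as in the loop; the function name list_nul_files is hypothetical, and it only prints names:

```shell
# list_nul_files DIR: print regular, non-symlink files in DIR whose first
# 2 KiB contain a null character. Nothing is moved. Hypothetical helper name.
# The function body is a subshell, so the cd does not affect the caller.
list_nul_files() (
   cd -- "$1" || exit 1
   for f in ./*; do
      [ -f "$f" ] \
      && ! [ -L "$f" ] \
      && head -c 2048 "$f" | tr -dc '\0' | tr '\0' '\n' | grep -q '' \
      && printf '%s\n' "$f"
   done
   exit 0
)

tmp=$(mktemp -d)
printf '\000\000' >"$tmp/bad.tsv"
printf 'ok\n' >"$tmp/good.tsv"
list_nul_files "$tmp"    # prints ./bad.tsv
rm -r "$tmp"
```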

Notes:

  • Create /target/directory/ beforehand.

  • You can use the other test. The relevant line will be:

       && ! file --brief --mime-type "$f" | grep -q 'text/' \
    
  • The subshell (…) is in case you want to paste the code into an interactive shell. Thanks to the subshell, the code won't change anything in your current shell.

  • Normally * does not match hidden files. Append dotglob to the shopt -s line to make * match hidden files.

  • If you want recursion, append globstar to the shopt -s line and use ./** instead of ./*. Be careful: files with identical names from different subdirectories all end up in the same target directory, so one may overwrite another and you may lose data; consider mv -i.

  • We want to conditionally move regular files. [ -f "$f" ] checks if we're dealing with a regular file; but it also succeeds for a symlink to (a symlink to (a symlink to (…))) a regular file. This is the reason we additionally check if the file is not a symlink (! [ -L "$f" ]). If you want the code to treat symlinks to regular files like regular files then delete the whole line containing [ -L (including the terminating newline character).

  • In general, the commands and code in this answer are not portable. Going by the question's tags and your statement that the OS is Ubuntu, I made use of what Bash and Ubuntu provide.

  • A solution with find is possible. Each of our tests is a pipeline, so find would have to spawn shell(s) anyway.
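
For completeness, a find-based variant could look like the sketch below. It is non-recursive thanks to -maxdepth 1 (a GNU/BSD extension, not POSIX), and it spawns one sh per regular file to run the pipeline. The function name move_nul_files_find and its two parameters are hypothetical; the target directory must already exist.

```shell
# move_nul_files_find SRC DST: move regular files from directory SRC to
# directory DST if their first 2 KiB contain a null character.
# Hypothetical helper. find does not follow symlinks by default, so
# -type f skips symlinks, like the [ -L ] check in the loop.
move_nul_files_find() {
   find "$1" -maxdepth 1 -type f \
      -exec sh -c 'head -c 2048 "$1" | tr -dc "\0" | tr "\0" "\n" | grep -q ""' find-sh {} \; \
      -exec mv -v {} "$2"/ \;
}
```

Usage would be along the lines of move_nul_files_find . /target/directory. Unlike the ./* glob, find also considers hidden files.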