One way would be to use this quick hack:
#!/usr/bin/ruby
=begin
Quick-and-dirty way to grep in *.tar.gz archives

Assumption: each and every file read from any of the supplied tar
archives will fit into memory. If not, the data reading has to be
rewritten (a proxy that reads line-by-line would have to be inserted).
=end
require 'rubygems'
gem 'minitar'
require 'zlib'
require 'archive/tar/minitar'

if ARGV.size < 2
  STDERR.puts "#{File.basename($0)} <regexp> <file>+"
  exit 1
end

regexp = Regexp.new(ARGV.shift, Regexp::IGNORECASE)

for file in ARGV
  zr = Zlib::GzipReader.new(File.open(file, 'rb'))
  begin
    Archive::Tar::Minitar::Reader.new(zr).each do |e|
      next unless e.file?
      data = e.read
      if regexp =~ data
        data.split(/\n/).each_with_index do |l, i|
          puts "#{file},#{e.full_name}:#{i + 1}:#{l}" if regexp =~ l
        end
      end
    end
  ensure
    zr.close  # also closes the underlying File
  end
end
Which is not to say I'd recommend it for bigger archives, as each file from the archive is read into memory (twice, actually: once by e.read and again when the data is split into lines).
If you want a more memory-efficient version, you'd either have to go with a different implementation of the e.read loop... or, perhaps, with a different language altogether. ;)
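To give an idea of what the streaming alternative looks like, here's a minimal stdlib-only sketch that scans a plain .gz file line by line with Zlib::GzipReader (no minitar, and the sample data is made up): memory use stays bounded by the longest line rather than the whole file. Doing the same for tar entries would mean buffering each entry in chunks instead of calling e.read.

```ruby
require 'zlib'
require 'tempfile'

# Build a small gzipped sample file to scan (hypothetical contents).
sample = Tempfile.new(['sample', '.gz'])
Zlib::GzipWriter.open(sample.path) do |gz|
  gz.write("foo line\nbar line\nanother foo\n")
end

regexp = /foo/i

# Stream the decompressed data one line at a time instead of
# slurping it whole; each_line yields lines as they are inflated.
matches = []
Zlib::GzipReader.open(sample.path) do |gz|
  gz.each_line.with_index(1) do |line, lineno|
    matches << "#{sample.path}:#{lineno}:#{line.chomp}" if regexp =~ line
  end
end
puts matches
```

The trade-off is the same as with grep itself: line-oriented streaming only works once you have a decompressed byte stream, which is why the tar layer is the awkward part.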
I could make it a bit more efficient if you're really interested... but it will definitely not compare with C or other compiled languages in terms of raw speed.