How to find the common paths from a list of paths/files

Question

Prelude:

Given a sorted input of a list of paths/files, how to find their common paths?

Translating into tech term, if feeding the sorted input from stdin, how to pick the shortest proper prefix from the stdin?

Here the "prefix" has the normal meaning, e.g., string 'abcde' has a prefix of 'abc'. Here is my sample input

$ echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2'
/home/dave
/home/dave/file1
/home/dave/sub2/file2

This is an example to remove successive proper prefix from the stdin, using the command sed:

$ echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2' | sed "N; /^\(.*\)\n\1\//D; P; D" 
/home/dave/file1
/home/dave/sub2/file2

Question:

My question is how to preserve the proper prefix instead, and remove all the lines that have that prefix. Sine both /home/dave/file1 and /home/dave/sub2/file2 has the prefix of /home/dave, the /home/dave will be preserved while the other two not. I.e., it will do the complete opposite of what above sed command does.

More info:

The input would be sorted already
If I have /home/dave /home/dave/file1 /home/phil /home/phil/file2 (echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2\n/home/phil\n/home/phil/file2'), I would expect /home/dave and /home/phil to be the answer.

Application:

I have two disk volumes containing similiar content. I want to copy what's in v1 but missing from v2 into another disk volume, v3. Using find, sort, and comm, I am able to get a list of what to copy, but I need to further clean up that list. I.e., as long as I have /home/dave in the list, I don't need the other two.

Thanks!

Dan D. · Accepted Answer · 2014-09-04T01:52:46.517

This answer uses Python. As the OP wanted to remove the directories covered by their parents as I had seen as a possiblity I began writing a different program to remove coverings:

Example:

$ echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2\n/home/phil\n/home/phil/file1' | removecoverings 
/home/phil
/home/dave

Code of the removecoverings command:

#!/usr/bin/env python2

import sys

def list_startswith(a, b):
    if not len(a) >= len(b):
        return False
    return all(x == y for x,y in zip(a[:len(b)],b))

def removecoverings(it):
    g = list(it)
    g.sort(key=lambda v: len(v.split('/')), reverse=True)
    o = []
    while g:
        c = g.pop()
        d = []
        for v in g:
            if list_startswith(v.split('/'), c.split('/')):
                d.append(v)
        for v in d:
            g.remove(v)
        o.append(c)
    return o

for o in removecoverings(l.strip() for l in sys.stdin.readlines()):
    print o

This answer uses Python. It also does a component-wise rather than string-wise common prefix. Better for paths as the common prefix of /ex/ample and /exa/mple should be / not /ex. This assumes that what is wanted is the greatest common prefix and not a list of prefixes with their coverings removed. If you have /home/dave /home/dave/file1 /home/phil /home/phil/file2 and expect /home/dave /home/phil rather than /home. This is not the answer you would be looking for.

Example:

$ echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2' | commonprefix 
/home/dave

Code of the commonprefix command:

#!/usr/bin/env python2

import sys

def commonprefix(l):
    # this unlike the os.path.commonprefix version
    # always returns path prefixes as it compares
    # path component wise
    cp = []
    ls = [p.split('/') for p in l]
    ml = min( len(p) for p in ls )

    for i in range(ml):

        s = set( p[i] for p in ls )         
        if len(s) != 1:
            break

        cp.append(s.pop())

    return '/'.join(cp)

print commonprefix(l.strip() for l in sys.stdin.readlines())

vinegarbin · Answer 2 · 2016-03-02T00:05:16.647

And, the one liner version of xpt's answer. Again, assuming sorted input:

perl -lne 'BEGIN { $l="\n"; }; if ($_ !~ /^\Q$l/) { print $_; $l = $_; }'

Run on the example input

/home/dave
/home/dave/file1
/home/dave/sub2/file2
/home/phil
/home/phil/file2

using

echo -e '/home/dave\n/home/dave/file1\n/home/dave/sub2/file2\n/home/phil\n/home/phil/file2' | perl -lne 'BEGIN { $l="\n"; }; if ($_ !~ /^\Q$l/) { print $_; $l = $_; }'

gives

/home/dave
/home/phil

The magic is in the command-line arguments to perl: -e allows us to give a script on the command line, -n iterates over the lines of the file (placing each line in $_), and -l deals with newlines for us.

The script works by using l to track the last prefix seen. The BEGIN block is run before the first line is read, and initializes the variable to a string that won't be seen (no newlines). The conditional is run on each line of the file (held by $_). The conditional is executed on all of the lines of the file, and says "if the line does not have the current value of l as a prefix, then print the line and save it as the value of l." Because of the command-line arguments, this is essentially identical to the other script.

The catch is that both scripts assume that the common prefix exists as its own line, so don't find the common prefix for input like

/home/dave/file1
/home/dave/file2

score 0 · Answer 3 · answered Sep 06 '14 at 13:14

Given that the input is sorted, The pseudo code would be:

$seen = last_line;
if current_line begins exactly as $seen then next
else { output current_line; $seen = current_line }

Translating into Perl code (Yes Perl, the most beautiful script language of all):

perl -e '
my $l = "\n";
while (<>) {
    if ($_ !~ /^\Q$l/) {
        print;
        chomp;
        $l = $_;
    }
}
'

Credit: Ben Bacarisse @bsb.me.uk, from comp.lang.perl.misc. Thanks Ben, it works great!

How to find the common paths from a list of paths/files

3 Answers3