Count based on unique subset of fields

Question

I have a text file that is structured as follows:

P,ABC,DEF
P,GHI,JKL
B,ABC,DEF
B,MNO,PQR

I want to get a count of how many times a line appears where fields 2 and 3 are the same while preserving field 1. So, the output would look something like this:

2,P,ABC,DEF
1,P,GHI,JKL
2,B,ABC,DEF
1,B,MNO,PQR

uniq -c won't work (as far as I know) because it can't separate by field. sort -u -t, -k2,2 -k3,3 also won't work as it can't count (as far as I know) and the command as written will simply destroy the third line as a duplicate while leaving the first.

Ultimately, what I need returned are lines 2 and 4, because their combination of fields 2 and 3 is unique. But I need to preserve field 1, as it identifies which dataset (in the real world) fields 2 and 3 come from. So a solution that returns only lines 2 and 4 is really what I need.

Accordingly, a solution as follows works as well:

P,GHI,JKL
B,MNO,PQR

score 1 · Accepted Answer · answered Jan 12 '21 at 08:32

Starting from your sort command, I can replace sort's -u with uniq -u, which prints only the lines that are not repeated and lets me use uniq's -f option. This option ignores the given number of leading fields. You want to ignore the first field, so -f1. Because uniq -f only understands blank-separated fields, I need to translate each , to a blank and back:

<data sort -t, -k2,2 -k3,3 | tr , ' ' | uniq -u -f1 | tr ' ' ,
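As a quick check, here is the same pipeline run on the four sample lines from the question. This is only a sketch: the data is fed in with printf rather than read from a file named data, and the variable name result is just for illustration.

```shell
# Sample lines from the question, supplied inline instead of via a "data" file.
result=$(printf '%s\n' 'P,ABC,DEF' 'P,GHI,JKL' 'B,ABC,DEF' 'B,MNO,PQR' |
  sort -t, -k2,2 -k3,3 |   # bring lines with equal fields 2 and 3 together
  tr , ' ' |               # uniq -f works on blank-separated fields
  uniq -u -f1 |            # skip field 1, keep only non-repeated lines
  tr ' ' ,)                # restore the commas
printf '%s\n' "$result"
# prints:
# P,GHI,JKL
# B,MNO,PQR
```

Both ABC,DEF lines are dropped because they match once field 1 is ignored, which leaves the two lines the question asks for.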

While this works with your example dataset, it fails when the data itself contains blanks. This is because uniq -f recognizes a field as [[:blank:]]*[^[:blank:]]*, so any blanks in your actual data will make uniq recognize more fields than you want.
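To make the failure concrete, here is a small sketch with made-up data (the two lines below are hypothetical, not from the question). Their fields 2 and 3 differ ("A B" and "C" versus "A" and "B C"), yet the simple pipeline treats them as duplicates.

```shell
# Hypothetical input: fields 2 and 3 are different on the two lines,
# but both lines turn into "<field1> A B C" once commas become spaces.
naive=$(printf '%s\n' 'X,A B,C' 'Y,A,B C' |
  LC_ALL=C sort -t, -k2,2 -k3,3 |
  tr , ' ' |
  uniq -u -f1 |
  tr ' ' ,)
# uniq sees two identical lines after skipping field 1 and drops both,
# so nothing is left between the brackets.
printf 'naive output: [%s]\n' "$naive"
```

Both lines should have survived, since neither combination of fields 2 and 3 is repeated.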

To overcome this you need to translate actual blanks to non-blanks, perform uniq, then translate back. In the POSIX locale [:blank:] includes the space and the tab character only; in other locales it may include more.

The following command temporarily translates spaces to DC1 characters (device control 1, octal 021) and tabs to DC2 (device control 2, octal 022):

<data sort -t, -k2,2 -k3,3 | tr ' \t,' '\021\022 ' | uniq -u -f1 | tr '\021\022 ' ' \t,'

It should work as long as the data contains neither DC1 nor DC2.
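A minimal sketch of the DC1/DC2 pipeline on made-up data containing spaces (the two input lines are hypothetical). Fields 2 and 3 are "A B" and "C" on one line and "A" and "B C" on the other, so they differ only in where the space sits.

```shell
# Hypothetical input with spaces inside fields 2 and 3.
robust=$(printf '%s\n' 'X,A B,C' 'Y,A,B C' |
  LC_ALL=C sort -t, -k2,2 -k3,3 |
  tr ' \t,' '\021\022 ' |     # space -> DC1, tab -> DC2, comma -> space
  uniq -u -f1 |               # fields are now separated only by former commas
  tr '\021\022 ' ' \t,')      # undo the translation
printf '%s\n' "$robust"
# prints:
# Y,A,B C
# X,A B,C
```

Both lines are kept, because the original spaces are hidden from uniq while it compares the fields. LC_ALL=C is used here only to make the sort order of the example predictable.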

Even if your tr does not support multi-byte characters, the translation will not interfere with multi-byte characters of UTF-8 because the most significant bit in each byte in a multi-byte character in UTF-8 is always 1, while for DC1 or DC2 it's 0.
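A small round-trip sketch to illustrate this, using a made-up line that contains "é" (bytes 303 251 in octal, both with the most significant bit set). The sample string and variable names are only for illustration.

```shell
# "é" is written as octal escapes so the example does not depend on
# the encoding of the script itself.
original=$(printf 'P,caf\303\251 au lait,x')
roundtrip=$(printf '%s\n' "$original" |
  tr ' \t,' '\021\022 ' |     # forward translation used before uniq
  tr '\021\022 ' ' \t,')      # reverse translation used after uniq
# The multi-byte character passes through both translations untouched.
[ "$roundtrip" = "$original" ] && echo "unchanged"
```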
