Match pattern and any character after up to space, and rearrange captured patterns with sed

Question

I would like to find a particular pattern (k__) and any characters after it, up to a space, and then move that captured text to the end of the line.

With this example file:

cat test.file
37099   k__Eukaryota species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
73015   k__Eukaryota species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Eukaryota species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__
43925   k__Bacteria species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__

So, I'd like to match k__Eukaryota and k__Bacteria (and other patterns that start with k__) and then move those captured matches to the end of the line. Desired output:

37099    species:s__Isochrysis galbana;genus:g__Isochrysis;family:f__Isochrysidaceae;order:o__Isochrysidales;class:c__Haptophyta;phylum:p__Haptista k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
73015    species:s__Monodus sp. CCMP505;genus:g__Monodus;family:f__Pleurochloridaceae;order:o__Mischococcales;class:c__Xanthophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Eukaryota
43925    species:s__Nannochloropsis oculata;genus:g__Nannochloropsis;family:f__Monodopsidaceae;order:o__Eustigmatales;class:c__Eustigmatophyceae;phylum:p__ k__Bacteria

I thought it would be easy but I can't get it to go. Here is what I've tried:

cat test.file | gsed -E 's#(.*k__)(k__\w\+)(.*)#\1\3\2#'

The idea: capture the text up to the pattern, capture the pattern plus any word characters up to whitespace, capture the rest of the line, and then change the order of the capture groups.

I think I can use back-references to reorder these groups, but I'm probably not matching them correctly. How can I capture everything up to my pattern, then the pattern itself (k__xyz), then the rest of the line, and rearrange those groups? Is this the right approach?

score 0 · Answer 1 · answered Oct 13 '21 at 05:33

In s#(.*k__)(k__\w\+)(.*)#\1\3\2# the main problem is that the first capture group requires k__ and the second one also requires k__, so the expression needs two consecutive k__ to match. Your file contains only one k__ per line, so nothing matches.
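A quick way to confirm this, as a minimal sketch: run the original expression on one line and compare the result with the input. The sample line is shortened from the question's file, and sed here is assumed to be GNU sed (spelled gsed in the question).

```shell
# One line shortened from the question's test.file (assumption: spaces, not tabs).
line='37099   k__Eukaryota species:s__Isochrysis galbana;phylum:p__Haptista'

# Original expression: group 1 and group 2 each demand their own k__.
out=$(printf '%s\n' "$line" | sed -E 's#(.*k__)(k__\w\+)(.*)#\1\3\2#')

# No substitution takes place, so sed prints the line unchanged.
if [ "$out" = "$line" ]; then
    echo "no match: line unchanged"
fi
```

When the s command finds no match, sed still prints the line, which is why the original attempt appears to do nothing rather than report an error.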

As you want k__ to move to the end of the line along with the neighboring text, it should belong to the second group. In the first group a feature called positive lookahead could be used to ensure k__ is right after. sed does not support the feature, but you don't really need it here. Your second capture group is right after the first and it requires k__.

The easiest way to fix your command is to remove k__ from the first group:

<test.file gsed -E 's#(.*)(k__\w+)(.*)#\1\3 \2#'

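Note I used + instead of \+: with -E (extended regular expressions) the plain + is the repetition operator, and this is what works in GNU sed on my Debian. I also added a space between \3 and \2. An alternative is s#(.*)( k__\w+)(.*)#\1\3\2#, which moves the space before k__ together with it, so you don't get four spaces after the leading number; but your desired output does specify four spaces there.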
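Here is a small sketch of the fixed command applied to a single line, so the result can be checked by eye. The sample line is shortened from the question's file, and sed is assumed to be GNU sed (gsed on macOS), because \w is a GNU extension. The second variant with [^ ]+ is an addition of mine, not part of the command above: it takes "any character up to a space" literally, which would matter only if a k__ name contained a non-word character such as a hyphen.

```shell
# One line shortened from the question's test.file (three spaces after the number).
line='43925   k__Bacteria species:s__Nannochloropsis oculata;phylum:p__'

# Fixed command: group 2 alone carries k__ and the word characters after it.
fixed=$(printf '%s\n' "$line" | sed -E 's#(.*)(k__\w+)(.*)#\1\3 \2#')

# Variant (assumption, see lead-in): match everything up to the next space.
fixed_any=$(printf '%s\n' "$line" | sed -E 's#(.*)(k__[^ ]+)(.*)#\1\3 \2#')

printf '%s\n' "$fixed"
printf '%s\n' "$fixed_any"
```

Both commands print the line with k__Bacteria moved to the end and four spaces left after the leading number, as in the desired output.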

A potential problem is that .* in the first group is greedy. This is fine when there is just one k__ in the line; otherwise the second group will match the last k__ in the line rather than the first. There are at least two solutions:

simple non-greedy match: in general .*?, but not with sed;
more specific pattern in the first group, in your case the group may be ( *[0123456789]+ *).
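The difference shows up only on a line with more than one k__. The sketch below uses a made-up line (not from the question's file) to compare the greedy first group with the more specific one. Again, sed is assumed to be GNU sed.

```shell
# Hypothetical line with two k__ tokens, invented for illustration.
line='1   k__A x k__B y'

# Greedy first group: .* swallows as much as it can, so group 2 gets the LAST k__.
greedy=$(printf '%s\n' "$line" | sed -E 's#(.*)(k__\w+)(.*)#\1\3 \2#')

# Specific first group: only the leading number and spaces, so group 2 gets the FIRST k__.
specific=$(printf '%s\n' "$line" | sed -E 's#( *[0123456789]+ *)(k__\w+)(.*)#\1\3 \2#')

printf '%s\n' "$greedy"     # k__B is moved to the end
printf '%s\n' "$specific"   # k__A is moved to the end
```

The specific pattern relies on every line starting with a number followed by spaces, which holds for the sample file but is an assumption about the real data.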

Side note: why <test.file instead of cat test.file | ? See the second half of this answer of mine.
