How to copy multiple-line-regex outputs into clipboard using Notepad++

Question

I have a fasta file containing genome sequences of multiple viruses.

Example:

>gi_138375030_Human_papillomavirus
GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA...

>gi_94481944_Human_herpesvirus_3
GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA...

I want to extract only the herpes virus entries, including the actual sequence, which (in this file) is always the line following the description.

The following regex works:

>.*herpes.*\n.*\n

It selects the description and the sequence lines.
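As a quick check of what this pattern selects outside an editor, here is a minimal Python sketch. Python is not part of the question, `extract_pairs` is a hypothetical name, and the sample text is a shortened version of the two example entries above. It assumes Unix line endings, as the `\n` in the pattern does.

```python
import re

# The pattern from the question: a header line containing "herpes",
# followed by exactly one more line (the sequence).
PATTERN = re.compile(r">.*herpes.*\n.*\n")

def extract_pairs(text):
    """Return every header+sequence pair matched by PATTERN."""
    return PATTERN.findall(text)

# Shortened, illustrative sample based on the question's example.
sample = (
    ">gi_138375030_Human_papillomavirus\n"
    "GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA\n"
    ">gi_94481944_Human_herpesvirus_3\n"
    "GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA\n"
)

for pair in extract_pairs(sample):
    print(pair, end="")
```

Each returned string holds both lines of a match, which is the piece that the bookmark and "find all" features lose.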

I have found similar questions but all make use of the "bookmark line" function: Export all regular expression matches in Textpad or Notepad++ as a list

However, this only bookmarks the first line of the regex output, so I am unable to use the described solutions. If I use "find all in current document", it also only lists the first lines.

All I want to do is copy the regex matches into a new file. It is especially frustrating because the search finds just over a hundred entries, which is slightly more than I would be willing to copy by hand.

I would prefer a solution that works on Windows.

g2mk · Answer 1 · 2015-11-24T19:48:34.177

You could try combining the regex search with a macro (standard Npp shortcuts):

Make sure the file ends with an empty line; this is useful when running the macro to the end of the file (a main menu entry).
Search (Ctrl+F) for your sequence >.*herpes.*\n.*\n with "Wrap around" unchecked, so the search does not wrap past the start of the file.
Move to the beginning of the file (Ctrl+Home).
Search again (F3).
Start recording a macro (Ctrl+Shift+R).
Go to the beginning of the line (Home); the cursor should now be at the start of the first line of the match.
Bookmark the line (Ctrl+F2).
Move the cursor to the end of the second line (Down, then End).
Bookmark that line as well (Ctrl+F2).
Search again (F3).
Stop recording the macro (Ctrl+Shift+R).

Now you should have a working macro. You can check it by playing it back (Ctrl+Shift+P). If something goes wrong, you can undo (Ctrl+Z) or reload the file from disk (another main menu entry) and try recording the macro again.

Then:

Run the macro to the end of the file.
Now you can copy the bookmarked lines or delete the unbookmarked ones and...

score 2 · Accepted Answer · answered Nov 24 '15 at 21:24

You could make a copy of the file and then, on the copy, search and replace the negation of what you want:

(?!>.*herpes.*)^(>.*\R)([ATGC]+\R)

The above will (or ought to) find paired lines that do not contain herpes. Combined with a blank replace field, this leaves you with a file that has only what you are looking for.
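The same "delete everything that is not herpes" idea can be tried on a copy of the data in Python. This is a sketch under assumptions, not the answerer's exact expression: Python's `re` module does not support `\R`, so `\r?\n` is used instead, the lookahead is placed after `^`, and `drop_non_herpes` is a hypothetical name. Like the regex above, it only recognizes sequences made of A, T, G and C.

```python
import re

# (?m) makes ^ match at the start of every line.
# The negative lookahead rejects header lines that contain "herpes".
NON_HERPES = re.compile(r"(?m)^(?!>.*herpes)>.*\r?\n[ATGC]+\r?\n")

def drop_non_herpes(text):
    """Delete every header+sequence pair whose header lacks 'herpes'."""
    return NON_HERPES.sub("", text)

# Made-up, shortened sample entries for illustration.
sample = (
    ">gi_138375030_Human_papillomavirus\n"
    "GAAAGTTTCAATC\n"
    ">gi_94481944_Human_herpesvirus_3\n"
    "GGCCCAGCCCTCT\n"
)

print(drop_non_herpes(sample), end="")
```

What remains after the substitution is the herpes entries only, mirroring the search-and-replace with an empty replacement field.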

SΛLVΘ · Answer 3 · 2015-11-28T10:54:04.733

Not an Npp solution; in Windows PowerShell:

Select-String "herpes" viruses.fas -context 0, 2 | % { $_.Line ; $_.Context.PostContext } | clip

Handier batch version:

@echo off
powershell "$what  = Read-Host String to search      ; "^
           "$where = Read-Host In which file         ; "^
           "Select-String $what $where -context 0, 2 | "^
           "%% { $_.Line ; $_.Context.PostContext }  | "^
           "clip"

Save it with a .bat extension (e.g. "clipvir.bat") in the same folder as your .fas files. You can create a shortcut to the script on your quick launch / applications bar, or on your desktop.
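For readers without PowerShell, the "matching line plus the lines after it" idea can also be sketched in Python. This is an illustrative analogue rather than the answerer's method: `match_with_context` is a hypothetical helper, it does plain case-insensitive substring matching instead of regex matching, and it does not try to reproduce every detail of how Select-String reports context.

```python
def match_with_context(lines, needle, after=2):
    """Yield each line containing `needle` (case-insensitive),
    plus up to `after` lines that follow it."""
    remaining = 0
    wanted = needle.lower()
    for line in lines:
        if wanted in line.lower():
            remaining = after  # restart the context window at each match
            yield line
        elif remaining > 0:
            remaining -= 1
            yield line

# Made-up sample with a blank line between entries, as in the question.
sample_lines = [
    ">gi_138375030_Human_papillomavirus",
    "GAAAGTTTCAATC",
    "",
    ">gi_94481944_Human_herpesvirus_3",
    "GGCCCAGCCCTCT",
    "",
]

print("\n".join(match_with_context(sample_lines, "herpes")))
```

On Windows, the printed output could be piped to `clip`, as the PowerShell one-liner does.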

score 0 · Answer 4 · edited Dec 01 '15 at 14:29

I used the following solutions:

Use the regex ">.*herpes.*\n[\nAGCTN]*" in EditPad Lite with its "search>copy_matches" option,

or, in a bash shell, use:

cat virus_all.fasta | pcregrep --buffer-size 1000000 -M ">.*herpes.*\n[\nAGCTN]*" > herpes1.fasta

The regex works even if the sequence following the header spans multiple lines. With the second approach, you end up with a new file.
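The multi-line behaviour of this regex can be checked with a short Python sketch. The function names and the sample entries below are hypothetical. Files are opened in text mode, so Windows line endings are read as `\n` before the pattern is applied.

```python
import re

# Header containing "herpes", then any run of sequence characters and
# newlines; the match stops at the ">" that starts the next header.
RECORD = re.compile(r">.*herpes.*\n[\nAGCTN]*")

def herpes_records(text):
    """Return each herpes entry, including wrapped sequence lines."""
    return RECORD.findall(text)

def extract_to_file(src_path, dst_path):
    """Write all herpes entries from src_path into dst_path."""
    with open(src_path) as src:
        records = herpes_records(src.read())
    with open(dst_path, "w") as dst:
        dst.write("".join(records))

# Made-up sample in which each sequence is wrapped over two lines.
sample = (
    ">gi_1_Human_papillomavirus\n"
    "GAAAGTTT\n"
    "CAATCATA\n"
    ">gi_2_Human_herpesvirus_3\n"
    "GGCCCAGC\n"
    "CCTCTCGN\n"
    ">gi_3_Other_virus\n"
    "ACGT\n"
)

print("".join(herpes_records(sample)), end="")
```

Calling `extract_to_file("virus_all.fasta", "herpes1.fasta")` would mirror the pcregrep command, using the file names from that example.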

answered Dec 01 '15 at 12:54 by moomox · edited Dec 01 '15 at 14:29 by Dawny33
