I have a fasta file containing genome sequences of multiple viruses.

Example:

>gi_138375030_Human_papillomavirus
GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA...

>gi_94481944_Human_herpesvirus_3
GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA...

I want to extract only the herpes virus entries, including the actual sequence, which (in this file) is always the line following the description.

The following regex works:

>.*herpes.*\n.*\n

It selects the description and the sequence lines.

I have found similar questions, but they all rely on the "bookmark line" function, for example: Export all regular expression matches in Textpad or Notepad++ as a list

However, this only bookmarks the first line of the regex output, so I am unable to use the described solutions. If I use "find all in current document", it also only lists the first lines.

All I want to do is copy the regex matches into a new file. It is especially frustrating because the search finds just over a hundred entries, which is slightly more than I would be willing to copy by hand.

I would prefer a solution that works on Windows.

moomox

4 Answers

You could try to combine RegEx search with a macro (standard Npp shortcuts):

  • Make sure the file ends with an empty line - this helps when running the macro to the end of the file from the main menu.
  • Search (Ctrl+f) for your pattern >.*herpes.*\n.*\n with wrap-around disabled.
  • Move to the beginning of the file (Ctrl+Home).
  • Search again (F3).
  • Start macro recording (Ctrl+Shift+r).
  • Go to the beginning of the line (Home) - you should be at the start of the first line of the entry.
  • Bookmark the line (Ctrl+F2).
  • Move the cursor to the end of the second line (Down, then End).
  • Bookmark that line as well (Ctrl+F2).
  • Search again (F3).
  • Stop macro recording (Ctrl+Shift+r).

Now you should have a working macro. You can check it by playing it back (Ctrl+Shift+p). If something goes wrong, you can undo (Ctrl+z) or reload the file from disk (another main menu entry) and record the macro again.

Then:

  • Run macro to the end of the file.
  • Now you can copy bookmarked lines or delete unbookmarked ones and...
g2mk

You could make a copy of the file and then, on the copy, search and replace the negation of what you want:

(?!>.*herpes.*)^(>.*\R)([ATGC]+\R)

The above will (or ought to) find paired lines whose header does not contain herpes. Combine this with an empty replace field and you will end up with a file that contains only what you are looking for.
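
For readers who would rather script the same "delete everything else" idea, here is a minimal Python sketch. It is an adaptation, not the exact pattern above: Python's re module has no \R, so \r?\n is used instead, and the names NON_HERPES_PAIR and drop_non_herpes are made up for this example. Like the original pattern, it assumes each sequence sits on a single line made up only of A, T, G and C.

```python
import re

# Adapted pattern: \R is not available in Python's re, so \r?\n is used.
# re.MULTILINE makes ^ match at the start of every line.
NON_HERPES_PAIR = re.compile(r"^(?!>.*herpes)>.*\r?\n[ATGC]+\r?\n", re.MULTILINE)


def drop_non_herpes(fasta_text: str) -> str:
    """Remove every header/sequence pair whose header lacks 'herpes'."""
    return NON_HERPES_PAIR.sub("", fasta_text)


# Small in-memory sample based on the entries in the question.
sample = (
    ">gi_138375030_Human_papillomavirus\n"
    "GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA\n"
    ">gi_94481944_Human_herpesvirus_3\n"
    "GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA\n"
)

print(drop_non_herpes(sample), end="")
```

Entries whose sequence contains other characters (such as N) or is wrapped over several lines are not matched by [ATGC]+ and would therefore be left in the output, so the result is worth checking by eye.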

Yorik

Not an Npp solution; in Windows PowerShell:

Select-String "herpes" viruses.fas -context 0, 2 | % { $_.Line ; $_.Context.PostContext } | clip

Handier batch version:

@echo off
powershell "$what  = Read-Host String to search      ; "^
           "$where = Read-Host In which file         ; "^
           "Select-String $what $where -context 0, 2 | "^
           "%% { $_.Line ; $_.Context.PostContext }  | "^
           "clip"

Save it with a .bat extension (e.g. "clipvir.bat") in the same folder as your .fas files. You can create a shortcut to the script on your quick launch / applications bar or on your desktop.
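
A similar line-based approach can be sketched in Python, which also runs on Windows. This is only an illustration under the question's assumption that the sequence is on the line right after its header. Unlike the -context 0, 2 call above, it keeps one line after each header rather than two, and skips blank lines in between. The function name header_and_next_line and its needle parameter are hypothetical.

```python
from typing import Iterable, Iterator


def header_and_next_line(lines: Iterable[str], needle: str = "herpes") -> Iterator[str]:
    """Yield each header containing `needle` and the first non-blank line after it."""
    take_next = False
    for line in lines:
        if line.startswith(">"):
            # A new header decides whether the following line is wanted.
            take_next = needle in line
            if take_next:
                yield line
        elif take_next and line.strip():
            # First non-blank line after a matching header: the sequence.
            yield line
            take_next = False


# Small in-memory sample based on the entries in the question.
sample_lines = [
    ">gi_138375030_Human_papillomavirus\n",
    "GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA\n",
    "\n",
    ">gi_94481944_Human_herpesvirus_3\n",
    "GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA\n",
]

print("".join(header_and_next_line(sample_lines)), end="")
```

To process a real file, an open file object can be passed in place of sample_lines, and the result can be written to a new file with writelines.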

SΛLVΘ

I used the following solutions:

Use the regex ">.*herpes.*\n[\nAGCTN]*" in EditPad Lite with its "search>copy_matches" option,

or use:

cat virus_all.fasta | pcregrep --buffer-size 1000000 -M ">.*herpes.*\n[\nAGCTN]*" > herpes1.fasta

in a bash shell.

The regex works even if the sequence following the header spans multiple lines. In the second example you end up with a new file.
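
The same regex can also be tried in Python, which may be convenient on Windows. The following is a minimal sketch rather than a tested pipeline; the name extract_records and the sample entries are invented for illustration. It relies on the pattern stopping at the next ">" because that character is not in the [\nAGCTN] class.

```python
import re

# Same pattern as above: a header containing "herpes", then any run of
# sequence lines made of A, G, C, T, N (and the newlines between them).
RECORD = re.compile(r">.*herpes.*\n[\nAGCTN]*")


def extract_records(fasta_text: str) -> list:
    """Return every matching header together with its sequence lines."""
    return [match.group(0) for match in RECORD.finditer(fasta_text)]


# Made-up sample with sequences wrapped over two lines.
sample = (
    ">gi_1_Human_papillomavirus\n"
    "GAAAG\n"
    "TTTCA\n"
    ">gi_2_Human_herpesvirus_3\n"
    "GGCCC\n"
    "AGCCN\n"
    ">gi_3_Another_virus\n"
    "ACGT\n"
)

print("".join(extract_records(sample)), end="")
```

When the text comes from a file opened with Python's default text mode, Windows line endings are translated to \n on reading, so the pattern does not need a separate \r case.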

Dawny33
moomox