I'm still quite new to awk and have been trying to use a bash script and awk to filter a file according to a list of codes in a separate text file. While there are a few similar questions around, I have been unable to adapt their implementations.
My first file, idnumbers.txt, looks like this:
4323-7584
K8933-4943
L2837-0493
The file I am attempting to filter the molecule blocks from has entries as follows:
  -ISIS-  -- StrEd -- 
 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (K784-9550)
K784-9550
$$$$
  -ISIS-  -- StrEd -- 
 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-7584)
4323-7584
$$$$
  -ISIS-  -- StrEd -- 
 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-7584)
L2789-0943
$$$$
  -ISIS-  -- StrEd -- 
 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-2738)
4323-2738
> <SALT>
NaCl
$$$$
The file repeats in this fashion, with each block starting at the -ISIS- -- StrEd -- line and ending at $$$$. I need to extract the entire block for every ID in my list, so the expected output is each block, from -ISIS- through $$$$, whose IDNUMBER matches an entry in the ID file.
Each entry has a different length, so I cannot rely on a fixed number of lines; I need everything from the -ISIS- -- StrEd -- line through the closing $$$$.
I have tried a few sed commands that match from the first line of a block to the IDNUMBER and print that range, but none of them worked. My current iteration of the code is as follows:
#!/bin/bash
cat idnumbers.txt | while read line
do
  sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
done
The logic was to find the range that starts at the ISIS line and ends at the relevant ID number, and copy that range to a file. I realise now that this would leave out the $$$$ that terminates each block.
I also have a feeling I am missing something more basic, because it is not actually writing anything to filtered.sdf.
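As far as I can tell, there are at least three reasons the loop produces an empty file, and the small sketch below demonstrates two of them. The variable names (single, double, header, anchored, loose) are only for illustration. The third reason is not shown in code: the > redirection sits inside the loop, so filtered.sdf is truncated on every iteration and only the result for the last ID would survive.

```shell
#!/bin/bash
line='4323-7584'

# Single quotes: the shell passes the text through untouched, so sed
# would search for the literal characters "$line", never for the ID.
single='/^$line$/'
# Double quotes: $line is expanded; the last $ is escaped so it stays
# a regex end-of-line anchor.
double="/^$line\$/"
printf '%s\n%s\n' "$single" "$double"

# The header line has leading spaces and trailing text, so a pattern
# anchored at both ends (^-ISIS-$) cannot match it.
header='  -ISIS-  -- StrEd -- '
anchored=$(printf '%s\n' "$header" | grep -c -e '^-ISIS-$' || true)
loose=$(printf '%s\n' "$header" | grep -c -e '-ISIS-')
printf 'anchored=%s loose=%s\n' "$anchored" "$loose"
```

Even with the quoting and the anchors fixed, a sed range that opens at the first -ISIS- line in the file would run across every block up to the first matching ID, so the range approach itself probably needs rethinking.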
Expected output:
  -ISIS-  -- StrEd -- 
 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-7584)
4323-7584
$$$$
  -ISIS-  -- StrEd -- 
 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-7584)
L2789-0943
$$$$
Edit: I have tried a different approach based on another question. What I have not been able to figure out is how to set the key that awk assigns to each record from the line containing the IDNUMBER, because that line ends up in a different field in each record.
awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
     (NR==FNR){a[$1]=$0; next}
     ($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt
I assume it is a matter of replacing the $1 used as the array key with an expression that picks up the line after > <IDNUMBER> (xyz), but I am unsure how to go about that.
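One direction that might work, sketched under assumptions rather than tested on the real library: read the ID list first, then walk the SDF line by line, buffering each block and printing it at $$$$ only if its ID was in the list. The function name filter_sdf_by_id is made up for this example. The sketch assumes the match should be made on the ID in parentheses on the > <IDNUMBER> (...) line, which is what the expected output above suggests, since the block whose data line reads L2789-0943 is included. It avoids a multi-character RS, whose behaviour POSIX leaves unspecified.

```shell
#!/bin/bash
# Hypothetical helper: filter_sdf_by_id ID_FILE SDF_FILE
# Prints every block of SDF_FILE whose <IDNUMBER> header ID is listed in ID_FILE.
filter_sdf_by_id() {
  awk '
    # First file: remember each non-empty ID (strip a trailing CR, if any).
    NR == FNR { sub(/\r$/, ""); if ($1 != "") ids[$1]; next }

    # Second file: buffer the lines of the current block.
    { block = block $0 "\n" }

    # Header line such as ">  <IDNUMBER> (4323-7584)": pull out the ID
    # between the parentheses and check it against the list.
    /^> *<IDNUMBER>/ {
      id = $0
      sub(/^[^(]*\(/, "", id)   # drop everything up to and including "("
      sub(/\).*$/, "", id)      # drop ")" and anything after it
      if (id in ids) keep = 1
    }

    # Block terminator: print the buffered block if it matched, then reset.
    /^\$\$\$\$/ {
      if (keep) printf "%s", block
      block = ""; keep = 0
    }
  ' "$1" "$2"
}
```

Usage would then be something like: filter_sdf_by_id idnumbers.txt compound_library.sdf > filtered.sdf. To match on the data line after the header instead, the header rule could set a flag and the ID could be taken from the following line.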