I am trying to figure out how to make a regex capture several items, but only those that come after one particular marker. I am using Python for this. For example, given the text B <4>.<5>  <6> A <1> m<2>  . <3>, I want to capture only 1, 2, and 3. I thought a regular expression like A.*?<(.+?)> would work, but it only captures the final 3 when I use Python's re.findall. Can I get any help with this?
 
    
- Are you trying to capture the 1, 2, and 3 as separate groups or one group containing all of them? – BrenBarn Oct 06 '13 at 18:30
- possible duplicate of [Python regex multiple groups](http://stackoverflow.com/questions/4963691/python-regex-multiple-groups) – BrenBarn Oct 06 '13 at 18:33
- It doesn't matter to me, but I was originally trying to make them in separate groups. – Paul Oct 06 '13 at 18:35
3 Answers
The third-party regex module (available on PyPI, and once proposed as a replacement for re) supports variable-width lookbehinds, which makes this fairly easy:
s = "B <4>.<5> <6> A23 <1> m<2> . <3>"
import regex  # third-party package: pip install regex
print(regex.findall(r'(?<=A\d+.*)<.+?>', s))
# ['<1>', '<2>', '<3>']
(I'm using A\d+ instead of just A to make things interesting.) If you're bound to the stock re, you're forced into ugly workarounds like this:
import re
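# The lookahead is tested before each remaining character, so a marker directly after the match is also rejected.
print(re.findall(r'(<[^<>]+>)(?=(?:(?!A\d+).)*$)', s))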
# ['<1>', '<2>', '<3>']
or pre-splitting:
print(re.findall(r'<.+?>', re.split(r'A\d+', s)[-1]))
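Another stock-re option, offered as a sketch rather than as part of the answer above: locate the marker with re.search, then pass the end of that match as the pos argument of a compiled pattern's findall. The helper name items_after_marker is hypothetical. Note that it starts after the first marker, whereas the [-1] split keeps only the text after the last one.

```python
import re

ITEM = re.compile(r'<[^<>]+>')   # one bracketed item
MARKER = re.compile(r'A\d+')     # the marker used in the examples above

def items_after_marker(s):
    """Return bracketed items found after the first marker (hypothetical helper)."""
    m = MARKER.search(s)
    if m is None:
        return []  # no marker, so nothing qualifies
    # Pattern.findall accepts a start position, so no slicing is needed.
    return ITEM.findall(s, m.end())

s = "B <4>.<5> <6> A23 <1> m<2> . <3>"
print(items_after_marker(s))  # ['<1>', '<2>', '<3>']
```

Returning an empty list when the marker is missing is a design choice made for this sketch; raising an error may suit other callers better.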
 
    
It would be easier with a variable-width lookbehind, but an alternative is to make sure there's no A after the parts you're matching, so that you can use something like:
re.findall(r'<(.+?)>(?![^A]*A[^A]*$)', 'B <4>.<5> <6> A <1> m<2> . <3>')
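But there's a problem here: (.+?) accepts any character, including >, so when the lookahead fails the lazy group can backtrack past the closing bracket and capture far more than you're looking for. You can use a negated class instead: [^>]+ rather than .+?.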
This means:
re.findall(r'<([^>]+)>(?![^A]*A[^A]*$)', 'B <4>.<5> <6> A <1> m<2> . <3>')
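(?![^A]*A[^A]*$) makes sure there's no A ahead of the part you're capturing. Strictly, it rejects a match only when exactly one A remains before the end of the string, which is enough when the text contains a single A.
(?! ... ) is a negative lookahead, which makes the match fail if what's inside it matches.
[^A]* matches any run of characters other than A (possibly empty).
$ matches the end of the string.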
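To see why the negated class matters, here is a small check of the two patterns above. With (.+?), a failed lookahead lets the lazy group backtrack across > characters, so the first capture swallows everything up to <1>. The last pattern, (?![^A]*A), is an assumed simplification that does not come from the answer, and the two_markers input is made up for illustration: it shows that the original lookahead only rejects a match when exactly one A remains ahead.

```python
import re

text = 'B <4>.<5> <6> A <1> m<2> . <3>'

# Lazy dot: backtracks past '>' until the lookahead finally succeeds.
lazy = re.findall(r'<(.+?)>(?![^A]*A[^A]*$)', text)
# Negated class: the group can never cross a '>'.
negated = re.findall(r'<([^>]+)>(?![^A]*A[^A]*$)', text)
print(lazy)     # ['4>.<5> <6> A <1', '2', '3']
print(negated)  # ['1', '2', '3']

# Made-up input with two markers.
two_markers = '<7> A <9> A <1>'
original = re.findall(r'<([^>]+)>(?![^A]*A[^A]*$)', two_markers)
# Assumed variant: reject a match whenever any A is still ahead.
simpler = re.findall(r'<([^>]+)>(?![^A]*A)', two_markers)
print(original)  # ['7', '1']
print(simpler)   # ['1']
```

The simpler variant keeps only the items after the last A, which may or may not be what you want when the marker can repeat.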
 
    
As it currently stands, your pattern matches text between < and > that comes after an A followed by as few characters as possible. Every match therefore has to begin at its own A, and your text contains only one, so the only part that fulfills this condition is <1> (which is why that is all that gets returned).
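A quick check of that behavior, as a minimal sketch. The second input is made up for illustration and places an A before every item, which is the only situation in which this pattern returns all of them.

```python
import re

text = 'B <4>.<5> <6> A <1> m<2> . <3>'
single = re.findall(r'A.*?<(.+?)>', text)
print(single)  # ['1']

# Made-up input: each item has its own A in front of it.
repeated_text = 'A <1> A <2> A <3>'
repeated = re.findall(r'A.*?<(.+?)>', repeated_text)
print(repeated)  # ['1', '2', '3']
```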
There are many ways to fix this problem, but I think the most straightforward is to first split on A, then use <(.+?)>:
>>> from re import findall, split
>>> text = 'B <4>.<5> <6> A <1> m<2> . <3>'
>>> text = split('A', text)
>>> text
['B <4>.<5> <6> ', ' <1> m<2> . <3>']
>>> text = text[1]
>>> text
' <1> m<2> . <3>'
>>> text = findall('<(.+?)>', text)
>>> text
['1', '2', '3']
>>>
Above is a step-by-step demonstration. Below is the code you will want:
>>> text = 'B <4>.<5> <6> A <1> m<2> . <3>'
>>> findall('<(.+?)>', split('A', text)[1])
['1', '2', '3']
>>>
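If the marker is a fixed string, the same idea can be sketched without re.split by using str.partition, which also avoids an IndexError when the marker is missing. The helper name items_after and its default marker are assumptions made for this example, and the negated class [^<>]+ is swapped in for .+? so a capture cannot run across brackets.

```python
import re

def items_after(text, marker='A'):
    """Return the contents of <...> items after the first marker (hypothetical helper)."""
    # partition returns (before, separator, after); after is '' if the marker is absent.
    _, _, tail = text.partition(marker)
    return re.findall(r'<([^<>]+)>', tail)

print(items_after('B <4>.<5> <6> A <1> m<2> . <3>'))  # ['1', '2', '3']
print(items_after('B <4>.<5> <6>'))                   # []
```

Unlike split('A', text)[1], this keeps everything after the first A even when further As appear later in the text.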
- Isn't it the other way around? (?.+) instead of (.+?)? I think you are trying to make a "non-greedy" search. Am I right? EDIT: You are right. It's (.+?) according to Python's reference. – Robson França Oct 06 '13 at 18:36
- @RobsonFrança `(?.+)` is not valid regex. `(?:.+)` maybe, but not `(?.+)`. – Jerry Oct 06 '13 at 18:39