Non-capturing parentheses with lookbehind and lookahead - Python

Question

So I want to capture the indices in a string like this:

 "Something bad happened! @ data[u'string_1'][u'string_2']['u2'][0]"

I want to capture the strings string_1, string_2, u2, and 0.

I was able to do this using the following regex:

re.findall(r"("
           r"((?<=\[u')|(?<=\['))" # Begins with [u' or ['
           r"[a-zA-Z0-9_\-]+" # Followed by any letters, numbers, _'s, or -'s
           r"(?='\])" # Ending with ']
           r")"
           r"|" # OR
           r"("
           r"(?<=\[)" # Begins with [
           r"[0-9]+" # Followed by any numbers
           r"(?=\])" # Ending with ]
           r")", message)

The problem is that the result includes tuples with empty strings:

[('string_1', '', ''), ('string_2', '', ''), ('u2', '', ''), ('', '', '0')]

Now I can easily filter out the empty strings from the result, but I would like to prevent them from appearing in the first place.

I believe this is caused by my capture groups. I tried using ?: in the outer groups, but then my results disappeared entirely.

This is how I attempted it:

re.findall(r"(?:"
           r"((?<=\[u')|(?<=\['))" # Begins with [u' or ['
           r"[a-zA-Z0-9_\-]+" # Followed by any letters, numbers, _'s, or -'s
           r"(?='\])" # Ending with ']
           r")"
           r"|" # OR
           r"(?:"
           r"(?<=\[)" # Begins with [
           r"[0-9]+" # Followed by any numbers
           r"(?=\])" # Ending with ]
           r")", message)

That results in the following output:

['', '', '', '']

I'm assuming the issue is due to me using lookbehinds along with the non-capturing groups. Any ideas on whether this is possible to do in Python?

Thanks

Srdjan M. · Answer 1 · 2018-03-21T21:17:09.763


Regex: (?<=\[)(?:[^'\]]*')?([^'\]]+) or \[(?:[^'\]]*')?([^'\]]+)

Python code:

import re

def Years(text):
    # Skip an optional prefix up to the opening quote, then capture the key.
    return re.findall(r"(?<=\[)(?:[^'\]]*')?([^'\]]+)", text)

print(Years("Something bad happened! @ data[u'string_1'][u'string_2']['u2'][0]"))

Output:

['string_1', 'string_2', 'u2', '0']

edited Mar 21 '18 at 21:17

answered Feb 06 '18 at 21:25


vks · Accepted Answer · 2018-02-06T22:14:01.900


You can simplify your regex.

(?<=\[)u?'?([a-zA-Z0-9_\-]+)(?='?\])

See the demo: https://regex101.com/r/SA6shx/1
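For completeness, here is a minimal sketch of how the pattern above might be used from Python. The variable names `message`, `pattern`, and `keys` are only illustrative; the input is the sample string from the question.

```python
import re

# Sample string taken from the question.
message = "Something bad happened! @ data[u'string_1'][u'string_2']['u2'][0]"

# The pattern from this answer, written as a raw string.
# It contains exactly one capturing group, so re.findall returns
# a flat list holding only that group's text for each match.
pattern = r"(?<=\[)u?'?([a-zA-Z0-9_\-]+)(?='?\])"

keys = re.findall(pattern, message)
print(keys)  # ['string_1', 'string_2', 'u2', '0']
```

The optional `u?'?` prefix is matched but sits outside the capturing group, which is why it does not show up in the result.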

edited Feb 06 '18 at 22:14

answered Feb 06 '18 at 21:35


That will end up capturing the `u'`. I know you're probably going to say that I can use `re.search` and use the `groups`, but groups only capture the very last match of the sub-pattern. See here https://stackoverflow.com/a/9765390 – Mo2 Feb 06 '18 at 21:58
My apologies. This works. Now I'm wondering why it's not grabbing the `u'` – Mo2 Feb 06 '18 at 22:09
@Mo2 re.findall returns only the group if the pattern has one; if not, it returns the whole match – vks Feb 06 '18 at 22:10
I see. TIL. Thank you. I can't remove the downvote unless the answer is edited. It won't let me. – Mo2 Feb 06 '18 at 22:13
Done. Thank you again – Mo2 Feb 06 '18 at 22:16
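
To illustrate the point made in the comments about how `re.findall` treats groups, here is a small sketch. The toy string `"ab"` and the names `whole`, `one`, `many`, and `fixed` are assumptions made for the example. The last call is the pattern from the question with every group made non-capturing, which suggests the empty strings came from the leftover capturing group around the lookbehinds rather than from the lookbehinds themselves.

```python
import re

text = "ab"

# No capturing group: each result is the whole match.
whole = re.findall(r"a|b", text)      # ['a', 'b']

# One capturing group: each result is that group's text,
# or '' when the group did not take part in the match.
one = re.findall(r"(a)|b", text)      # ['a', '']

# Several capturing groups: each result is a tuple, one slot per group.
many = re.findall(r"(a)|(b)", text)   # [('a', ''), ('', 'b')]

# The question's pattern with all groups written as (?:...).
message = "Something bad happened! @ data[u'string_1'][u'string_2']['u2'][0]"
fixed = re.findall(
    r"(?:(?:(?<=\[u')|(?<=\['))[a-zA-Z0-9_\-]+(?='\]))"  # quoted keys
    r"|"
    r"(?:(?<=\[)[0-9]+(?=\]))",                          # bare integers
    message,
)
print(fixed)  # ['string_1', 'string_2', 'u2', '0']
```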
