
I have what seems to be a simple task, but I can't get it to work. Using only PCRE2 regex (nothing else), I am trying to capture the phrase before the first colon at the beginning of a line, then capture each of the comma-separated values that follow, all within the same group.

Here are some sample texts:

Shapes: circle, rectangle, triangle  
Junk line: this part, here, should work, but: make sure, that last colon, isn't caught

Should be captured as such:

Group 1:

Shapes:  

Group 2:

circle  
rectangle  
triangle  

Group 1:

Junk line:  

Group 2:

this part  
here  
should work  
but: make sure  
that last colon  
isn't caught  

I know comma separated values can be captured many ways, like this:

([^,]+)

But if I try to add anything to the beginning, the match stops after the first comma, so this:

(.*):([^,]+)

Will not work (plus it captures the second colon in a line anyway). Any help is appreciated!

EDITED TO ADD: The matching should stop at the end of the line, so something like this:

One: two, three  
Yellow: Blue, Green  

Should not catch Yellow as part of two, three. Yellow should be caught as a new instance of group one.

Destroy666
  • 12,350

2 Answers


This regex should work for your case if I understood correctly:

(?:^|\n)([^:]+):|(?:\s?)([^,\n]+)(?:,|$)

Basically, you first match everything from the start of a line (or a newline) up to a colon with (?:^|\n)([^:]+):. If you want the capture to include the colon (as the examples show, although the first sentence says otherwise), move the colon inside the capture group.

Then you have an alternative of matching phrases consisting of:

  • (?:\s?) - non-captured optional whitespace
  • ([^,\n]+) - anything that's not a comma or newline
  • (?:,|$) - either a non-captured comma or end of line
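To see how the two alternatives pair up in practice, here is a minimal sketch that runs the pattern above and attaches each group 2 value to the most recent group 1 phrase. It uses Python's standard `re` module only as a stand-in, on the assumption that this particular pattern behaves the same there as in PCRE2 with the multiline flag; the `collect` helper is a hypothetical name, not part of the answer.

```python
import re

# Same pattern as in the answer; MULTILINE so that $ matches at each line end.
# Assumption: Python's `re` treats this syntax like PCRE2 does.
PATTERN = re.compile(r"(?:^|\n)([^:]+):|(?:\s?)([^,\n]+)(?:,|$)", re.MULTILINE)


def collect(text):
    """Pair each leading phrase (group 1) with the values (group 2) that follow it."""
    results = []
    for match in PATTERN.finditer(text):
        if match.group(1) is not None:
            # First alternative matched: start a new entry for this phrase.
            results.append((match.group(1), []))
        elif results:
            # Second alternative matched: attach the value to the latest phrase.
            results[-1][1].append(match.group(2))
    return results


sample = (
    "Shapes: circle, rectangle, triangle\n"
    "Junk line: this part, here, should work, but: make sure, "
    "that last colon, isn't caught"
)
for phrase, values in collect(sample):
    print(phrase, values)
```

Each match fills only one of the two groups, so the grouping into "phrase plus its values" has to happen outside the regex, as the loop does here.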

Demo: https://regex101.com/r/qOS9Hc/1

But as I mentioned in a comment below the question, I'm not sure why you're using a regex for that. It's much simpler with basic text processing: split each line on the first colon, then split the remainder on commas.
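For comparison, the non-regex approach could look roughly like the sketch below. Python is an arbitrary choice here (the question does not name a host language), and `parse_line` is a hypothetical helper name.

```python
def parse_line(line):
    """Split a line into (phrase, values) at the first colon; None if no colon."""
    phrase, colon, rest = line.partition(":")  # only the first colon splits
    if not colon:
        return None
    # Later colons stay inside their value, e.g. "but: make sure".
    values = [value.strip() for value in rest.split(",") if value.strip()]
    return phrase.strip(), values


text = "One: two, three\nYellow: Blue, Green"
# Processing line by line means a value can never run into the next line.
for line in text.splitlines():
    print(parse_line(line))
```

Because `partition` only splits at the first colon, the second colon in the "Junk line" sample is left untouched inside its value.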

Also note that this pattern can match other kinds of input as well, unless you add (?!^)(?<=\G) at the beginning of the 2nd alternative:

(?:^|\n)([^:]+):|(?!^)(?<=\G)(?:\s?)([^,\n]+)(?:,|$)

This ensures that the first match, the phrase with a colon, appears at the beginning of the string.
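As a rough illustration of the kind of unintended match the \G guard is meant to prevent, the sketch below runs the unguarded pattern on a line that has no "phrase:" prefix at all. Python's standard `re` module is used as a stand-in for PCRE2; it does not support \G, so only the unguarded behaviour can be shown this way. `loose_values` is a hypothetical helper name.

```python
import re

# Unguarded pattern from the answer (no \G in the second alternative).
UNGUARDED = re.compile(r"(?:^|\n)([^:]+):|(?:\s?)([^,\n]+)(?:,|$)", re.MULTILINE)


def loose_values(text):
    """Return every group-2 capture, whether or not a 'phrase:' came before it."""
    return [m.group(2) for m in UNGUARDED.finditer(text) if m.group(2) is not None]


# No colon anywhere, yet the second alternative still matches each item.
print(loose_values("just, some, words"))
```

With the guard in place, the second alternative is only allowed to continue from where the previous match ended, so a line like this one would yield no values.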

Destroy666
  • 12,350

Using PCRE2, language independent, I'd use:

(^[^:]+:|\G(?!^))\h*([^,\r\n]+),?

Demo & explanation

Toto
  • 19,304