Capture a Repeating Group in Python using RegEx (see example)

Question

I am writing a regular expression in Python to capture the contents of an SSI tag.

I want to parse the tag:

<!--#include file="/var/www/localhost/index.html" set="one" -->

into the following components:

Tag Function (ex: include, echo or set)
Name of attribute, found before the = sign
Value of attribute, found in between the "'s

The problem is that I am at a loss on how to grab these repeating groups, as name/value pairs may occur one or more times in a tag. I have spent hours on this.

Here is my current regex string:

^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$

It captures the include in the first group and file="/var/www/localhost/index.html" set="one" in the second group, but what I am after is this:

group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"

(continue for every other name="value" pair)

I am using regex101.com to develop my regex.

capture all the tags at once `((?:[a-z]=".*")+?) -->$` then parse it afterwards. Also your regex is needlessly escaped! — Adam Smith, Jul 02 '14 at 20:29
@AdamSmith: That does not work for me. I get two groups when applying that regex: `group 0 : e="/tmp/index.html" set="one" -->`, `group 1: e="/tmp/index.html" set="one"` — NuclearPeon, Jul 02 '14 at 20:32
why not use different patterns for each? It will make it much simpler. — Padraic Cunningham, Jul 02 '14 at 21:39
@PadraicCunningham I thought about doing that, but I was hoping it could be done without. I didn't realize how much effort it would take for something that appeared trivial. — NuclearPeon, Jul 02 '14 at 21:42
The patterns are quite simple individually and if you wanted to create a dict from the key,value pairs it would be very easily accomplished. — Padraic Cunningham, Jul 02 '14 at 21:46
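
The suggestion in the comments above (one simple pattern per component, then a dict of the key/value pairs) might look roughly like the sketch below. It is an illustration, not code posted in the thread: the names `FUNC_RE`, `ATTR_RE` and `tag` are made up here, and it assumes attribute names are lowercase letters and values never contain a double quote.

```python
import re

tag = '<!--#include file="/var/www/localhost/index.html" set="one" -->'

# Hypothetical helper patterns: one for the tag function, one for a single pair.
FUNC_RE = re.compile(r'<!--#([a-z]+)')
ATTR_RE = re.compile(r'([a-z]+)="([^"]*)"')

func = FUNC_RE.match(tag).group(1)

# With two capture groups, findall returns a list of (name, value) tuples,
# which dict() accepts directly.
attrs = dict(ATTR_RE.findall(tag))

print(func, attrs)
```

Note that this does not check that the tag as a whole is well formed; it only picks out the pieces it recognises.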

Adam Smith · Accepted Answer · 2014-07-02T20:43:58.857

3

Grab everything that can be repeated, then parse them individually. This is probably a good use case for named groups, as well!

import re

data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)" ((?:[a-z]+?=".+")+?) -->'''

result = re.match(pat, data)
print(result.groups())
# ('include', 'file', '/var/www/localhost/index.html', 'set="one" reset="two"')

Then iterate through it:

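g1, g2, g3, g4 = result.groups()
for keyvalue in g4.split():  # split on whitespace; assumes no value contains a space
    key, value = keyvalue.split('=', 1)  # split on the first '=' only
    value = value.strip('"')  # drop the surrounding quotes
    # do something with them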

edited Jul 02 '14 at 20:43

answered Jul 02 '14 at 20:36

Adam Smith


`kv = lambda x: x.split('=')` and `{key: val for key, val in [kv(x) for x in m.group(4).split()] }` gives me everything I need in a dictionary. Thanks! – NuclearPeon Jul 02 '14 at 21:30
@NuclearPeon skip the conflating lambda! just do `dict([x.split("=") for x in m.group(4).split()])` – Adam Smith Jul 02 '14 at 21:46
Thanks, I attempted to do it that way, but got a bunch of errors so I resigned to the lambda. This clears it right up! *Edit: I should have known better* – NuclearPeon Jul 02 '14 at 21:47
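
One caveat about the one-liner in these comments: it leaves the double quotes on every value, and splitting on whitespace breaks as soon as a value contains a space. A possible workaround, swapping in the standard library's `shlex.split` (not something suggested in the thread), is sketched below. The `attrs` string is a hypothetical stand-in for `m.group(4)`, and the sketch assumes shell-style quoting rules are acceptable, which also means backslashes inside values are treated as escape characters.

```python
import shlex

# Hypothetical stand-in for m.group(4); one value contains a space, one an '='.
attrs = 'set="one" reset="two words" expr="a=b"'

# Plain whitespace splitting cuts "two words" in half.
naive = attrs.split()

# shlex.split keeps quoted text together and removes the quotes.
# split('=', 1) splits on the first '=' only, so 'a=b' survives as a value.
pairs = dict(token.split('=', 1) for token in shlex.split(attrs))

print(naive)
print(pairs)
```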

Casimir et Hippolyte · Answer 2 · 2014-07-02T21:54:49.077

One way, using the third-party regex module (installable from PyPI):

#!/usr/bin/python

import regex

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    (?>
        \G(?<!^)
      |
        <!-- \# (?<function> [a-z]+ )
    )
    \s+
    (?<key> [a-z]+ ) \s* = \s* " (?<val> [^"]* ) "
'''

matches = regex.finditer(p, s)

for m in matches:
    if m.group("function"):
        print ("function: " + m.group("function"))
    print (" key:   " + m.group("key") + "\n value: " + m.group("val") + "\n")

An alternative with the standard re module:

#!/usr/bin/python

import re

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    <!-- \# (?P<function> [a-z]+ )
    \s+
    (?P<params> (?: [a-z]+ \s* = \s* " [^"]* " \s*? )+ )
    -->
'''

matches = re.finditer(p, s)

for m in matches:
    print ("function: " + m.group("function"))
    for param in re.finditer(r'[a-z]+|"([^"]*)"', m.group("params")):
        if param.group(1) is not None:  # a quoted value, possibly empty
            print (" value: " + param.group(1) + "\n")
        else:
            print (" key:   " + param.group())

+1 for using regex module, although your print statements will need brackets for python3 compatibility. — NuclearPeon, Jul 02 '14 at 21:14
The use of `re.finditer()` for iterating over repeating groups is the core of the solution to me. Parsing them twice however (as the java example recommends against) looks like CPU waste and code duplication, although it can possibly be thought to otherwise simplify calling code, so exercise caution on that one. — Yann Dirson, Feb 14 '23 at 22:03
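
On the point about parsing the attributes twice: with the standard `re` module, the effect of `\G` can be approximated by calling a compiled pattern's `match` method with an explicit start position, so each pair is scanned once and must start exactly where the previous one ended. The sketch below is only an illustration; `HEAD`, `PAIR`, `TAIL` and `parse_ssi` are hypothetical names, and the patterns assume lowercase attribute names and values without embedded double quotes.

```python
import re

HEAD = re.compile(r'<!--#([a-z]+)')                # tag function
PAIR = re.compile(r'\s+([a-z]+)\s*=\s*"([^"]*)"')  # one name="value" pair
TAIL = re.compile(r'\s*-->')                       # closing delimiter


def parse_ssi(tag):
    """Return (function, [(name, value), ...]) or None if the tag is malformed."""
    head = HEAD.match(tag)
    if head is None:
        return None
    pos = head.end()
    attrs = []
    while True:
        # Pattern.match(string, pos) only succeeds if a pair starts at pos.
        pair = PAIR.match(tag, pos)
        if pair is None:
            break
        attrs.append(pair.groups())
        pos = pair.end()
    # Require at least one pair and the closing delimiter right after the last one.
    if not attrs or TAIL.match(tag, pos) is None:
        return None
    return head.group(1), attrs


print(parse_ssi('<!--#include file="/var/www/localhost/index.html" set="one" -->'))
```

Whether this is worth the extra lines compared with the two-step version is a judgment call; it mainly helps when malformed tags should be rejected rather than partially parsed.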

score 1 · Answer 3 · edited May 23 '17 at 11:59

I recommend against using a single regular expression to capture every item in a repeating group. I don't know Python, so this answer is in Java, but the approach carries over: first extract all the attributes as one string, then loop over each name/value pair, like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllAttributesInTagWithRegexLoop {
   public static void main(String[] ignored) {
      String input = "<!--#include file=\"/var/www/localhost/index.html\" set=\"one\" -->";

      // Step 1: capture the tag function and the whole attribute list.
      Matcher m = Pattern.compile(
         "<!--#(include|echo|set) +(.*)-->").matcher(input);

      // group() throws if no match was made, so check the result first.
      if (!m.matches()) {
         throw new IllegalArgumentException("Not an SSI tag: " + input);
      }

      String tagFunc = m.group(1);
      String allAttrs = m.group(2);

      System.out.println("Tag function: " + tagFunc);
      System.out.println("All attributes: " + allAttrs);

      // Step 2: loop over each name="value" pair in the attribute list.
      m = Pattern.compile("(\\w+)=\"([^\"]+)\"").matcher(allAttrs);
      while (m.find()) {
         System.out.println("name=\"" + m.group(1) +
            "\", value=\"" + m.group(2) + "\"");
      }
   }
}

Output:

Tag function: include
All attributes: file="/var/www/localhost/index.html" set="one"
name="file", value="/var/www/localhost/index.html"
name="set", value="one"
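
Since the question is about Python, here is a rough translation of the same two-step approach. It is a sketch rather than part of the original answer: the variable names are chosen for illustration, and the two patterns are carried over unchanged from the Java version.

```python
import re

tag = '<!--#include file="/var/www/localhost/index.html" set="one" -->'

# Step 1: capture the tag function and the whole attribute list.
outer = re.fullmatch(r'<!--#(include|echo|set) +(.*)-->', tag)
if outer is None:
    raise ValueError("not an SSI tag: " + tag)
tag_func, all_attrs = outer.groups()

print("Tag function:", tag_func)
print("All attributes:", all_attrs)

# Step 2: loop over each name="value" pair in the attribute list.
pairs = [m.groups() for m in re.finditer(r'(\w+)="([^"]+)"', all_attrs)]
for name, value in pairs:
    print(f'name="{name}", value="{value}"')
```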

Here's an answer that may be of interest: https://stackoverflow.com/a/23062553/2736496

Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.

answered Jul 02 '14 at 20:34

aliteralmind

+1 for an answer in Java about a Python regex tested on PHP regex tester. – Casimir et Hippolyte Jul 02 '14 at 20:39
@CasimiretHippolyte: If you are referring to the webpage, regex101.com, then it does have the option to test regex in python which I have selected. – NuclearPeon Jul 02 '14 at 20:42
@aliteralmind: While I cannot use Java, I sincerely appreciate the effort in answering. I realize this question may be considered spam, seeing as there are many questions that ask variations of this. I've been reading various articles on regex, including the python regular expression docs (which I've read more than once). It's hard to wrap my head around. Thank you. – NuclearPeon Jul 02 '14 at 20:47
@AdamSmith Regarding jwz’s quip, it is true only insofar as a little knowledge being always a dangerous thing: ***“Perilous to us all are the devices of an art deeper than we possess ourselves.”*** – tchrist Jul 02 '14 at 20:56
@NuclearPeon: Glad to help. I just wanted to express the idea of iterating through the groups, as opposed to trying to do it in one big mega regex. Wrong language, but same concepts. – aliteralmind Jul 02 '14 at 21:13
I'm getting really good insight into what regex *should* be used for from this question. Fancy dynamic programming, not so much... – NuclearPeon Jul 02 '14 at 21:15

Chrispresso · Answer 4 · 2014-07-02T20:54:30.517

0

Unfortunately, Python's re module does not support recursive regular expressions.
You can do this instead:

import re

string = '''<!--#include file="/var/www/localhost/index.html" set="one" set2="two" -->'''
# Raw string so that \w and \s reach the regex engine unchanged;
# the closing quote sits outside the value group so it is not captured.
regexString = r'''<!--\#(?P<tag>\w+)\s(?P<name>\w+)="(?P<value>.*?)"\s(?P<keyVal>.*)\s-->'''
regex = re.compile(regexString)
match = regex.match(string)
tag = match.group('tag')
name = match.group('name')
value = match.group('value')
keyVal = match.group('keyVal').split()  # remaining pairs, split on whitespace
for item in keyVal:
    key, val = item.split('=')
    # You can now do whatever you want with the key=val pair

edited Jul 02 '14 at 20:54

answered Jul 02 '14 at 20:46

Chrispresso

`ValueError: too many values to unpack (expected 2)` you can't iterate through `for key, val in item.split` like that. I did the same thing actually. – Adam Smith Jul 02 '14 at 20:48
You're right, changed to just splitting it instead there – Chrispresso Jul 02 '14 at 20:54

score 0 · Answer 5 · answered Feb 02 '21 at 07:06

The regex library allows capturing repeated groups (while builtin re does not). This allows for a simple solution without needing external for-loops to parse the groups afterwards.

import regex

string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
rgx = regex.compile(
    r'<!--#(?<fun>[a-z]+)(\s+(?<key>[a-z]+)\s*=\s*"(?<val>[^"]*)")+')

match = rgx.match(string)
keys, values = match.captures('key', 'val')
print(match['fun'], *map(' = '.join, zip(keys, values)), sep='\n  ')

gives you what you're after

include
  file = /var/www/localhost/index.html
  set = one
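
For contrast, here is a small sketch of what the built-in `re` module does with the same repeated group. The pattern is adapted for illustration (it uses `(?P<name>...)`, the named-group spelling `re` accepts, and a non-capturing outer group). Each group keeps only its last iteration, which is the limitation the question runs into.

```python
import re

string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
rgx = re.compile(
    r'<!--#(?P<fun>[a-z]+)(?:\s+(?P<key>[a-z]+)\s*=\s*"(?P<val>[^"]*)")+')

match = rgx.match(string)

# Only the final key/value pair is retained; 'file' has been overwritten.
print(match.groupdict())
# {'fun': 'include', 'key': 'set', 'val': 'one'}
```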
