Regular expression syntax problem

Question

$pattern='`<a\s+[^>]*(href=([\'\"]).*\\2)[^>]*>([^<]*)</a>`isU';

I want to change the ([^<]*) part so that it stops at </a> rather than at the first <, because an <img> tag could be inside the <a> tag.

Can anyone help? I'm lousy at regex.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — dynamic, Jun 13 '11 at 16:19

Francis Gilbert · Answer 1 · 2011-06-13T16:45:53.757

2

You can use an HTML parser for PHP to do this. I wouldn't use regex at all.

You can try: http://simplehtmldom.sourceforge.net/

PHP also has a DOM parser built in: the DOMDocument class from the DOM extension.
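For reference, here is a minimal sketch of the built-in route using DOMDocument. The sample markup and the variable names ($html, $links) are made up for illustration and are not taken from the question.

```php
<?php
// Hypothetical sample input: one link wraps an <img>, the other is plain text.
$html = '<p><a href="a.html"><img src="a.png" alt="A"> first</a> '
      . '<a class="x" href="b.html">second</a></p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    if (!$a->hasAttribute('href')) {
        continue; // skip anchors without an href
    }
    // Rebuild the inner markup by serializing each child node.
    $inner = '';
    foreach ($a->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $links[] = [
        'href'  => $a->getAttribute('href'),
        'inner' => $inner,                  // markup between <a> and </a>
        'text'  => trim($a->textContent),   // text only, tags dropped
    ];
}

foreach ($links as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

Because the parser builds a tree, an <img> inside the <a> needs no special handling; it is just another child node.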

edited Jun 13 '11 at 16:45

answered Jun 13 '11 at 16:22

score 1 · Accepted Answer · answered Jun 13 '11 at 16:19

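Changing ([^<]*) to an ungreedy match-all might do the trick. That is normally written (.*?), but this pattern already has the U modifier, which makes quantifiers ungreedy by default and turns a trailing ? back into greedy, so here it would be plain (.*).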
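A quick way to check how the U modifier interacts with this change is to run both spellings against the same input. This is only a sketch: the sample string and the variable names are made up, and the rest of the pattern is copied from the question.

```php
<?php
// Hypothetical input with two links, the first one wrapping an <img>.
$html = '<a href="a.html"><img src="a.png"></a> and <a href="b.html">B</a>';

// Under the U modifier a bare quantifier is ungreedy ...
$lazy   = '`<a\s+[^>]*(href=([\'\"]).*\\2)[^>]*>(.*)</a>`isU';
// ... and adding ? flips it back to greedy.
$greedy = '`<a\s+[^>]*(href=([\'\"]).*\\2)[^>]*>(.*?)</a>`isU';

preg_match_all($lazy, $html, $lazyMatches, PREG_SET_ORDER);
preg_match_all($greedy, $html, $greedyMatches, PREG_SET_ORDER);

// Two separate links; group 3 of the first is the <img> tag.
echo count($lazyMatches), ': ', $lazyMatches[0][3], PHP_EOL;
// One match that swallows everything up to the last </a>.
echo count($greedyMatches), ': ', $greedyMatches[0][3], PHP_EOL;
```

With the U modifier removed from the pattern, the roles of (.*) and (.*?) would be the other way around.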

Mick Hansen

@yes123 While I do agree, he did ask for a regex fix. – Mick Hansen Jun 13 '11 at 16:22
Then you should tell him not to use regex. – dynamic Jun 13 '11 at 16:22
@Mick I second that. Regex is not the tool to parse HTML, and recommending it only encourages people to try it nevertheless. – Tomalak Jun 13 '11 at 16:25
No downvote from me, BUT regex is just the wrong tool for this job. Showing the OP the correct way to solve his problem is much more helpful to him. – soulmerge Jun 13 '11 at 16:26
Noted, and thanks for the critique. I'm not that seasoned with Stack Overflow yet, but I'll surely keep it in mind in the future. – Mick Hansen Jun 13 '11 at 16:27

score 0 · Answer 3 · answered Jun 13 '11 at 16:35

([^<]*) could be changed to ((?:[^<]|<(?!/a>))*), which uses a negative lookahead to match either a non-< character or a < that is not followed by /a>.
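As a rough check of the lookahead version, the sketch below drops it into the pattern from the question and runs it on a made-up input; the sample string and the variable names are assumptions for illustration only.

```php
<?php
// Pattern from the question with ([^<]*) swapped for the lookahead group.
$pattern = '`<a\s+[^>]*(href=([\'\"]).*\\2)[^>]*>((?:[^<]|<(?!/a>))*)</a>`isU';

// Hypothetical input: the first link contains an <img>, the second is plain text.
$html = '<a href="a.html"><img src="a.png"> A</a><a href="b.html">B</a>';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the href attribute, $m[3] is everything up to the closing </a>.
    echo $m[1], ' => ', $m[3], PHP_EOL;
}
```

The group can never consume a </a>, so each match ends at the first closing tag it meets, whether the quantifier is greedy or not.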

HOWEVER, as stated many times over already, this is not a good way to parse HTML. Firstly, it's horribly inefficient, and secondly, what happens if you have nested tags, such as <a><a></a></a>? While this may not happen with hyperlinks, it's common among many other HTML elements.
