What do these regular expressions mean?

Question

preg_match( '/<title>(.*)<\/title>/',.....)
preg_match("/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i",....)

Looks like they would extract information from an HTML page: the title and the addresses of images. – Felix Kling, Feb 10 '11 at 08:43

score 6 · Accepted Answer · edited May 23 '17 at 12:13

The first extracts the contents of an HTML `<title>` tag.

The second extracts images' `src` attributes from an HTML document, but it is very imperfect: it won't catch references to image resources that end in `.jpeg` or have no extension at all.

Regular expressions are not a good idea for parsing HTML! They are far from fireproof. One should use an HTML parser instead.
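As a rough illustration of the parser-based approach, here is a minimal sketch using PHP's built-in `DOMDocument`. The sample markup and the variable names are assumptions made up for this example, not something from the question.

```php
<?php
// Hypothetical sample document, used only for illustration.
$html = '<html><head><title>Example page</title></head>'
      . '<body><img src="a.png"><img src=\'photos/b.jpeg\'><img src="c"></body></html>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them;
// real-world markup is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Text of the first <title> element, or null if there is none.
$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode !== null ? $titleNode->textContent : null;

// The src attribute of every <img>, whatever the file extension.
$sources = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    if ($img->hasAttribute('src')) {
        $sources[] = $img->getAttribute('src');
    }
}

echo $title, "\n";
echo implode(', ', $sources), "\n";
```

Because the parser works on elements and attributes rather than on raw text, the `.jpeg` file and the extensionless `src` are both picked up, and quoting style does not matter.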


answered Feb 10 '11 at 08:43

Pekka


@Pekka Yes, always tell'em [to not do that](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). +1 – Linus Kleen Feb 10 '11 at 08:45
Note that the first regex will fail if the line (or, in multi-line mode, the _entire document_) has multiple `<title>` elements. That may be unlikely for this specific case but in general produces very bad results. – Chris Lutz Feb 10 '11 at 08:48
Why the edit? `The regexes will probably both do a half-way decent job - if part of an existing project, you can probably leave them be. But they are far from fireproof, and if you're building stuff from scratch, don't use this approach.` Most people will continue to use bad code, but they should be encouraged to fix it instead. – beggs Feb 10 '11 at 08:49
Also I think the second regex is terrible too, for the same reason. It's very lazy about validating what can and can't be in a string and may grab too much unless I'm badly mistaken. – Chris Lutz Feb 10 '11 at 08:51
@beggs I'd say it depends on the situation. If it's a newbie finding his way through production code, it won't be their first priority. In general however, you're right, edited that out. @Chris good points! – Pekka Feb 10 '11 at 08:51
I want to know what these symbols (`/`, `(.*)`, `\`) mean. – runeveryday Feb 10 '11 at 09:10
@runeveryday they are patterns and delimiters. `(.*)<\/title>` means "grab everything up to the closing `</title>` and return it as part of the result". The `/` is used as a delimiter around the expression. There's some more info here http://www.regular-expressions.info/php.html – Pekka Feb 10 '11 at 09:12
Thank you, I understand that, but about the second line of code: why is there no `/` delimiter after `]?/i`? – runeveryday Feb 10 '11 at 09:17
@runeveryday `i` is a flag that comes after the delimiter, specifying case insensitive search (in order to also catch `JPG`, `GIF` ....) – Pekka Feb 10 '11 at 09:17
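
To make the points from this comment thread concrete (delimiters, the escaped `\/`, the capture group, the `i` flag, and the greedy `.*` that Chris Lutz mentions), here is a small sketch. The input strings and variable names are made up for illustration.

```php
<?php
// Hypothetical input containing two <title> elements on one line.
$line = '<title>First</title> <title>Second</title>';

// "/" opens and closes the pattern, so the "/" inside </title>
// has to be escaped as "\/". (.*) is a capture group; its text
// ends up in $m[1].
preg_match('/<title>(.*)<\/title>/', $line, $m);
$greedy = $m[1];   // .* is greedy and runs to the LAST </title>

// Adding "?" makes the quantifier lazy: it stops at the first </title>.
preg_match('/<title>(.*?)<\/title>/', $line, $m);
$lazy = $m[1];

// A different delimiter (here "#") avoids the escaping altogether.
preg_match('#<title>(.*?)</title>#', $line, $m);
$hash = $m[1];

// The "i" after the closing delimiter is a flag, not part of the
// pattern: it makes the match case-insensitive.
$withoutFlag = preg_match('/\.(png|jpg|gif)$/', 'PHOTO.JPG');
$withFlag    = preg_match('/\.(png|jpg|gif)$/i', 'PHOTO.JPG');

var_dump($greedy, $lazy, $hash, $withoutFlag, $withFlag);
```

With this input the greedy version captures `First</title> <title>Second`, while both lazy versions capture `First`. The match without the flag returns 0 and the one with `i` returns 1.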

score 0 · Answer 2 · answered Feb 10 '11 at 08:46

1) Matches anything between `<title>` and `</title>`, i.e. an HTML page's title, so running it against `<title>foo</title>` yields `foo` as the captured match.

2) Matches any string following src= that ends in png, jpg or gif. Used to extract the URL of images in HTML code.
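
A quick way to check both descriptions is to run the two patterns from the question against a few sample strings. The inputs below are assumptions invented for this sketch; the patterns themselves are copied unchanged.

```php
<?php
// The two patterns from the question, unchanged.
$titlePattern = '/<title>(.*)<\/title>/';
$srcPattern   = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";

// Hypothetical inputs, made up for illustration.
preg_match($titlePattern, '<title>foo</title>', $m);
$title = $m[1];

preg_match($srcPattern, '<img src="logo.png" alt="Logo">', $m);
$single = $m[1];

// Two images on one line: the greedy .* runs on to the last extension.
preg_match($srcPattern, '<img src="a.png"> <img src="b.gif">', $m);
$double = $m[1];

// "jpeg" is not one of (png|jpg|gif), so nothing matches here.
$jpegMatched = preg_match($srcPattern, '<img src="c.jpeg">', $m);

var_dump($title, $single, $double, $jpegMatched);
```

For these inputs, `$title` is `foo` and `$single` is `logo.png`, but `$double` comes out as `a.png"> <img src="b.gif` and `$jpegMatched` is 0, which is the kind of fragility the other answer warns about.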

Per @Pekka's answer: don't do this in real world code.
