(As has been said many times before, the best solution is to use an HTML parser.)
With GNU grep, try this simplified version:
grep -zPo '<img alt=[^/]+?src="\K[^"]+' ~/movie_local
A fixed version of your original attempt (note the (?s) prefix; see below for an explanation):
grep -zPo '(?s)> <img alt=".*?src="\K.*?(?=")' ~/movie_local
Alternatively, use [\s\S] ad hoc to match any character, including \n:
grep -zPo '> <img alt="[\s\S]*?src="\K.*?(?=")' ~/movie_local
As for why your attempt didn't work:
When you use -P (for PCRE, i.e. Perl-Compatible Regular Expression, support), . does not match \n characters by default, so even though you're using -z to read the entire input at once, .* won't match across line boundaries. You have two choices:
- Set option s ("dotall") at the start of the regex - (?s) - this makes . match any character, including \n
- Ad-hoc workaround: use [\s\S] instead of .
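To see the difference between the two choices and the default behavior, here is a minimal sketch. It assumes GNU grep built with PCRE support; the sample HTML, the temporary file, and the variable names are made up for illustration and are not from your movie_local file. The tr call only turns the NUL that -z appends to each match into a newline.

```shell
# Hypothetical sample input: the <img> tag spans two lines.
sample=$(mktemp)
printf '%s\n' '<a href="/x"> <img alt="Poster"' '  src="poster.jpg" /></a>' > "$sample"

# Default: . does not match \n, so the regex cannot get from alt= to src=.
no_dotall=$(grep -zPo '<img alt=".*?src="\K.*?(?=")' "$sample" | tr '\0' '\n' || true)

# Choice 1: (?s) makes . match \n as well.
with_dotall=$(grep -zPo '(?s)<img alt=".*?src="\K.*?(?=")' "$sample" | tr '\0' '\n' || true)

# Choice 2: [\s\S] matches any character, only where it is needed.
with_class=$(grep -zPo '<img alt="[\s\S]*?src="\K.*?(?=")' "$sample" | tr '\0' '\n' || true)

rm -f "$sample"

printf 'default:  [%s]\n' "$no_dotall"
printf '(?s):     [%s]\n' "$with_dotall"
printf '[\\s\\S]:   [%s]\n' "$with_class"
```

With this sample, the first command should print nothing between the brackets, while the other two should both extract poster.jpg.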
As an aside: the \K construct is a syntactically simpler and sometimes more flexible alternative to a lookbehind assertion ((?<=...)).
- Your command had both, which did no harm in this case, but was unnecessary.
- By contrast, had you tried
(?<=>\s*<img alt=") for more flexible whitespace matching - note the \s* in place of the original single space - your lookbehind assertion would have failed, because lookbehind assertions must be of fixed length (at least as of GNU grep v2.26).
However, using just \K would have worked: >\s*<img alt="\K
\K simply discards everything matched up to that point, so it is not included in the output.
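A minimal sketch of the \K variant next to the lookbehind variant, again assuming GNU grep with PCRE support; the one-line sample input and the variable names are hypothetical. Whether the lookbehind form is rejected depends on the PCRE version grep was built against, so its exit status is only recorded here, not relied on.

```shell
# Hypothetical sample input with several spaces between > and <img.
sample=$(mktemp)
printf '%s\n' '<a href="/x">   <img alt="Poster" src="poster.jpg" /></a>' > "$sample"

# Lookbehind with \s* has no fixed length; record how grep reacts to it.
lookbehind_status=0
grep -Po '(?<=>\s*<img alt=")[^"]+' "$sample" >/dev/null 2>&1 || lookbehind_status=$?

# \K has no such restriction: match the prefix normally, then drop it.
alt_text=$(grep -Po '>\s*<img alt="\K[^"]+' "$sample")

rm -f "$sample"

printf 'lookbehind exit status: %s\n' "$lookbehind_status"
printf 'alt text via \\K: %s\n' "$alt_text"
```

With this input, the \K command should yield just the alt text, Poster, without the > and whitespace that precede it.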