For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags, select or delete inner/outer tags, and edit the content in-place.
Here is the command:
$ ex +"%s/^[^>].*>\([^<]\+\)<.*/\1/g" +"g/[a-zA-Z]/d" +%p -scq! file.html
54
1
0 (0/0)
0
This is how the command works:
- Use the ex in-place editor to substitute on all lines (%) with: ex +"%s/pattern/replace/g".
  The substitution pattern consists of 3 parts:
  - Select from the beginning of the line up to > (^[^>].*>) for removal, right before the 2nd part.
  - Select our main part up to < (\([^<]\+\)).
  - Select everything else from < onwards for removal (<.*).
  We replace the whole matching line with \1, which refers to the pattern inside the brackets (\(\)).
- After substitution, we remove any remaining lines that contain letters by using global: g/[a-zA-Z]/d.
- Finally, print the current buffer on the screen with +%p.
- Then silently (-s) quit without saving (-c "q!"), or save into the file (-c "wq").
Once you have tested it, to edit the file in-place, change -scq! to -scwq.
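To see what the two steps do to individual lines, here is a rough Python sketch of the same logic: the substitution translated to Python's re syntax, followed by the "drop lines with letters" filter. The function name extract_values and the sample markup are made up for illustration; this is not how ex works internally, only an approximation of the effect.

```python
import re

# Python spelling of the ex pattern ^[^>].*>\([^<]\+\)<.*
# (in Python the group and + need no backslashes).
PATTERN = re.compile(r"^[^>].*>([^<]+)<.*")

def extract_values(html_text):
    """Mimic: %s/pattern/\\1/ followed by g/[a-zA-Z]/d (hypothetical helper)."""
    kept = []
    for line in html_text.splitlines():
        # Step 1: replace the whole matching line with the captured text.
        line = PATTERN.sub(r"\1", line, count=1)
        # Step 2: drop any line that still contains a letter
        # (unmatched tag lines and textual cells both end up here).
        if re.search(r"[a-zA-Z]", line):
            continue
        kept.append(line)
    return kept

# Made-up sample input, one element per line.
sample = """<table>
<tr><td class="n">54</td></tr>
<tr><td class="t">Total</td></tr>
<tr><td>0 (0/0)</td></tr>
</table>"""

print(extract_values(sample))  # ['54', '0 (0/0)']
```

Lines such as <table> do not match the pattern, so they pass through unchanged and are only removed by the second step, which is why the g/[a-zA-Z]/d command is needed.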
Here is another simple example, which removes the style tag from the header and prints the parsed output:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
However, using regex to parse HTML is not advised, so for a long-term approach you should use a proper parser in an appropriate language (such as Python, Perl or PHP DOM).
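As one possible illustration of the parser-based approach, here is a minimal sketch using Python's standard html.parser module. The class name TextExtractor, the helper extract_text and the sample markup are assumptions made for this example; a real page may need more handling.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <style> or <script>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a style/script element

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)

def extract_text(html_text):
    """Return the non-empty text nodes of html_text (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.chunks

# Made-up sample document.
sample = (
    "<html><head><style>body { margin: 0 }</style></head>"
    "<body><table><tr><td>54</td><td>0 (0/0)</td></tr></table></body></html>"
)

print(extract_text(sample))  # ['54', '0 (0/0)']
```

Unlike the regex version, this does not depend on each element sitting on its own line.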
See also: