I want to remove HTML comments from HTML text. For example:
<h1>heading</h1> <!-- comment-with-hyphen --> some text <-- con --> more text <hello></hello> more text
should result in:
<h1>heading</h1> some text <-- con --> more text <hello></hello> more text
Don't ignore line breaks: a comment can span several lines. Pass re.DOTALL so that . also matches newline characters:
re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)
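To illustrate why the flag matters, here is a minimal sketch. The sample string is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample: the comment spans two lines.
s = "before <!-- line one\nline two --> after"

# Without re.DOTALL, "." does not match "\n", so the comment is not found.
without_flag = re.sub(r"<!--.*?-->", "", s)

# With re.DOTALL, "." also matches "\n" and the comment is removed.
with_flag = re.sub(r"<!--.*?-->", "", s, flags=re.DOTALL)

print(repr(without_flag))
print(repr(with_flag))
```

Only the second call changes the string; the first one returns the input untouched.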
html = re.sub(r"<!--(.|\s|\n)*?-->", "", html)
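re.sub finds every match of the pattern and replaces it with the second argument. In this case, <!--(.|\s|\n)*?--> matches anything that starts with <!-- and ends with the nearest -->. The . matches any character except a newline, and the \s and \n alternatives cover multi-line comments (\n is already included in \s). The *? repeats the group as few times as possible, so each comment is matched on its own.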
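As a rough check, the pattern can be run against the sample from the question. The second call is an assumed variant, not part of the answer: it also consumes whitespace directly before the comment, since removing only the comment leaves two spaces where the question expects one.

```python
import re

sample = ("<h1>heading</h1> <!-- comment-with-hyphen --> some text "
          "<-- con --> more text <hello></hello> more text")

# The pattern from the answer above; "<-- con -->" lacks the "!" and is kept.
stripped = re.sub(r"<!--(.|\s|\n)*?-->", "", sample)

# Assumed variant: swallow whitespace before the comment as well,
# so no double space is left behind.
trimmed = re.sub(r"\s*<!--.*?-->", "", sample, flags=re.DOTALL)

print(stripped)
print(trimmed)
```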
Finally came up with this option:
re.sub("(<!--.*?-->)", "", t)
Adding the ? makes the quantifier non-greedy, so the match stops at the first --> instead of spanning several comments.
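The difference is easiest to see on a string with two comments. This is a small sketch with a made-up input:

```python
import re

# Hypothetical input with two separate comments.
s = "a <!-- x --> b <!-- y --> c"

# Greedy: ".*" runs to the last "-->", so " b " is removed as well.
greedy = re.sub(r"<!--.*-->", "", s)

# Non-greedy: ".*?" stops at the first "-->", so each comment is removed separately.
lazy = re.sub(r"<!--.*?-->", "", s)

print(repr(greedy))
print(repr(lazy))
```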
Don't use regex. Use an XML parser instead; the one in the standard library is more than sufficient, provided the input is well-formed XML (for example XHTML).
from xml.etree import ElementTree as ET

# The default parser discards comments while building the tree.
tree = ET.parse("comments.html")
ET.dump(tree)  # Dumps to stdout
tree.write("no-comments.html", method="html")  # write() is a method of the parsed tree
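The same idea can be tried on strings instead of files. The following is only a sketch: strip_comments_xml is a hypothetical helper name, and it assumes the input is well-formed XML with a single root element. The stray <-- con --> from the question does not meet that requirement, so the parser raises an error instead of passing it through.

```python
import xml.etree.ElementTree as ET

def strip_comments_xml(markup: str) -> str:
    """Re-serialize well-formed markup without its comments (hypothetical helper)."""
    # ET.fromstring uses the default parser, which discards comments.
    root = ET.fromstring(markup)
    return ET.tostring(root, encoding="unicode", method="html")

well_formed = "<div><h1>heading</h1><!-- a comment --> some text</div>"
print(strip_comments_xml(well_formed))

# "<-- con -->" is not well-formed XML, so parsing fails.
try:
    strip_comments_xml("<div>some text <-- con --> more text</div>")
except ET.ParseError as exc:
    print("parse error:", exc)
```

For markup like the sample in the question, this approach would therefore need the input cleaned up first, or a more lenient HTML parser.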
re.sub("(?s)<!--.+?-->", "", s)
or
re.sub("<!--.+?-->", "", s, flags=re.DOTALL)
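Both spellings set the same flag, so they should behave identically. One detail worth checking: .+? requires at least one character inside the comment, so an empty comment such as <!----> is left in place, while .*? would remove it. A short sketch with a made-up input:

```python
import re

# Hypothetical input: one multi-line comment and one empty comment.
s = "a<!-- multi\nline -->b<!---->c"

inline = re.sub(r"(?s)<!--.+?-->", "", s)                 # inline flag
flagged = re.sub(r"<!--.+?-->", "", s, flags=re.DOTALL)   # flags argument
star = re.sub(r"<!--.*?-->", "", s, flags=re.DOTALL)      # also matches empty comments

print(repr(inline))
print(repr(flagged))
print(repr(star))
```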
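You could try this regex: <![^<]*>. Be aware that it also matches other <!...> constructs such as <!DOCTYPE html>, and that it misses comments containing a < character.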
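Before relying on this pattern, it may help to see how it behaves on a few edge cases. The inputs below are made up for illustration:

```python
import re

pattern = r"<![^<]*>"

# A simple comment is removed as intended.
simple = re.sub(pattern, "", "a <!-- note --> b")

# A doctype declaration also starts with "<!" and is removed too.
doctype = re.sub(pattern, "", "<!DOCTYPE html><p>x</p>")

# A comment containing "<" is not matched at all.
with_lt = re.sub(pattern, "", "<!-- a < b --> text")

# "[^<]*" is greedy, so the match runs to the last ">" before the next "<".
overreach = re.sub(pattern, "", "<!-- a --> x > y <p>z</p>")

for result in (simple, doctype, with_lt, overreach):
    print(repr(result))
```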