I'm trying to extract paragraphs from HTML using the following line of code:
paragraphs = re.match(r'<p>.{1,}</p>', html)
but it returns None even though I know the HTML contains paragraphs. Why?
Why not use an HTML parser to, well, parse HTML? Example using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <p>text1</p>
... <p></p>
... <p>text2</p>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> [p.get_text() for p in soup.find_all("p", text=True)]
[u'text1', u'text2']
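Note that text=True matches only <p> tags that contain text, which filters out empty paragraphs (in BeautifulSoup 4.4.0 and later the same argument is spelled string=True).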
Make sure you use re.search (or re.findall) instead of re.match, which only matches at the very beginning of the string (your html almost certainly does not begin with a <p> tag).
Also note that your pattern is currently greedy, meaning a match runs from the first <p> tag to the last </p> tag it can reach, which is almost certainly not what you want. In addition, . does not match newlines unless re.DOTALL is set, so paragraphs that span several lines are missed. Try
re.findall(r'<p(\s.*?)?>(.*?)</p>', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
instead. The question mark after .* makes the quantifier lazy, so each match stops at the first closing </p> tag, and findall returns every match whereas search returns only the first. Because the pattern contains two groups, each result is a tuple of (attributes, text).
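To make the greedy vs. lazy difference concrete, here is a small sketch. The sample html string is made up for illustration (the original input isn't shown in the question), and the greedy pattern is a DOTALL variant of the one from the question.

```python
import re

# Hypothetical sample input; the real `html` from the question is not shown.
html = "<div><p>text1</p><p class='note'>line one\nline two</p></div>"

# Greedy: .+ grabs as much as possible, so a single match spans
# from the first <p> to the last </p>.
greedy = re.findall(r'<p>.+</p>', html, flags=re.DOTALL)

# Lazy: .*? stops at the first </p>, giving one match per paragraph.
# Two groups in the pattern means findall returns (attributes, text) tuples.
lazy = re.findall(r'<p(\s.*?)?>(.*?)</p>', html,
                  flags=re.IGNORECASE | re.DOTALL)

texts = [text for _attrs, text in lazy]

print(len(greedy))  # 1
print(texts)        # ['text1', 'line one\nline two']
```

If only the paragraph text is needed, unpacking the tuples as above is probably the simplest way to get it.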
You should be using re.search instead of re.match. The former will search the entire string whereas the latter will only match if the pattern is at the beginning of the string.
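A quick way to see the difference, using a made-up one-line html string (an assumption, since the original input isn't shown):

```python
import re

pattern = r'<p>.{1,}</p>'

# Hypothetical input that starts with <div>, not <p>.
html = "<div><p>text1</p></div>"

at_start = re.match(pattern, html)    # anchored at position 0 -> no match
anywhere = re.search(pattern, html)   # scans the whole string

print(at_start)           # None
print(anywhere.group(0))  # <p>text1</p>

# re.match only anchors the start; it does not require the pattern
# to consume the whole string.
prefix = re.match(pattern, "<p>text1</p></div>")
print(prefix.group(0))    # <p>text1</p>
```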
That said, regular expressions are a horrible tool for parsing HTML. You will hit a wall with them very shortly. I strongly recommend you look at HTMLParser (html.parser in Python 3) or BeautifulSoup for your task.
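For the standard-library route, something along these lines should work. This is only a rough sketch: ParagraphCollector is a name I made up, the sample data is invented, and nested tags inside a <p> are handled only in the sense that their text gets concatenated.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each non-empty <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# Hypothetical sample input.
data = "<div>\n<p>text1</p>\n<p></p>\n<p>text2</p>\n</div>"

collector = ParagraphCollector()
collector.feed(data)
collector.close()
print(collector.paragraphs)  # ['text1', 'text2']
```

It is more code than the regex, but it does not care about attributes, letter case, or line breaks inside the tags.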