I have a web page that contains a <td> tag, for example:
<td>Aug 17, 2017 02:00 PM EDT</td>
I'm trying to use a regex to find content in the page matching this format: a comma, then a space, four digits, a space, two digits, a colon, two digits, a space, two capital letters, a space, and three capital letters. This is just to make sure I always target that date and don't accidentally get something else.
I don't think another instance of that format would ever occur, but if it did, I'd want the first one. I guess I could just take index [0] of the returned list to be sure I get the correct date.
I have the following regex so far:
(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)
So, in Python code:
import re
date = re.findall(r'(?=\,\s\d{4}\s\d{2}\:\d{2}\s[A-Z]{2}\s[A-Z]{3})(.*)(?=\<\/td)', page)
print(date[0])
This gets me close, but not quite all the way. It gets me
, 2017 02:00 PM EDT
Whereas I need
Aug 17, 2017 02:00 PM EDT
But I can't figure out how to extend the regex to grab all of the td. Thanks for any help!
(btw, Python 3)
Edit: adding the decode step that produces page:
page = response.read().decode('utf-8')
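For reference, here is a rough sketch of the behaviour I'm after. The lookahead in my pattern only asserts that the ", 2017 02:00 PM EDT" part follows at the current position, so (.*) starts capturing at the comma and "Aug 17" is never included. Anchoring on the opening <td> and describing the whole date inside one capture group seems like one way around that. The assumptions here are mine: the month is a three-letter abbreviation, the day is one or two digits, the <td> has no attributes, and the sample page string and the first_date helper are hypothetical stand-ins.

```python
import re

# Hypothetical sample standing in for the decoded page.
page = "<tr><td>Name</td><td>Aug 17, 2017 02:00 PM EDT</td></tr>"

# Assumed format: three-letter month, 1-2 digit day, comma, 4-digit year,
# hh:mm, AM or PM, and a three-letter time zone, all inside a plain <td>.
DATE_RE = re.compile(
    r"<td>\s*([A-Z][a-z]{2}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s[AP]M\s[A-Z]{3})\s*</td>"
)

def first_date(html):
    """Return the first date-like <td> content, or None if nothing matches."""
    match = DATE_RE.search(html)  # search() stops at the first match
    return match.group(1) if match else None

print(first_date(page))
```

Using re.search instead of re.findall(...)[0] returns only the first match and avoids an IndexError when the page contains no such date.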