I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string.
For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml.
On the other hand, if my string contained only a plain URL without HTML tags, this answer recommends using a regular expression.
I wasn't able to find a good solution for the case where my string contains both an HTML-encoded URL and a plain URL. Here is some example code:
import lxml.html
example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
dom = lxml.html.fromstring(example_data)
for link in dom.xpath('//a/@href'):
    print("Found Link: ", link)
As expected, this results in:
Found Link:  http://www.some-random-domain.com/abc123/def.html
I also tried the twitter-text-python library that @Yannisp mentioned, but it doesn't seem to extract both URLs:
>>> from ttp.ttp import Parser
>>> p = Parser()
>>> r = p.parse(example_data)
>>> r.urls
['http://www.another-random-domain.com/xyz.html']
What is the best approach for extracting both kinds of URLs from a string containing a mix of HTML and non-HTML-encoded data? Is there a good module that already does this? Or am I forced to combine a regex with BeautifulSoup/lxml?
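For what it's worth, here is the kind of two-pass workaround I'm considering, sketched with only the standard library so it runs without lxml. The names `HrefCollector` and `extract_urls` are just mine for illustration, and the regex is deliberately naive, so it will surely miss edge cases:

```python
import re
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_urls(text):
    # Pass 1: pull hrefs out of any HTML anchor tags.
    collector = HrefCollector()
    collector.feed(text)
    urls = list(collector.hrefs)
    # Pass 2: a simple regex for bare URLs; stops at whitespace,
    # quotes, and angle brackets, then de-duplicates against pass 1.
    for match in re.findall(r'https?://[^\s"<>]+', text):
        if match not in urls:
            urls.append(match)
    return urls

example_data = """<a href="http://www.some-random-domain.com/abc123/def.html">Click Me!</a>
http://www.another-random-domain.com/xyz.html"""
print(extract_urls(example_data))
```

On my example data this prints both URLs, but I suspect the regex pass is fragile (trailing punctuation, URLs inside comments or scripts, etc.), which is why I'm hoping there is an existing module that handles this properly.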