Given a URL, I have to find the hostname by using a regex.
The URLs can be of varied forms:
http://www.google.com/                            [expected 'www.google.com']
https://www.google.com:2000/                      [expected 'www.google.com']
http://100.1.25.3:8000/foo/bar?abc.php=xxxx+xxxx  [expected '100.1.25.3']
www.google.com                                    [expected 'www.google.com']
10.0.2.2:5000                                     [expected '10.0.2.2']
localhost/                                        [expected 'localhost']
localhost/foo                                     [expected 'localhost']
The closest I could come up with is:
^(?:[^:]+://)*([^:/]+).*
and use the string captured by the first capturing group of the regular expression.
However, a few cases fail:
google.com   [nothing is captured, expected 'google.com']
http://///x  ['http' is captured, expected nothing]
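For reference, here is a minimal sketch of how I apply the pattern and read the first capturing group. It assumes Python's re module purely for illustration (other engines may behave differently), and the helper name extract_host is made up for this example:

```python
import re

# The pattern from above; capturing group 1 is meant to hold the hostname.
HOST_RE = re.compile(r'^(?:[^:]+://)*([^:/]+).*')

def extract_host(url):
    """Return the text of capturing group 1, or None if nothing matches."""
    m = HOST_RE.match(url)
    return m.group(1) if m else None

# Three inputs that work, followed by one of the failing inputs.
samples = [
    "https://www.google.com:2000/",
    "10.0.2.2:5000",
    "localhost/foo",
    "http://///x",
]
for url in samples:
    print(url, "->", extract_host(url))
```

The last sample shows the unwanted behaviour: the scheme group backtracks to zero repetitions, so 'http' ends up in group 1.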
What would be a regex that can cope with these cases?
Please note that:
- I'm not asking what is wrong with my regex. I know where things are wrong, I just can't come up with another regex.
- Solutions only need to reliably extract the hostname, and need not validate it. I validate it later, so if the regex takes out google!com from https://google!com/foo, this is acceptable*.
* ... and probably even desirable, since hostnames can contain Unicode characters (Internationalized Domain Names).
 
     
    