Posts: 14 | Thanked: 10 times | Joined on Jan 2010 @ Italy
#37
Here's my little contribution on using regular expressions to parse the URL into protocol, host and path:

Code:
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> p = re.compile(r'(http|https)?:?\/?\/?([a-z]*\.?[a-z]*\.[a-z]*)\/?(.*)')
>>> m = re.search(p,'http://www.nagios.org')
>>> m.group(1)
'http'
>>> m.group(2)
'www.nagios.org'
>>> m.group(3)
''
>>> m = re.search(p,'https://www.nagios.org/foo')
>>> m.group(1)
'https'
>>> m.group(2)
'www.nagios.org'
>>> m.group(3)
'foo'
>>> m = re.search(p,'nagios.org/foo')
>>> m.group(1)
>>> m.group(2)
'nagios.org'
>>> m.group(3)
'foo'
>>>
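There are still some potential issues if the URL is written in uppercase or if a trailing slash is added after a subfolder: http://WWW.NAGIOS.ORG/nagios/ is still not parsed correctly by the above regexp, because the character classes only match lowercase a-z and the trailing slash ends up in the last group. With str.lower() and str.rstrip('/') applied before matching, you get a lowercase string without a trailing /. Note that plain str.strip() with no argument only removes whitespace, so the '/' has to be passed explicitly.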
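To make that concrete, here is a rough sketch of the cleanup step wrapped around the same regexp. The helper name split_url is just something I made up for illustration, and it assumes it is fine to lowercase the whole URL (path included), which may not hold on servers with case-sensitive paths.

```python
import re

# Same pattern as in the session above, written as a raw string.
URL_RE = re.compile(r'(http|https)?:?\/?\/?([a-z]*\.?[a-z]*\.[a-z]*)\/?(.*)')


def split_url(url):
    """Hypothetical helper: return (protocol, host, path), or None if no match."""
    # Trim surrounding whitespace, lowercase everything, drop trailing slashes.
    # Assumption: lowercasing the path is acceptable for the caller.
    cleaned = url.strip().lower().rstrip('/')
    m = URL_RE.search(cleaned)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)


print(split_url('http://WWW.NAGIOS.ORG/nagios/'))
# → ('http', 'www.nagios.org', 'nagios')
```

A string with no dot in it (for example 'localhost') does not match the pattern at all, so the helper returns None in that case instead of raising on m.group().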

Hope this can help
 
