How can I append 'index.html' to all links in an HTML file that do not already end with it?
So that, for example, href="http://mysite/" would become href="http://mysite/index.html".
I am not a sed expert, but I think this works:
sed -e "s_\"\(http://[^\"]*\)/index.html\"_\"\1\"_g" \
-e "s_\"\(http://[^\"]*[^/]\)/*\"_\"\1/index.html\"_g"
The first replacement finds URLs already ending in /index.html and deletes this ending.
The second replacement adds the /index.html as required. It deals with cases that end in / and also those that don't.
More than one version of sed exists. I'm using the one that ships with Xcode for OS X.
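A quick way to try the two substitutions is to wrap them in a shell function and pipe a few sample links through it. This is only a sketch: the name append_index and the sample links are made up for illustration, and the script is single-quoted here so the inner double quotes need no escaping.

```shell
# Hypothetical helper: reads HTML on stdin and applies the two substitutions.
append_index() {
  # 1st expression: strip an existing /index.html so every link looks the same.
  # 2nd expression: drop any trailing slashes and append /index.html.
  sed -e 's_"\(http://[^"]*\)/index.html"_"\1"_g' \
      -e 's_"\(http://[^"]*[^/]\)/*"_"\1/index.html"_g'
}

printf '%s\n' \
  '<a href="http://mysite/">home</a>' \
  '<a href="http://mysite/docs">docs</a>' \
  '<a href="http://mysite/docs/index.html">docs index</a>' | append_index
# <a href="http://mysite/index.html">home</a>
# <a href="http://mysite/docs/index.html">docs</a>
# <a href="http://mysite/docs/index.html">docs index</a>
```

Note that this treats every quoted http:// URL as a folder, so a link to a file such as "http://mysite/page.html" would also get /index.html appended.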
What about this:
echo 'href="http://mysite/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://mysite/index.html"
echo 'href="http://www.google.com/"' | awk '/http/ {sub(/\/\"/,"/index.html\"")}1'
href="http://www.google.com/index.html"
For href values ending with /:
sed 's|\(href="http://[^"]*/\)"|\1index.html"|g' YourFile
If there are folder references without a trailing /, you need to specify what counts as a file and what does not (for example, a last path segment containing a dot is a file).
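One way to make the "file or folder" rule concrete is to treat a last path segment that contains a dot as a file and anything else as a folder. The sketch below assumes exactly that rule, plain http:// links inside href="...", and no query strings or fragments; add_index is a hypothetical name, not part of any tool.

```shell
# Hypothetical helper: reads HTML on stdin.
# Assumed rule: a last path segment containing a dot is a file, otherwise a folder.
add_index() {
  # 1) URL ends with "/": append index.html.
  # 2) URL has a path whose last segment has no dot: append /index.html.
  # 3) URL is a bare host with no path at all: append /index.html.
  sed -e 's|\(href="http://[^"]*/\)"|\1index.html"|g' \
      -e 's|\(href="http://[^/"]*/\([^/"]*/\)*[^/."][^/."]*\)"|\1/index.html"|g' \
      -e 's|\(href="http://[^/"]*\)"|\1/index.html"|g'
}

printf '%s\n' \
  '<a href="http://h.example/">' \
  '<a href="http://h.example/docs">' \
  '<a href="http://h.example/page.html">' \
  '<a href="http://h.example">' | add_index
# <a href="http://h.example/index.html">
# <a href="http://h.example/docs/index.html">
# <a href="http://h.example/page.html">
# <a href="http://h.example/index.html">
```

A link such as http://h.example/search?q=1 would be misread as a folder under this rule, so real pages may need a stricter definition.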
In general this is an almost unsolvable problem. If your html is "reasonably well behaved", the following expression searches for things that "look a lot like a URL"; you can see it at work at http://regex101.com/r/bZ9mR8 (this shows the search and replace for several examples; it should work for most others)
((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_@-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\@\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?
The result of the above match should be replaced with
\1/index.html
Unfortunately this requires regex wizardry that is well beyond the rather pedestrian capabilities of sed, so you will have to unleash the power of perl, as follows:
perl -p -e 's/((?:(?:https?|ftp):\/{2})(?:(?:[0-9a-z_@-]+\.)+(?:[0-9a-z]){2,4})?(?:(?:\/(?:[~0-9a-z\#\+\%\@\.\/_-]+))?\/)*(?=\s|\"))(\/)?(index\.html?)?/$1\/index.html/gi'
It looks a bit daunting, I know. But it works. The only problem: if a link already ends in /, the result contains //index.html. You could easily take the output of the above and process it with
sed 's/\/\/index.html/\/index.html/g'
to replace the double slash before index.html with a single slash.
Some examples (several more given in the link above):
http://www.index.com/                          add /index.html
http://ex.com/a/b/"                            add /index.html
http://www.example.com                         add /index.html
http://www.example.com/something               do nothing
http://www.example.com/something/              add /index.html
http://www.example.com/something/index.html    do nothing