6

Lets say I have a sitemap.xml file with this data:

<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag2</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag3</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>

I want to extract all the locations from it (data between <loc> and </loc>).

Sample output be like:

http://domain.com/pag1
http://domain.com/pag2
http://domain.com/pag3

How to do this?

6 Answers6

9

If you're on a Linux box or something with the grep tool, you can just run:

grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml

bobmagoo
  • 844
2

You can use python script here

This script get any links started with http

import re

f = open('sitemap.xml','r')
res = f.readlines()
for d in res:
    data = re.findall('>(http:\/\/.+)<',d)
    for i in data:
        print i

And in your case next script find all data wraped in tags

import re

f = open('sitemap.xml','r')
res = f.readlines()
for d in res:
    data = re.findall('<loc>(http:\/\/.+)<\/loc>',d)
    for i in data:
        print i

Here nice tool to play with regexp if you not familiar with it.

if you need to load remote file you can use next code

import urllib2 as ur
import re

f = ur.urlopen(u'http://server.com/sitemap.xml')
res = f.readlines()
for d in res:
  data = re.findall('<loc>(http:\/\/.+)<\/loc>',d)
  for i in data:
    print i
2

This could be accomplished by a single sed command, which seems to be more solid than the grep solution:

sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile

(found at: linuxquestions.org)

LarS
  • 245
1

Using XSLT, you can render it out with XPath

/url/loc
Siva Charan
  • 4,959
0

The XSLT solution:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:s="http://www.sitemaps.org/schemas/sitemap/0.9">

  <xsl:output method="text" />

  <xsl:template match="s:url">
    <xsl:value-of select="s:loc" />
    <xsl:text>
</xsl:text>
  </xsl:template>

</xsl:stylesheet>
Jan Tomka
  • 101
0

You can open your sitemap.xml file in Notepad++.

Then in the menu Search → Replace (CTRL+H) specify:

Find what: </loc>.*?<loc>

Replace with: \r\n

Set Search mode to Regular Expression

and then click Replace All button.

Additionally you can sort the links via the Menu Edit → Line Operations → Sort Lines in Ascending Order

Stano
  • 151
  • 3