0

The gist: the small company I work for advertises its products through Google Merchant. We upload the products in an XML file as per Google's requirements.

The problem: manually formatting thousands of products into XML is an arduous task. What I want is a rapid-fire way to convert the relevant information on each product page into formatted XML. I'm looking for a (semi-)automatic way to go from bigHTMLSourceCode --> formattedXML.

If I'm not being clear, imagine wanting to format an Amazon product page into XML. You want the cost, description, weight, etc., arrayed in a certain way, with the appropriate XML tags, etc., and doing so for thousands of products isn't tenable.

I've Googled extensively, but haven't had any luck finding programs that can help with this.

MrT

2 Answers

0

If your HTML is well-formed XHTML, you can probably use XSLT to transform it into the XML you need.

If it is not, there are tools that convert HTML to XML first.

The main alternative would be to use a scripting language that has modules for HTML parsing or web-scraping and modules for writing XML. But that means writing programs/scripts.
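As a rough illustration of that scripting route, here is a minimal sketch that uses only Python's standard library: `html.parser` to pull values out of a page and `xml.etree.ElementTree` to write them out as XML. The `id` values, the sample page, and the output tag names (`item`, `title`, `price`, `weight`) are all made up for the example; your real pages and Google's feed specification decide the actual names.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical mapping: id attribute in the product page -> XML tag to emit.
FIELDS = {
    "product-title": "title",
    "product-price": "price",
    "product-weight": "weight",
}


class ProductParser(HTMLParser):
    """Collects the text of elements whose id appears in FIELDS.

    Assumes the wanted elements contain plain text only (no nested tags).
    """

    def __init__(self):
        super().__init__()
        self.current = None  # XML tag currently being collected, if any
        self.data = {}

    def handle_starttag(self, tag, attrs):
        field = FIELDS.get(dict(attrs).get("id"))
        if field:
            self.current = field
            self.data[field] = ""

    def handle_endtag(self, tag):
        # Any closing tag ends the current field.
        self.current = None

    def handle_data(self, text):
        if self.current:
            self.data[self.current] += text.strip()


def page_to_xml(html):
    """Parse one product page and return an <item> element."""
    parser = ProductParser()
    parser.feed(html)
    item = ET.Element("item")
    for tag, value in parser.data.items():
        ET.SubElement(item, tag).text = value
    return item


# Made-up sample page standing in for real product HTML.
sample = """<html><body>
<h1 id="product-title">Blue Widget</h1>
<span id="product-price">19.99 USD</span>
<span id="product-weight">1.2 kg</span>
</body></html>"""

xml_text = ET.tostring(page_to_xml(sample), encoding="unicode")
print(xml_text)
```

For thousands of products you would loop over the saved pages (or fetch them), append each `item` to one root element, and write the result to a file.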

0

You'll find many success stories with the Python module Beautiful Soup, which is widely recommended for web scraping, the category I would put this task under. (If you suggest solutions based on regular expressions, you'll quickly get reprimanded by the SU and SO users :-) ). It is what I would use to scrape your amazon.com example, and I have used it in other contexts.

If you have some very basic Python experience, you can probably look at examples and quickly have a working solution. If you have general programming experience but not in Python, you can probably do the same with only a little more time.
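To give an idea of what such a solution might look like, here is a small sketch, assuming Beautiful Soup 4 is installed (`pip install beautifulsoup4`). The CSS selectors, the sample pages, and the output tag names are hypothetical; you would replace them with selectors that match your own product pages and with the tags your feed requires.

```python
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

# Hypothetical selectors: output field name -> CSS selector in the page.
SELECTORS = {
    "title": "h1.product-title",
    "price": "span.price",
    "description": "div.description",
}


def scrape_product(html):
    """Return a dict of field -> text for one product page."""
    soup = BeautifulSoup(html, "html.parser")
    product = {}
    for field, selector in SELECTORS.items():
        node = soup.select_one(selector)
        # A missing field becomes None, so inconsistent pages are easy to spot.
        product[field] = node.get_text(" ", strip=True) if node else None
    return product


def build_feed(products):
    """Serialize a list of product dicts into one XML string."""
    root = ET.Element("products")
    for product in products:
        item = ET.SubElement(root, "item")
        for field, value in product.items():
            if value is not None:
                ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up pages; the second one deliberately lacks a description.
pages = [
    '<h1 class="product-title">Blue Widget</h1>'
    '<span class="price">19.99</span>'
    '<div class="description">A sturdy widget.</div>',
    '<h1 class="product-title">Red Widget</h1>'
    '<span class="price">24.50</span>',
]

products = [scrape_product(page) for page in pages]
feed = build_feed(products)
print(feed)
```

Keeping the scraping step and the XML-writing step separate makes it easy to check the extracted dictionaries for missing fields before generating the file.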

(I don't like it when people say "Oh, it is really easy!" when in practice it takes a long time for someone not used to the tool, but I believe Beautiful Soup and Python are a simple and robust solution. If you find a solution that fits you better: great :-) ).


Addendum: what kind of system do you have where all pages are static HTML? Is the data not stored in a database somewhere? I guess not, given your question. If the HTML is not consistent across the product pages, that can pose a problem for any automatic solution.