How do you convert a Word document into very simple HTML in Python?

Question

Every now and then I receive a Word document that I have to display as a web page. I'm currently using Django's flatpages to achieve this by grabbing the HTML content generated by MS Word. The generated HTML is quite messy. Is there a better way, using Python, to generate very simple HTML instead?

score 6 · Accepted Answer · answered Oct 20 '09 at 20:20

A good solution is to upload the document to Google Docs and export the HTML version from there. (There must be an API for that?)

Google Docs does a lot of cleanup on its own; Beautiful Soup can then be used to make any further changes, as appropriate. It is one of the most powerful and elegant HTML parsing libraries around.

This is a known standard practice at news organizations.
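As a rough sketch of the kind of follow-up cleanup described above: the snippet below keeps a small whitelist of tags and drops everything else, including `style` and `class` attributes. It uses the standard library's `html.parser` in place of Beautiful Soup so that it runs without extra installs. The whitelist, the `SimpleHtmlCleaner` class, and the `clean_html` helper are illustrative assumptions, not part of the original answer.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: tag name -> attributes worth keeping on that tag.
ALLOWED = {"p": (), "b": (), "strong": (), "i": (), "em": (), "a": ("href",),
           "ul": (), "ol": (), "li": (), "h1": (), "h2": (), "h3": ()}
# Elements whose contents are dropped entirely (styles, scripts, document head).
SKIPPED = {"style", "script", "head", "title"}


class SimpleHtmlCleaner(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep the text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a SKIPPED element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            kept = "".join(
                ' {}="{}"'.format(name, escape(value or "", quote=True))
                for name, value in attrs if name in ALLOWED[tag]
            )
            self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED and not self.skip_depth:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text from non-whitelisted wrappers such as <span> is kept, re-escaped.
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))


def clean_html(markup):
    cleaner = SimpleHtmlCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.parts)
```

Calling `clean_html(exported_html)` on the exported markup returns a string containing only the whitelisted tags; extend `ALLOWED` if you need tables or images to survive.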

lprsd

But how exactly do you do it from Google Docs? I upload my MSWord doc and choose the convert option - it loses all diagrams – likejudo Mar 03 '12 at 19:43

score 4 · Answer 2 · answered Jan 10 '12 at 18:29

I found this web page: http://www.textfixer.com/html/convert-word-to-html.php

It converts formatted text to simple HTML markup, preserving bold, italics, links and paragraphs, but without adding tags for font sizes and faces. Exactly what I needed to save some time.

DerVO

This is freaking amazing! Works exactly as I'd want it to. – Justin Apr 14 '14 at 22:15

score 3 · Answer 3 · answered Oct 21 '09 at 22:50

My super-simple app WordOff has an API for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:

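# Python 3 version of the original urllib/urllib2 snippet
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def decruft(html):
    # POST the Word-exported HTML to the WordOff API and return the cleaned markup
    data = urlencode({'html': html}).encode('utf-8')
    req = Request('http://wordoff.org/api/clean', data)
    with urlopen(req) as response:
        # assumes the service replies in UTF-8
        return response.read().decode('utf-8')

# save() override, to be defined on your FlatPage model
def save(self, **kwargs):
    if not self.pk: # only de-cruft when content is first added
        self.content = decruft(self.content)
    super(FlatPage, self).save(**kwargs)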

tomd

WordOff is pretty neat at this kind of thing – Steve Jalim Dec 20 '10 at 10:40

If you want to use wordoff locally you can download the module and use its "superClean" method to get the same result: https://raw.github.com/tomdyson/wordoff/master/wordoff.py – Bala Clark Aug 03 '12 at 12:55

Hey tomd, WordOff.org expired already though – fedmich Aug 22 '13 at 02:16

score 2 · Answer 4 · answered Oct 20 '09 at 20:31

It depends on how much formatting and how many images you're dealing with. I do one of two things:

Google Docs: Probably the closest you'll get to the original formatting and usable HTML.
Markdown: Abandon formatting. Paste it into a plain text editor, run it through Markdown and fix the rest by hand.
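For the "abandon formatting" route, here is a minimal standard-library sketch of the simplest thing a Markdown pass does: turning blank-line-separated blocks of plain text into `<p>` elements. It is a stand-in for illustration only, not a Markdown implementation (no headings, lists, emphasis or links), and the `text_to_paragraphs` name is made up for this example.

```python
import re
from html import escape


def text_to_paragraphs(text):
    """Wrap blank-line-separated blocks of plain text in <p> tags."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse internal line breaks and runs of whitespace into single spaces.
        collapsed = " ".join(block.split())
        if collapsed:
            paragraphs.append("<p>{}</p>".format(escape(collapsed, quote=False)))
    return "\n".join(paragraphs)
```

Anything beyond plain paragraphs still has to be fixed by hand or handled by a real Markdown library.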

Chris Amico

How do I get the HTML from Google Doc? Is it the Download as HTML option? – Thierry Lam Oct 20 '09 at 20:44

+1: Word Doc files are *very* hard to work with. Many tools will convert them, including Open Office. Google Docs has a simple API since it's an HTTP web service. – S.Lott Oct 20 '09 at 21:21

MS Word -> HTML is just plain evil. I had a client hand me a 95(!) page word document containing hundreds of 'places to see' and say, "It should be easy to enter this into the database." Arrggghh! I did it and billed him $100/hour for the privilege, but I think I undercharged given the amount of pain. The HTML was flat out the worst I have ever had to work with. – Peter Rowell Oct 21 '09 at 00:14

score 2 · Answer 5 · answered Oct 21 '09 at 02:54

You can also use Abiword/wvWare to convert a Word document to XHTML and then parse it with BeautifulSoup/ElementTree/etc. to preprocess it if you need to. In my experience, Abiword does a pretty good job of converting Word files and produces relatively clean XHTML files.

I should mention that Abiword can be run on the command line, so it's easy to integrate it in an automated process.
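A minimal sketch of driving that command line from Python. It assumes the `abiword` binary is installed and on the PATH, that your build accepts the `--to=html` option, and that the output file is written next to the input with an `.html` extension; check `abiword --help` for your version. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def abiword_command(doc_path, binary="abiword"):
    """Build the argument list for an AbiWord HTML conversion."""
    return [binary, "--to=html", str(doc_path)]


def convert_to_html(doc_path, binary="abiword"):
    """Run AbiWord and return the path where the HTML output is expected."""
    doc_path = Path(doc_path)
    # check=True raises CalledProcessError if AbiWord exits with an error.
    subprocess.run(abiword_command(doc_path, binary), check=True)
    # Assumption: the output lands next to the input, with a new extension.
    return doc_path.with_suffix(".html")
```

The returned path can then be read and handed to whichever parser you use for post-processing.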

edited Oct 21 '09 at 14:04

answered Oct 21 '09 at 02:54

Etienne

score 2 · Answer 6 · edited Feb 04 '16 at 17:23

Word 2010 has the ability to "save as filtered web page". This will eliminate the overwhelming majority of the HTML that Word inserts.

edited Feb 04 '16 at 17:23

Braiam

answered Nov 17 '11 at 21:17

Greg Burdett
