
My goal is to convert this document (https://galileo.phys.virginia.edu/classes/252/lorentztrans.html), which contains math, into a Word document with well-formatted equations.

Why only Microsoft Word, you ask? I am teaching myself physics from these lecture notes. I make all my notes in OneNote (on my iPad, with handwritten equations and hand-drawn diagrams using the Apple Pencil). The thing is, OneNote uses the same equation system as Microsoft Word. If the document is converted into Word, it is effectively converted into OneNote.

I have tried everything I could find by Googling. The following methods did not work.

Method 1: Copy-pasting MathML into MS Word. This works for some simple equations I found elsewhere, but strangely it doesn't work for any equation from this website. I suspect there is something unusual about this website's MathML.

Method 2: Converting from HTML to docx using pandoc. I saved only the HTML of this page, then ran pandoc -s input.html -o output.docx. It skipped all the equations.

Method 3: Copy-pasting directly into MS Word and Apache OpenOffice Writer.

I don't mind converting into an intermediate format first and then converting that into Word.

NOTE: I am looking for an automatic solution because I need to do this for hundreds of pages. The author has written his lecture notes on various topics in this format.

claws
  • 4,649

1 Answer


The math tags in the document look like this:

<math xmlns='//www.w3.org/1998/Math/MathML' style='background-color:#'>
 <semantics>
  <mi>v</mi>
 </semantics>
</math>

The XML namespace is given as a protocol-relative URI, i.e., it starts with //. This is not correct: an XML namespace is compared as a literal string rather than resolved as a URL, so it must include the http: scheme, like so: http://www.w3.org/1998/Math/MathML.

Pandoc gets confused by this as well: since the markup isn't valid MathML, pandoc doesn't recognize it as an equation. It works well once the http: prefix is added. The solution is therefore to do a search-and-replace in the input HTML document, fixing the xmlns attribute, and then pass the fixed result to pandoc.
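
Since you need this for hundreds of pages, the search-and-replace step could be scripted. Here is a minimal sketch in Python, assuming the saved pages are UTF-8 encoded and that pandoc is on your PATH. The function names fix_mathml_namespace and convert_to_docx are made up for illustration, and I have only checked the namespace form shown above, so other pages may need a looser pattern.

```python
import re
import subprocess

# Matches xmlns='//www.w3.org/1998/Math/MathML' with single or double quotes.
# An already-correct value starts with "http:" right after the quote,
# so it is not matched and is left untouched.
_BROKEN_NS = re.compile(r"""(xmlns\s*=\s*['"])//www\.w3\.org/1998/Math/MathML(['"])""")


def fix_mathml_namespace(html: str) -> str:
    """Add the missing http: scheme to protocol-relative MathML namespaces."""
    return _BROKEN_NS.sub(r"\1http://www.w3.org/1998/Math/MathML\2", html)


def convert_to_docx(input_path: str, output_path: str) -> None:
    """Fix the namespace in one saved HTML page and hand the result to pandoc.

    Assumption: the file is UTF-8. Pandoc reads the fixed HTML from stdin,
    so the original file on disk is not modified.
    """
    with open(input_path, encoding="utf-8") as f:
        fixed = fix_mathml_namespace(f.read())
    subprocess.run(
        ["pandoc", "-s", "-f", "html", "-o", output_path],
        input=fixed.encode("utf-8"),
        check=True,  # raise if pandoc exits with an error
    )
```

Calling convert_to_docx("input.html", "output.docx") would then stand in for the plain pandoc command from the question, and a loop over the saved files would cover the remaining pages.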

tarleb
  • 456