
I have a large XML file (about 37 MB—it’s the underlying XML file in a Word document of about 350 pages) that I am trying to search through with XPath. I’m doing this ‘manually’ rather than programmatically, by opening the file in an XML editor and searching there.

I’ve tried Xmplify, QXmlEdit, Sublime Text with the XPath extension, and so on, and they all suffer from the same problem: just opening the file is ridiculously slow and hogs an awful lot of memory, and running an XPath search is nigh impossible.

As an example, I just tried opening the file in Xmplify. That took about three minutes, and with no other documents open Xmplify’s memory usage rose to about 1 GB.

Then I tried to perform this XPath query (I’m looking for all tracked insertions consisting of the string ‘en’):

//w:ins[w:r/w:t = 'en']

That gave me a SPOD (spinning pinwheel of death) for a good while. After about 15 minutes at around 100% CPU, Xmplify was using 60 GB of memory, and my OS was telling me that I had run out of application memory and needed to start force-quitting stuff.

That seems rather excessive to me for a single XPath query on a single file, even a fairly big one. The other applications I tried were not as egregiously bad, but opening the document and running any kind of XPath query still took minutes, and their memory usage also ran into the gigabytes, so it’s not just Xmplify being inefficient.

What is the reason for this? Why is XPath (apparently) so resource-intensive? Does it differ between OSes (mine is macOS Sierra)?

 

I debated whether to post this here or on Stack Overflow, but since I’m specifically not doing this programmatically, I decided this was probably the better place. Feel free to migrate it if there’s a better Stack for it.

1 Answer


One major factor is that your XPath starts with //, so every element in the whole document has to be checked against the predicate you gave; it is short for /descendant-or-self::node()/w:ins[…]. If you can narrow down which elements could be relevant, e.g. by providing an absolute path such as /w:document/w:body/w:p/w:ins[w:r/w:t = 'en'] (in WordprocessingML, w:ins sits directly under the paragraph element w:p), this will most likely speed up your search massively. Even if you cannot name every hierarchy level and still need a // step somewhere, it is most likely still much faster than searching from the root; see the comparison below.
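For illustration, here is the same query in progressively narrower forms. The element nesting is assumed from standard WordprocessingML, where tracked insertions (w:ins) are direct children of paragraphs (w:p); the comments use XPath 2.0 comment syntax.

(: scans every node in the document :)
//w:ins[w:r/w:t = 'en']

(: still uses //, but only below the document body :)
/w:document/w:body//w:ins[w:r/w:t = 'en']

(: fully absolute: only w:ins children of paragraphs are inspected :)
/w:document/w:body/w:p/w:ins[w:r/w:t = 'en']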

FYI, Notepad++ with its XML plugin also offers limited XPath support, so you may want to try that one.

FYI, if you have access to an inubit installation (being an enterprise service bus / business process engine, it is licensed to organisations such as companies rather than to individuals), it is IMHO worth a shot. Change the Workbench setting Editor options > Max file size to 40 MB so your file is rendered properly, then load the XML file into the XML editor and run an XPath-based search.

FYI, Saxon is rather fast in my experience, so it may be worth creating an XSLT stylesheet for the search. You can call it e.g. via java.exe -jar saxon-he-10.5.jar -xsl:searchScript.xsl -s:input.xml -o:output.xml, and within searchScript.xsl provide a simple transformation that copies every matching element to output.xml together with its XPath location (if the output gets too big, just remove location="{fn:path()}"), e.g.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
                xmlns:fn="http://www.w3.org/2005/xpath-functions"
                exclude-result-prefixes="#all"
                version="3.0">
  <!-- indent the report so it stays readable -->
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <results>
      <!-- copy every matching tracked insertion, recording where it was found -->
      <xsl:for-each select="//w:ins[w:r/w:t = 'en']">
        <result location="{fn:path()}">
          <xsl:copy-of select="."/>
        </result>
      </xsl:for-each>
    </results>
  </xsl:template>
</xsl:stylesheet>
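If writing a stylesheet feels like overkill, the same search can also be expressed as a one-off XQuery and run with Saxon's Query command line. This is only a sketch: it assumes the same jar and file names as above and simply wraps the hits in a results element so the output is well-formed.

java -cp saxon-he-10.5.jar net.sf.saxon.Query -s:input.xml -o:output.xml -qs:"declare namespace w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'; <results>{ /w:document/w:body/w:p/w:ins[w:r/w:t = 'en'] }</results>"

Saxon still builds the whole document tree in memory, but its internal representation is compact, so a 37 MB file should be comfortably within reach compared to what the GUI editors were doing.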