I have a Clojure program that consumes a large amount of heap while running (I once measured it at around 2.8 GiB), and I'm trying to reduce its memory footprint. My current plan is to force garbage collection every so often. I've read "How to force garbage collection in Java?" and "Can I Force Garbage Collection in Java?" and understand how to do it (just call (System/gc)), but I don't know whether it's a good idea, or even whether it's needed.
Here's how the program works. I have a large number of documents in a legacy format that I'm trying to convert to HTML. The legacy format consists of several XML files: a metadata file that describes the document and contains links to any number of content files (usually one, but sometimes several; for example, some documents have "main" content and footnotes in separate files). The conversion takes anywhere from a few milliseconds for the smallest documents to about 58 seconds for the largest. Basically, I'm writing a glorified XSLT processor, though in a much nicer language than XSLT.
My current (rather naïve) approach, written when I was just starting out in Clojure, builds a list of all the metadata files, then does the following:
(let [parsed-trees (map parse metadata-files)]
  (dorun (map work-func parsed-trees)))
work-func converts the files to HTML and writes the result to disk, returning nil. (I was trying to throw away the parsed XML tree for each document, which is quite large, after each pass through a single document.) I now realize that although map is lazy and dorun throws away the head of the sequence it iterates over, I was still holding onto the head of the seq in parsed-trees, so the trees that had already been processed stayed reachable and could not be collected. That is why this approach failed.
My new plan is to move the parsing into work-func, so that it will look like:
(defn work-func [metadata-filename]
  (-> metadata-filename
      e/parse
      xml-to-html
      write-html-file)
  (System/gc))
Then I can call work-func with map, or possibly pmap since I have two dual-core CPUs, and hopefully throw away the large XML trees after each document is processed.
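Since map and pmap are both lazy, something still has to force the sequence for the side effects to happen. Here is a minimal sketch of how I might drive work-func. The names process-all and process-all-parallel are made up for illustration, and the work-func below is only a stand-in stub that records its argument so the snippet runs on its own; the real one would parse, convert, and write the file.

```clojure
;; Hypothetical stub standing in for the real work-func: it records the
;; filename instead of parsing and converting, so this sketch is runnable.
(def processed (atom []))

(defn work-func [metadata-filename]
  (swap! processed conj metadata-filename)
  nil)

(defn process-all
  "Processes the files one at a time. doseq is eager and builds no
  result sequence, so it suits functions called only for side effects."
  [metadata-files]
  (doseq [f metadata-files]
    (work-func f)))

(defn process-all-parallel
  "Same idea with pmap. pmap is lazy too, so dorun forces it while
  discarding the nil results."
  [metadata-files]
  (dorun (pmap work-func metadata-files)))
```

In both versions only filenames are held in the input collection, so each parsed tree should become unreachable as soon as its call to work-func returns.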
My question, though, is: is it a good idea to be telling Java "please clean up after me" so often? Or should I just skip the (System/gc) call in work-func, and let the Java garbage collector run when it feels the need to? My gut says to keep the call in, because I know (as Java can't) that at that point in work-func, there is going to be a large amount of data on the heap that can be gotten rid of, but I would welcome input from more experienced Java and/or Clojure coders.
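To check whether the explicit call changes anything, I could log heap usage around each document. This is only a rough sketch: used-heap-mib and with-heap-report are names I made up, the figures come from java.lang.Runtime and are approximate, and as far as I understand (System/gc) is only a request that the JVM may ignore.

```clojure
(defn used-heap-mib
  "Approximate heap currently in use, in MiB, as reported by the JVM."
  []
  (let [rt (Runtime/getRuntime)]
    (/ (- (.totalMemory rt) (.freeMemory rt))
       1024.0 1024.0)))

(defn with-heap-report
  "Calls the zero-argument function f, prints used heap before and
  after the call, and returns whatever f returned."
  [label f]
  (let [before (used-heap-mib)
        result (f)
        after  (used-heap-mib)]
    (println (format "%s: %.1f MiB -> %.1f MiB" label before after))
    result))
```

Wrapping each call, for example (with-heap-report filename #(work-func filename)), once with and once without the (System/gc) line in work-func would show whether the forced collection makes a measurable difference.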