I am working on a project that evaluates web pages at a large scale (millions of pages). Unlike a typical web crawler, my evaluator needs to build a JDOM object for each page it evaluates in order to run XPath queries against it.
Our current implementation uses SAXBuilder to instantiate JDOM objects (some are then cached for potential reuse) and then queries XPaths against them. However, simple profiling shows that instantiation consumes far more time than the XPath queries themselves, so I am looking for alternative solutions. Is there any way to reduce this overhead, for example by:
- Creating "lean" JDOM objects that hold only minimal structural information about the page? (See the first sketch after this list.)
- Evaluating XPaths without building a JDOM object at all? (See the second sketch after this list.)
- Initialising JDOM objects faster by re-using an object built for a "similar" page?
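
For (1), one possible angle, assuming your XPath patterns only match elements and attributes and never text() nodes, is to discard character data at the SAX level via SAXBuilder's XMLFilter hook, so the tree JDOM builds is leaner. A rough sketch (StructureOnlyFilter is a made-up name, not a JDOM class):

// Drops character data before it reaches JDOM, so the resulting tree
// holds elements and attributes only. Only safe if none of your XPath
// patterns needs text() nodes.
class StructureOnlyFilter extends org.xml.sax.helpers.XMLFilterImpl {
    @Override
    public void characters(char[] ch, int start, int length) {
        // intentionally not forwarded
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // intentionally not forwarded
    }
}

SAXBuilder leanBuilder = new SAXBuilder();
leanBuilder.setXMLFilter(new StructureOnlyFilter());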
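
For (2), the standard javax.xml.xpath API can evaluate an expression directly against an InputSource, so no JDOM tree is involved. The caveat is that most implementations still build a W3C DOM internally, so whether this actually saves time needs benchmarking. A sketch using the same html/pattern parameters as the method below (evaluateWithoutJDOM is a hypothetical helper):

import java.io.File;
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

static NodeList evaluateWithoutJDOM(File html, String pattern) throws XPathExpressionException {
    // note: javax.xml.xpath.XPathFactory, not org.jdom2.xpath.XPathFactory
    XPath xp = XPathFactory.newInstance().newXPath();
    // ...set a NamespaceContext here if the patterns use prefixes...
    XPathExpression expr = xp.compile(pattern);
    // evaluate straight from the file; no JDOM object is built
    return (NodeList) expr.evaluate(
            new InputSource(html.toURI().toString()), XPathConstants.NODESET);
}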
 
EDIT:
We are using JDOM 2.x.
A sample of the way we initialize a JDOM object:
public static List<Element> evaluateHTML(File html, String pattern) throws JDOMException, IOException {
    // building the JDOM tree is the step our profiling flags as dominant
    Element page = saxBuilder.build(html).getRootElement();
    XPathBuilder<Element> xpath = new XPathBuilder<>(pattern, Filters.element());
    //...set xpath namespaces...
    XPathExpression<Element> expr = xpath.compileWith(xpathFactory);
    return expr.evaluate(page);
}
Here saxBuilder and xpathFactory are static members, and evaluateHTML is invoked once for every HTML file we evaluate.
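
One detail visible in this sample: although xpathFactory is static, the XPathBuilder is recreated and the expression recompiled on every call. Per the profiling above that is not the dominant cost, but hoisting it out is cheap. A sketch, assuming patterns repeat across pages (compiledCache and compiled are names I picked, not JDOM API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

private static final Map<String, XPathExpression<Element>> compiledCache =
        new ConcurrentHashMap<>();

private static XPathExpression<Element> compiled(String pattern) {
    // compile each distinct pattern once and reuse it for every page
    return compiledCache.computeIfAbsent(pattern, p -> {
        XPathBuilder<Element> builder = new XPathBuilder<>(p, Filters.element());
        //...set xpath namespaces...
        return builder.compileWith(xpathFactory);
    });
}

evaluateHTML then reduces to compiled(pattern).evaluate(saxBuilder.build(html).getRootElement()).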