I am trying to download a site for offline viewing, which requires a number of DOM manipulations (trust me, wget just does not do what I need).
I am finding that pages containing tags with unusual text content throw saveHTML off.
For a given URL, if I use curl to read the page and output it with
echo $contents;
then all is well.
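For reference, the fetch step looks roughly like the sketch below. The helper name fetchPage is a placeholder chosen for this example, not my exact code, and the redirect option is an assumption about what the site needs.

```php
<?php
// Hypothetical helper: fetch a page body with cURL and return it as a string.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects (assumed to be wanted)

    $contents = curl_exec($ch);
    if ($contents === false) {
        // Surface the cURL error instead of silently echoing nothing.
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    return $contents;
}
```

With that in place, the working case is simply $contents = fetchPage($url); followed by echo $contents;, where $url stands for whichever page is being saved.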
For instance, there is a section of the page containing the following source:
<div id="area2516" class="component interaction_component float-none clear-none ">
    <div id="area2516">
        <script type="text/javascript">
            window.bm = window.bm || {};
            bm.data = bm.data || [];
            bm.data['area2516'] = {};
        </script>
        <link rel="stylesheet" type="text/css" href="/somecss.css">
        <script type="text/javascript" src="somejs.js">
        </script>
    <script class="main-template" type="text/x-handlebars-template">
            <div class="content_area">
                <div class="bg_image cf"></div>
                    {{#each rollovers}}
                <div class="rollover_content" style="left: {{x}}; top: {{y}}; display: none;" data-rollover-id="{{id}}">
                    {{{this.content}}}
                </div>
                {{/each}}
                </div>
                <div class="rollover_links">
                    <ul>
                        {{#each rollovers}}
                        <li>
                            <a class="rollover_link" href="#" data-rollover-id="{{id}}">
                                {{{link}}}
                            </a>
                        </li>
                        {{/each}}
                    </ul>
                </div>
        </script>
        <script type="text/javascript">
            bm.data['area2516'].assets = {};
            bm.data['area2516'].initial_json = '';
        </script>
This is what the echo of the curl response shows.
Now, if I do this:
$doc = new DOMDocument();
@$doc->loadHTML($contents);
$xpath = new DOMXPath($doc);
echo $doc->saveHTML();
the HTML gets mangled, and the section above becomes this:
<div id="area2516" class="component interaction_component float-none clear-none ">
<div id="area2516">
    <script type="text/javascript">
        window.bm = window.bm || {};
        bm.data = bm.data || [];
        bm.data['area2516'] = {};
    </script>
    <link rel="stylesheet" type="text/css" href="/somecss.css">
    <script type="text/javascript" src="/somejs.js"></script>
    <script class="main-template" type="text/x-handlebars-template">
        <div class="content_area">
            <div class="bg_image cf">
    </script>
            </div>
            {{#each rollovers}}
            <div class="rollover_content" style="left: {{x}}; top: {{y}}; display: none;" data-rollover-id="{{id}}">
              {{{this.content}}}
            </div>
          {{/each}}
        </div>
        <div class="rollover_links">
          <ul>
            {{#each rollovers}}
              <li>
                <a class="rollover_link" href="#" data-rollover-id="{{id}}">
                  {{{link}}}
                </a>
              </li>
            {{/each}}
          </ul></div>
<script type="text/javascript">
        bm.data['area2516'].assets = {};
        bm.data['area2516'].initial_json = '';
      </script>
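To make this easier to reproduce without the real site, here is a reduced, self-contained sketch. The markup in $contents is a shortened stand-in for the page above, and I am assuming the exact output may differ between libxml2 versions, so no expected output is shown.

```php
<?php
// Reduced stand-in for the real page: a Handlebars template inside a <script>.
$contents = <<<'HTML'
<!DOCTYPE html>
<html><body>
<div id="area2516">
<script class="main-template" type="text/x-handlebars-template">
    <div class="content_area">
        <div class="bg_image cf"></div>
        {{#each rollovers}}
        <div class="rollover_content" data-rollover-id="{{id}}">{{{this.content}}}</div>
        {{/each}}
    </div>
</script>
</div>
</body></html>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($contents);
$output = $doc->saveHTML();

libxml_clear_errors();

// Compare where the template <script> ends here with where it ends in $contents.
echo $output;
```

If the closing script tag shows up right after the first div inside the template, that is the same breakage as in the full page.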
Sorry about the formatting. The point is that there are some major differences: most visibly, the template script tag is now closed right after the first div inside it, and the rest of the template ends up outside the script. I am not sure how saveHTML is causing this modification to the source. I suspect it has something to do with encoding and the peculiarity of the double and triple braces used by the templating system, but despite trying various encoding parameters, I get the same result. Then I thought it might have something to do with special characters or escaping, but I am not sure what function(s) are needed to stop saveHTML from messing up the output.
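One direction I am considering, offered as a sketch and not a confirmed fix: the output above suggests the parser ends the template script at the first closing tag inside it, so I could escape every </ inside Handlebars template scripts before loadHTML and undo that after saveHTML. The helper names protectScriptTemplates and restoreScriptTemplates are made up for this example, and the regex assumes the type attribute is written exactly as in the page above, with double quotes.

```php
<?php
// Assumed pattern: a <script> whose type is text/x-handlebars-template,
// captured as (opening tag)(body)(closing tag).
const TEMPLATE_SCRIPT_PATTERN =
    '~(<script\b[^>]*type="text/x-handlebars-template"[^>]*>)(.*?)(</script>)~is';

// Hypothetical helper: turn "</" into "<\/" inside template scripts so that a
// parser which stops at the first "</" no longer sees one in the body.
function protectScriptTemplates(string $html): string
{
    $result = preg_replace_callback(
        TEMPLATE_SCRIPT_PATTERN,
        function (array $m): string {
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    return $result ?? $html; // fall back to the input if the regex engine fails
}

// Hypothetical helper: reverse the escaping after serialization.
function restoreScriptTemplates(string $html): string
{
    $result = preg_replace_callback(
        TEMPLATE_SCRIPT_PATTERN,
        function (array $m): string {
            return $m[1] . str_replace('<\/', '</', $m[2]) . $m[3];
        },
        $html
    );

    return $result ?? $html;
}
```

Usage would be roughly $doc->loadHTML(protectScriptTemplates($contents)); followed by echo restoreScriptTemplates($doc->saveHTML());. Whether saveHTML leaves the escaped sequence untouched may depend on the libxml2 version, so I would treat this as something to verify.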
Ideas?
Thanks