I'm trying to find a way to reliably locate and replace < and > symbols within an HTML/XML formatted string that do not belong to tags.
Basically, I start with an HTML string and convert it into something usable by PDFLib, which uses a form of XML to describe documents to be written as PDFs. However, if there is a < within the content, PDFLib sees it as the opening of a tag and throws a parse exception.
Example input:
<p>This is a test where 6 < 9</p>
<p>This is part of <strong>The same test</strong></p>
<p>This should also work 6<99999</p>
The text surrounding the < is not always numbers. It is user-entered and could be just about anything, such as Grade<C, Blue<Red<Green or Test < Test2.
Required output:
This is a test where 6 <charref fontname=Helvetica encoding=unicode>&lt;<resetfont> 9\n
This is part of <fontname=Helvetica fontstyle=bold encoding=unicode>The same test<resetfont>\n
This should also work 6<charref fontname=Helvetica encoding=unicode>&lt;<resetfont>99999\n
I've tried str_replace and preg_replace, but can't find a solution that will reliably leave the tags alone and replace just the < that appears as content.
Parsing the DOM also seems to fail, as DOMDocument sees the < as an opening tag as well.
Using htmlspecialchars on the string converts the < and > of the tags themselves into &lt; and &gt; as well, which is no good.
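One direction that might be worth exploring, as a minimal sketch only: if the set of tags that can legitimately appear is known in advance, the string can be split on those tags with preg_split and only the text in between escaped. The function name escapeStrayAngles, the whitelist ['p', 'strong'] and the &lt; / &gt; replacement strings are all assumptions made for illustration, and the sketch assumes PHP 7.4 or later. It would still misfire if a user literally types a whitelisted tag, so it is not a complete answer.

```php
<?php
/**
 * Escape < and > that are not part of a whitelisted tag.
 * Sketch only: the whitelist and the replacement strings are assumptions.
 */
function escapeStrayAngles(
    string $html,
    array $allowedTags,
    string $lt = '&lt;',
    string $gt = '&gt;'
): string {
    // Build an alternation such as "p|strong" from the whitelist.
    $names = implode('|', array_map(fn($t) => preg_quote($t, '~'), $allowedTags));

    // A tag is "<" or "</", a whitelisted name, optional attributes,
    // an optional self-closing slash, then ">".
    $tagPattern = '~(</?(?:' . $names . ')(?:\s[^<>]*)?/?>)~i';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // even indexes are plain text, odd indexes are the matched tags.
    $parts = preg_split($tagPattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Only text segments are escaped; tags pass through untouched.
            $parts[$i] = strtr($part, ['<' => $lt, '>' => $gt]);
        }
    }

    return implode('', $parts);
}

echo escapeStrayAngles('<p>This is a test where 6 < 9</p>', ['p', 'strong']), "\n";
// prints: <p>This is a test where 6 &lt; 9</p>
echo escapeStrayAngles('<p>Grade<C, Blue<Red<Green</p>', ['p', 'strong']), "\n";
// prints: <p>Grade&lt;C, Blue&lt;Red&lt;Green</p>
```

The $lt and $gt arguments are there so that the PDFLib charref sequence could be passed in instead of a bare entity, and the tag-to-PDFLib conversion could then run as a separate step on the tags that were left alone.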
Does anyone have any ideas?