Locating a < symbol in HTML that isn't part of a tag

Question

I'm trying to find a way to reliably locate and replace < and > symbols within an HTML/XML formatted string that do not belong to tags.

Basically I start with an HTML string and convert it into something usable by PDFLib, which uses a form of XML to describe documents to be written as PDFs. However, if there is a < within the content, it sees it as the opening of a tag and throws a parse exception.

Example input:

<p>This is a test where 6 < 9</p>
<p>This is part of <strong>The same test</strong></p>
<p>This should also work 6<99999</p>

The text surrounding the < is not always numbers. It is user-entered and could be anything, such as Grade<C, Blue<Red<Green, Test < Test2, or just about anything really.

Required output

This is a test where 6 <charref fontname=Helvetica encoding=unicode>&lt;<resetfont> 9\n
This is part of <fontname=Helvetica fontstyle=bold encoding=unicode>The same test<resetfont>\n
This should also work 6<charref fontname=Helvetica encoding=unicode>&lt;<resetfont>99999\n

I've tried str_replace and preg_replace, but can't find a solution that will reliably leave the tags alone and replace just the < in context.

Parsing the DOM also seems to fail, as DOMDocument sees the < as an opening tag as well.

Using htmlspecialchars on the string converts all the tags' <> into &lt;&gt; as well, which is no good.

Does anyone have any ideas?

You should use `&lt;` and `&gt;` in the rendered HTML. Why can't you do it this way? — cheesemacfly, May 29 '13 at 15:27
Try the answer to this question: http://stackoverflow.com/questions/3797100/how-to-repair-malformed-xml — StampyCode, May 29 '13 at 15:31
@cheesemacfly because it won't be rendered HTML. It's going to be converted into a form of XML and used to generate a PDF — fullybaked, May 29 '13 at 15:35
Why can't you replace `&lt;` with `<` once you have used `htmlspecialchars`? I might be missing something but I don't see the issue. — cheesemacfly, May 29 '13 at 15:46
@cheesemacfly `<strong>6 < 7</strong>` becomes `&lt;strong&gt;6 &lt; 7&lt;/strong&gt;`, which when replacing the `&lt;` with the PDFLib code breaks the `strong` tag completely — fullybaked, May 29 '13 at 16:00
OK, I see your point. I thought you were running `htmlspecialchars` only on the content and not on the whole HTML — cheesemacfly, May 29 '13 at 16:02
@TimStamp put that as an answer if you want and I'll accept it. I used the Tidy lib to parse the html first and it worked. Thanks — fullybaked, May 29 '13 at 16:30

score 1 · Answer 1 · answered May 29 '13 at 15:30

Try reading the string character by character from the start. When you encounter a <, push it into a buffer. If a > is then found without an intervening space, it is a tag. Otherwise, if you encounter another < first, mark the previous one as &lt; and put the new one in the buffer. Repeat until the end of the string.
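A minimal sketch of this kind of scan, under some assumptions: the helper name `escapeStrayAngles` is hypothetical, a tag is assumed to start with `<` or `</` followed directly by a letter, and quoted attribute values are skipped so that a `>` inside them does not end the tag. This is a variant of the rule described above rather than an exact transcription, and it is a heuristic: user text such as `Grade<C>` would still be mistaken for a tag.

```php
<?php
// Hypothetical helper: escape "<" and ">" that do not belong to a tag.
function escapeStrayAngles(string $html): string
{
    $out = '';
    $len = strlen($html);
    $i = 0;
    while ($i < $len) {
        $ch = $html[$i];
        if ($ch === '>') {
            // A ">" seen outside a tag is plain text.
            $out .= '&gt;';
            $i++;
            continue;
        }
        if ($ch !== '<') {
            $out .= $ch;
            $i++;
            continue;
        }
        // Candidate tag: "<" or "</" followed directly by a letter.
        $j = $i + 1;
        if ($j < $len && $html[$j] === '/') {
            $j++;
        }
        $end = -1;
        if ($j < $len && ctype_alpha($html[$j])) {
            $quote = '';
            for ($k = $j; $k < $len; $k++) {
                $c = $html[$k];
                if ($quote !== '') {
                    // Inside a quoted attribute value: wait for the closing quote.
                    if ($c === $quote) {
                        $quote = '';
                    }
                } elseif ($c === '"' || $c === "'") {
                    $quote = $c;
                } elseif ($c === '<') {
                    break; // another "<" came first, so the earlier one was text
                } elseif ($c === '>') {
                    $end = $k;
                    break;
                }
            }
        }
        if ($end === -1) {
            $out .= '&lt;'; // not a tag: escape the single character
            $i++;
        } else {
            $out .= substr($html, $i, $end - $i + 1); // copy the tag unchanged
            $i = $end + 1;
        }
    }
    return $out;
}

echo escapeStrayAngles('<p>This is a test where 6 < 9</p>'), "\n";
// prints: <p>This is a test where 6 &lt; 9</p>
```

Once the stray characters are entities, the remaining tags can be converted to the PDFLib markup without the `&lt;` being confused for a tag opener.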

Originative

you'll also have to push quotes into the stack, to avoid situations like `` – StampyCode May 29 '13 at 15:34
Tried this out and it did work, but the `Tidy` suggestion by @TimStamp was a cleaner solution – fullybaked May 29 '13 at 16:30

score 1 · Accepted Answer · edited May 23 '17 at 12:28

Try using the answer from this question:

how to repair malformed xml (http://stackoverflow.com/questions/3797100/how-to-repair-malformed-xml)

I tried to add this as it stands, but StackOverflow requires me to add some description to the answer, or it automatically gets converted into a comment, which can't be accepted as an answer.
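The asker reported in the comments that running the HTML through Tidy first worked. A minimal sketch of that idea follows, assuming the PHP tidy extension is installed. The helper name `repairWithTidy` and the chosen options are assumptions for illustration, not the asker's actual code.

```php
<?php
// Hypothetical helper: let Tidy repair the fragment so that a stray "<"
// in text content comes back as an entity.
function repairWithTidy(string $html): ?string
{
    // The tidy extension is optional in PHP, so bail out if it is missing.
    if (!function_exists('tidy_repair_string')) {
        return null;
    }
    $config = [
        'show-body-only' => true, // return only the fragment, not a full document
        'wrap' => 0,              // do not re-wrap long lines
    ];
    $result = tidy_repair_string($html, $config, 'utf8');
    return $result === false ? null : $result;
}

echo repairWithTidy('<p>This is a test where 6 < 9</p>');
```

Tidy decides what counts as a tag with its own rules, so input where a letter directly follows the `<` (for example `Grade<C`) may still be read as markup. It is worth checking the output against real user data before relying on it.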

answered May 29 '13 at 16:39

StampyCode

score 0 · Answer 3 · answered May 29 '13 at 16:10

While it's no longer maintained, I think the PHP port of html5lib is probably your best bet for parsing bad markup.

A simple call like this:

require_once 'your-path-to-html5lib/Parser.php';
$dom = HTML5_Parser::parse($input);

will take bad markup in $input and return a valid PHP DOMDocument.

From there you can save it back to a string with $dom->saveHTML() or $dom->saveXML(), or extract the bits you want with the DOM API.

Note that this will produce a full HTML document with head and body etc. even if your original data didn't include that.

If you just want to parse an HTML fragment, you can do:

$dom = HTML5_Parser::parseFragment($input);

which will return a DOMNodeList.

score 0 · Answer 4 · answered May 29 '13 at 16:19

HTML entities are the best way to do such things. &lt; and &gt; are the entities used to replace < and > in HTML, even inside the <code> tag. You can use these entities, and replace them with <> in your HTML tags. See www.w3schools.com/html/html_entities.asp

lakshya_arora
