I have some badly formatted XML that I must parse. Fixing the problem upstream is not possible.
The (current) problem is that ampersand characters are not always escaped properly, so I need to convert a bare & into &amp;.
If &amp; is already there, I don't want to change it to &amp;amp;. In general, if any well-formed entity is already there, I don't want to destroy it. I don't think that it's possible, in general, to know all entities that could appear in any particular XML document, so I want a solution where anything like &<characters>; is preserved.
Here, <characters> is a placeholder for whatever characters define the entity between the initial & and the closing ;. In particular, the < and > are not literal characters and do not denote an XML element.
Now, when parsing, if I see &<characters>, I don't know whether I'll run into a ;, a space, an end-of-line, or another &. So I think that I have to remember <characters> as I look ahead for a character that will tell me what to do with the original &.
I think that I need the power of a pushdown automaton (PDA) to do this. I don't think that a finite state machine will work, because of what I think is a memory requirement. Is that correct? If I need a PDA, then a regular expression in a call to String.replaceAll(String, String) won't work. Or is there a Java regex that can solve this problem?
Remember: there could be multiple replacements per line.
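To make the desired behavior concrete, here is a minimal sketch of the kind of replacement I have in mind, using a negative lookahead. It rests on assumptions of mine: that an entity is either a name (a letter or underscore followed by letters, digits, dots, underscores, or hyphens) or a numeric reference such as &#38; or &#x26;, always terminated by ;. The class and method names (AmpersandFixer, escapeBareAmpersands) are made up for illustration, and I am not certain this covers every case.

```java
import java.util.regex.Pattern;

class AmpersandFixer {
    // Assumption: a well-formed entity is '&', then a name or a numeric
    // character reference, then ';'. Any '&' NOT followed by that shape
    // is treated as a bare ampersand.
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces every bare '&' with "&amp;" and leaves entity-shaped
    // sequences such as "&amp;" or "&foo;" untouched.
    static String escapeBareAmpersands(String input) {
        return BARE_AMP.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        // Several replacements on one line, mixed with existing entities.
        System.out.println(escapeBareAmpersands("a & b &amp; c &lt; d &foo e && f"));
    }
}
```

The lookahead only inspects the characters after the & without consuming them, so each & is judged independently and several bare ampersands on the same line are all replaced.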
(I'm aware of this question, but it does not provide the answer that I am looking for.)