I need to find all the punctuation characters in the text content of a markup document.
My sample input:
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" ><strong>Kerala unterscheidet</strong> smtp://suriya@edu/tester sich von anderen indischen netftp://suriya@edu Bundesstaaten: Es ist sauberer,
der;Verkehrnichtso.chaotisch, und Kirchen säumen die Straßen. Die Region einmalig machen aber die Backwaters <a href="http://www.cochin.org">www.cochin.org</a><link rel="stylesheet" type="text/css" href="../styles/9783734317873.css"/>
I am using [[:punct:]], but it matches every occurrence in the input, including the punctuation inside tags.
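# Slurp the whole __DATA__ section into one string.
my $text = do { local $/; <DATA> };
# Print each punctuation character with 5 characters of context before and 10 after.
while ($text =~ m/.{5}[[:punct:]].{10}/g) {
    print "L: $&\n";
}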
Output:
k rel="styleshee
type="text/css"
href="../styles
g src="../images
17873_140_1.jpg"
alt="image" cla
s nat&x00FC;rlic
xmlns="http://ww
3.org/1999/xhtml
" xml:lang="de"
ioses:Zeugnis na
x00FC;rlicher Pe
ugnis.nat&x00FC;
But I need to ignore the punctuation inside element tags, that is, in attribute names and their values. How can I list only the punctuation that occurs in the text content?
To be avoided: www.w3.org and "../styles/97
Needs to be found: der;Verkeh and so.chaotisch
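One possible way to express this requirement is to let the regex consume each tag first and throw it away, so that only punctuation outside tags is reported. The sketch below uses the backtracking control verbs (*SKIP)(*FAIL), available since Perl 5.10. The function name punct_outside_tags and the shortened sample string are hypothetical, and the tag pattern <[^>]*> is a simplification that assumes no attribute value contains a literal ">".

```perl
use strict;
use warnings;

# Hypothetical helper: return every punctuation character found outside tags.
sub punct_outside_tags {
    my ($text) = @_;
    my @found;
    # First alternative: match a whole tag, then (*SKIP)(*FAIL) discards it
    # and resumes scanning after the tag.
    # Second alternative: a punctuation character in the text content.
    while ($text =~ /<[^>]*>(*SKIP)(*FAIL)|([[:punct:]])/g) {
        push @found, $1;
    }
    return @found;
}

# Shortened, hypothetical sample based on the input above.
my $sample = '<html xml:lang="de"><strong>Kerala</strong> '
           . 'der;Verkehrnichtso.chaotisch, und '
           . '<a href="http://www.cochin.org">www.cochin.org</a>';

print join(' ', punct_outside_tags($sample)), "\n";   # prints "; . , . ."
```

Nothing is removed from the string, so match offsets still refer to the original input. Note that the dots in the link text www.cochin.org are still reported, because that text is content rather than an attribute value.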
Question updated:
Please do not remove any content or HTML elements to get at the punctuation in the string, since we need the exact line number and column number of each match. If the HTML elements were removed, the column numbers would change.
Could someone help me with this?
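To illustrate the line and column requirement, here is a minimal sketch that leaves the input untouched and derives both numbers from the match offset in the original string. The function name punct_positions and the two-line sample are hypothetical. It assumes 1-based line and column numbers, columns counted in characters (so the text should already be decoded), and the same simplified tag pattern <[^>]*>.

```perl
use strict;
use warnings;

# Hypothetical helper: return [line, column, char] for each punctuation
# character outside tags, measured against the unmodified input.
sub punct_positions {
    my ($text) = @_;
    my @hits;
    while ($text =~ /<[^>]*>(*SKIP)(*FAIL)|([[:punct:]])/g) {
        my $offset = $-[1];                        # 0-based offset of the match
        my $prefix = substr($text, 0, $offset);    # everything before the match
        my $line   = 1 + ($prefix =~ tr/\n//);     # newlines seen so far, plus one
        my $col    = $offset - rindex($prefix, "\n");  # rindex is -1 on line 1
        push @hits, [$line, $col, $1];
    }
    return @hits;
}

# Hypothetical two-line sample with punctuation inside and outside a tag.
my $sample = "<p class=\"a.b\">der;Verkehr\nnicht so.chaotisch</p>\n";

for my $hit (punct_positions($sample)) {
    printf "line %d, column %d: %s\n", @$hit;
}
# prints:
# line 1, column 19: ;
# line 2, column 9: .
```

Recomputing the prefix for every match is quadratic in the worst case; for large files, a running line counter updated between matches would be a natural refinement.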