How to find punctuation in a string using Perl

Question

I need to find all the punctuation characters in the text content of a markup document.

My sample input:

__DATA__

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" ><strong>Kerala unterscheidet</strong> smtp://suriya@edu/tester sich von anderen indischen netftp://suriya@edu Bundesstaaten: Es ist sauberer, der;Verkehr nicht so.chaotisch, und Kirchen säumen die Straßen. Die Region einmalig machen aber die Backwaters <a href="http://www.cochin.org">www.cochin.org</a><link rel="stylesheet" type="text/css" href="../styles/9783734317873.css"/>

I am using [[:punct:]], but it matches every occurrence in the input, including punctuation inside tags and attribute values.

my $text = do { local $/; <DATA> };

while($text=~m/(.){5}[[:punct:]](.){10}/g)
{
    print "L: $&\n";
}

Output

k rel="styleshee  
 type="text/css"
 href="../styles
g src="../images
17873_140_1.jpg"
 alt="image" cla
s nat&x00FC;rlic
xmlns="http://ww
3.org/1999/xhtml
" xml:lang="de"
ioses:Zeugnis na
x00FC;rlicher Pe
ugnis.nat&x00FC;

However, I need to skip punctuation inside element attributes and their values. How can I list only the punctuation that appears in the text content?

To be avoided: www.w3.org and "../styles/97
Needs to be found: der;Verkeh and so.chaotisch

Question updated:

Please do not remove any content or HTML elements in order to find the punctuation, because I need the exact line number and column number of each match. If the HTML elements are removed, the column numbers change.

Could someone help me with this?

Is there an example related to the question, please? I have not used this module before. — ssr1012, Feb 09 '20 at 15:32

score 2 · Answer 1 · answered Feb 10 '20 at 09:31

There is a great answer explaining why you shouldn't try to parse HTML with a regex: https://stackoverflow.com/a/1732454/939457

You can use HTML::Parse and HTML::FormatText to extract the text:

 perl -MHTML::Parse -MHTML::FormatText -0777 -ne \
    'print HTML::FormatText->new->format(parse_html($_))' sample.txt

You will get only the text:

Kerala unterscheidet smtp://suriya@edu/tester sich von anderen indischen
   netftp://suriya@edu Bundesstaaten: Es ist sauberer, der;Verkehr nicht
   so.chaotisch, und Kirchen säumen die Straßen. Die Region einmalig
   machen aber die Backwaters www.cochin.org

Then you can use your original code. Something like this should work:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parse;
use HTML::FormatText;

my $text = do { local $/; <DATA> };

$text = HTML::FormatText->new(leftmargin=>0, rightmargin=>100000000000)->format(parse_html($text));

while($text=~m/(.){5}[[:punct:]](.){10}/g)
{
        print "L: $&\n";
}

__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" ><strong>Kerala unterscheidet</strong> smtp://suriya@edu/tester sich von anderen indischen netftp://suriya@edu Bundesstaaten: Es ist sauberer, der;Verkehr nicht so.chaotisch, und Kirchen säumen die Straßen. Die Region einmalig machen aber die Backwaters <a href="http://www.cochin.org">www.cochin.org</a><link rel="stylesheet" type="text/css" href="../styles/9783734317873.css"/>

Note: leftmargin and rightmargin are set to prevent the line wrapping that HTML::FormatText would otherwise apply.
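If the original line and column numbers must be preserved, as the updated question requires, one possible approach is to leave the input untouched and search a copy in which every tag has been overwritten with spaces of the same length. Offsets in the copy then equal offsets in the original. The following is only a minimal sketch using core Perl: the helper names blank_out, mask_tags and find_punct and the short sample string are made up for illustration, and the tag pattern <[^>]*> is a deliberate simplification that will misfire on comments, CDATA, script content, or a ">" inside an attribute value. Entities in the text (for example &amp;) will also be reported, since they contain punctuation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every character except newlines with a space,
# so that both line counts and column positions are preserved.
sub blank_out {
    my ($s) = @_;
    $s =~ tr/\n/ /c;
    return $s;
}

# Hypothetical helper: return a copy of the input with each tag blanked out.
# Assumption: a tag is anything matching <[^>]*>, which is only approximate.
sub mask_tags {
    my ($html) = @_;
    (my $masked = $html) =~ s{(<[^>]*>)}{ blank_out($1) }ge;
    return $masked;
}

# Hypothetical helper: return [line, column, char] for every punctuation
# character outside tags. Lines and columns are both 1-based here.
sub find_punct {
    my ($html) = @_;
    my $masked = mask_tags($html);
    my @hits;
    while ($masked =~ /([[:punct:]])/g) {
        my $offset = $-[1];                          # start offset of the match
        my $before = substr($masked, 0, $offset);
        my $line   = 1 + ($before =~ tr/\n//);       # newlines seen so far
        my $col    = $offset - rindex($before, "\n"); # rindex gives -1 on line 1
        push @hits, [$line, $col, $1];
    }
    return @hits;
}

# Made-up two-line sample based on fragments of the question's input.
my $html = qq{<p class="a.b">der;Verkehr nicht\n}
         . qq{<a href="http://www.cochin.org">so.chaotisch</a>, und</p>\n};

printf "%d:%d %s\n", @$_ for find_punct($html);
```

For the sample above this reports the ";" in der;Verkehr, the "." in so.chaotisch and the following ",", while the punctuation inside class="a.b" and the href value is skipped. For real-world HTML, a proper parser that reports positions for text events is likely to be more robust than this regex-based masking.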

Thanks for your answer. I cannot remove any of the input content, because I need to report the line and column number of each match. If content is removed, the column numbers reported for the punctuation no longer match the original input. — ssr1012, Feb 10 '20 at 12:52
Please specify such irregular conditions in your question, so people do not waste time trying to help. — Sorin, Feb 10 '20 at 13:43
