RegEx for matching everything except new lines and a special char

Question

I was working on a homework problem that involves removing all of the HTML tags "<...>" from the text of an HTML document and then counting all of the tokens in that text.

I wrote a solution that works but it all comes down to a single line of code that I didn't actually write and I'm curious to learn more about how this kind of code works.

public static int tagStrip(Scanner in) {
     int count = 0; 

     while(in.hasNextLine()) {
         String line = in.nextLine();

         line = line.replaceAll("<[^>\r\n]*>", "");

         Scanner scan = new Scanner(line);

         while(scan.hasNext()) {
            String word = scan.next();
            count++;
         }
     }
     return count;
}

Line 7 is the one I'm curious about. I understand how the replaceAll() method works. I'm not sure how that String "<[^>\r\n]*>" works. I read a little bit about patterns and messed around with it a bit.
I replaced it with "<[^>]+>" and it still works exactly the same. So I was hoping somebody could explain how these characters work and what they do especially within the construct of this type of program.

`<[^>\r\n]*>` A negated class excludes each of its items, so the match stops if it finds a `>` or a `\r` or a `\n`; basically, it won't span lines. `<[^>]+>` can span lines since `\r\n` is removed. — , May 18 '19 at 20:42
The regex you should be using though is this `<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>` https://regex101.com/r/ZE9Ayg/1 — , May 18 '19 at 20:44

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

RegEx

If you wish to explore or modify your expression, you can do so on regex101.com.

<[^>]+> may not work as intended: [^>] also matches line breaks, so a match could span several lines, which seems to be undesired.
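To make the difference concrete, here is a minimal sketch that runs both patterns from the question over the same text. The class name `TagPatternDemo`, the method names, and the sample input are hypothetical and only serve the illustration.

```java
public class TagPatternDemo {
    // Pattern from the question: a tag body may not contain '>' or a line break.
    static final String SINGLE_LINE_TAG = "<[^>\r\n]*>";
    // Variant tried in the question: only '>' is excluded, and '+' requires
    // at least one character between '<' and '>'.
    static final String ANY_TAG = "<[^>]+>";

    // Removes only tags that open and close on the same line.
    static String stripSingleLine(String text) {
        return text.replaceAll(SINGLE_LINE_TAG, "");
    }

    // Removes tags even when they continue across a line break.
    static String stripAny(String text) {
        return text.replaceAll(ANY_TAG, "");
    }

    public static void main(String[] args) {
        // Hypothetical input: one tag broken across two lines, plus an empty "<>".
        String html = "<a\nhref=\"x\">link</a> <>";
        System.out.println("[" + stripSingleLine(html) + "]");
        System.out.println("[" + stripAny(html) + "]");
    }
}
```

On this input the first pattern leaves the tag that is split across two lines in place but removes the empty `<>`, while the second removes the split tag and keeps `<>`. In the question's program the text is read with `nextLine()`, which returns each line without its line terminator, so the two patterns can only differ there on an empty `<>`.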

RegEx Circuit

You can also visualize your expressions in jex.im.

answered May 18 '19 at 20:45 by Emma · edited Jun 20 '20 at 09:12 by Community
