I am using HtmlAgilityPack. Is there a one-liner that returns all the inner text of an HTML document, i.e., with all HTML tags and scripts removed?
    2 Answers
18
Like this:
doc.DocumentNode.InnerText
Note that this will also return the text content of <script> and <style> tags.
To fix that, remove all of the <script> and <style> tags first, like this:
foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();
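
Putting the two steps together, a minimal sketch might look like the following. The class and method names (HtmlText, Extract) are hypothetical and not part of HtmlAgilityPack; the sketch assumes a version of the library that has Descendants(), and it additionally decodes entities with HtmlEntity.DeEntitize, which the answer above does not do.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlText
{
    // Hypothetical helper name; not part of HtmlAgilityPack.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot with ToArray() first, then detach each <script>/<style> element.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToArray();
        foreach (var node in unwanted)
            node.Remove();

        // InnerText leaves entities such as &amp; as written; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Calling HtmlText.Extract(html) then gives the remaining text in one line at the call site.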
 
    
    
        SLaks
        
- It seems that DocumentNode does not have a function named Descendants? "'HtmlAgilityPack.HtmlNode' does not contain a definition for 'Descendants'" – Yang May 06 '10 at 23:22
- HTML Agility Pack V1.3.0.0, is it too old? – Yang May 07 '10 at 01:12
- I used this code to solve a problem of mine. I do have one question, though: how can removal be performed in a foreach loop? – Win Coder Aug 21 '13 at 12:00
- @WinCoder: What do you mean? – SLaks Aug 21 '13 at 13:28
- I mean, isn't it true that the collection a foreach loop is iterating over can't be modified? – Win Coder Aug 21 '13 at 17:06
- @WinCoder: That's why I call `ToArray()` to iterate on a separate copy. – SLaks Aug 21 '13 at 17:59
- OK, but if you are iterating on a separate copy, then how come the original is being modified? – Win Coder Aug 21 '13 at 18:03
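
On that last question, a small sketch that does not use HtmlAgilityPack at all may help. The Node class below is a hypothetical stand-in for HtmlNode: ToArray() creates a new array, but the array holds references to the same node objects that are in the tree, so calling Remove() on one of them changes the tree while the array being enumerated stays untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for HtmlNode: a node that knows its parent and can detach itself.
public class Node
{
    public string Name;
    public Node Parent;
    public List<Node> Children = new List<Node>();

    public Node(string name) { Name = name; }

    // Appends a child and returns this node so calls can be chained.
    public Node Add(Node child)
    {
        child.Parent = this;
        Children.Add(child);
        return this;
    }

    // Detaches this node from its parent's child list.
    public void Remove()
    {
        if (Parent != null) Parent.Children.Remove(this);
        Parent = null;
    }
}

public static class SnapshotDemo
{
    // Removes every child named "script" and returns how many were removed.
    public static int RemoveScripts(Node root)
    {
        // The array is a separate collection, but its elements are the same
        // Node objects that live in root.Children.
        Node[] snapshot = root.Children.Where(c => c.Name == "script").ToArray();
        foreach (Node script in snapshot)
            script.Remove();   // mutates root.Children, not the array being enumerated
        return snapshot.Length;
    }

    // Same loop without the snapshot: returns true if the enumerator throws.
    public static bool RemoveWithoutSnapshotThrows(Node root)
    {
        try
        {
            foreach (Node child in root.Children)
                if (child.Name == "script") child.Remove();
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;   // List<T> detects modification during enumeration
        }
    }
}
```

The second method shows the failure the commenter was expecting: enumerating the live list while removing from it throws, which is the situation the ToArray() call avoids.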
1
            
            
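I wrote a simple method that may help. It collects every node with a given tag name; you can then read nodeCollection[i].InnerText to get each node's text.
    // Requires: using System.Collections.Generic; using HtmlAgilityPack;
    HtmlDocument hDoc = new HtmlDocument();
    // Nodes collected so far. A plain list is used here so that no parent node
    // has to be supplied when the collection is created.
    List<HtmlNode> nodeCollection;
    public void InitInstance(string htmlCode) {
        hDoc.LoadHtml(htmlCode);
        nodeCollection = new List<HtmlNode>();
    }
    private void GetAllNodesInnerTextByTagName(HtmlNode node, string tagName) {
        if (null == node.ChildNodes) {
            return;
        }
        // tagName is used as a relative XPath, so this matches direct children of node.
        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nCollection = node.SelectNodes(tagName);
        if (null != nCollection) {
            for (int i = 0; i < nCollection.Count; i++) {
                nodeCollection.Add(nCollection[i]);
                // Detach the match so the recursion below does not visit it again.
                nCollection[i].Remove();
            }
        }
        // Recurse into the remaining children.
        nCollection = node.ChildNodes;
        if (null != nCollection) {
            for (int i = 0; i < nCollection.Count; i++) {
                GetAllNodesInnerTextByTagName(nCollection[i], tagName);
            }
        }
    }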
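
As an alternative to the recursive walk above, the same lookup can probably be expressed as a single XPath query, which may also help on versions where Descendants is unavailable. This is only a sketch: the names TagText and InnerTextByTagName are hypothetical, and unlike the method above it does not remove the matched nodes, so the text of a nested match also appears inside the text of its matching ancestor.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TagText
{
    // Hypothetical helper; returns the InnerText of every <tagName> element in the document.
    public static List<string> InnerTextByTagName(string html, string tagName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();
        // "//" searches the whole document, not only direct children.
        HtmlNodeCollection matches = doc.DocumentNode.SelectNodes("//" + tagName);
        if (matches == null)   // SelectNodes returns null by default when nothing matches
            return texts;

        foreach (HtmlNode match in matches)
            texts.Add(match.InnerText);
        return texts;
    }
}
```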
 
    
    
        Leniel Maccaferri
        
        tsingroo
        