I have a field in my database that stores input from an HTML input, so the column data contains HTML markup. I need to extract this data and display a short version of it as an intro, maybe even just the first paragraph if possible.
4 Answers
The Html Agility Pack is usually the recommended way to strip out the HTML. After that it would just be a matter of doing a String.Substring to get the bit of it that you want. 
If you need the first 2000 words, you could call IndexOf in a loop to find whitespace 2000 times, then pass the resulting index to Substring.
Edit: Added a sample method.
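public int GetIndex(string str, int numberWanted)
{
    // Counts word starts (a letter, digit or punctuation character that follows
    // whitespace) and returns the index at which word number numberWanted + 1 begins.
    int count = 0;
    int index = 1;
    for (; index < str.Length; index++)
    {
        if (char.IsWhiteSpace(str[index - 1]) &&
            (char.IsLetterOrDigit(str[index]) || char.IsPunctuation(str[index])))
        {
            count++;
            if (count >= numberWanted)
                break;
        }
    }
    // Clamp the result so that an empty string is still safe to pass to Substring.
    return Math.Min(index, str.Length);
}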
And call it like:
string wordList = "This is a list of a lot of words";
int i = GetIndex(wordList, 5);
string result = wordList.Substring(0, i);
 
- I am using the HTML Agility pack and it does strip all the HTML, all I need now would be a code sample to loop thru the string and get the first 2000 words. – Kenyana Jul 05 '10 at 11:18
- @Kenyana: Added a sample method for that with a sample for how to call it. Not sure if it's very efficient and might not count completely correctly but should at least give you an idea. – Hans Olsson Jul 05 '10 at 11:32
- This is my sample code! I have it within a class which seems to strip and add back the html elements to display on page. But it doesn't limit to the words I want. NB: I got that code from another thread on this site. How do I post code samples on this page? – Kenyana Jul 05 '10 at 12:17
- @Kenyana: Doesn't surprise me, I think if you asked a lot of people to do this many would come up with very similar code. Just post the code as text, but prefix each line with 4 spaces. There's a button in the editor that will do it for you if you select all the text first. – Hans Olsson Jul 05 '10 at 12:27
 
Something like this maybe?
    public string Get(string text, int maxWordCount)
    {
        int wordCounter = 0;
        int stringIndex = 0;
        char[] delimiters = new[] { '\n', ' ', ',', '.' };
        while (wordCounter < maxWordCount)
        {
            stringIndex = text.IndexOfAny(delimiters, stringIndex + 1);
            if (stringIndex == -1)
                return text;
            ++wordCounter;
        }
        return text.Substring(0, stringIndex);
    }
It's quite simplified and doesn't handle multiple delimiters that follow one another (for instance ", "), since each one is counted as a word boundary. You might want to use only space as a delimiter.
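The consecutive-delimiter problem can be avoided by counting a word only at the moment the scan leaves it, so a run of separators counts once. The following is a minimal sketch, not the method above: the class name WordTruncation, the method name FirstWords and the choice of whitespace as the only separator are assumptions made for illustration.

```csharp
using System;

public static class WordTruncation
{
    // Hypothetical helper: returns the text up to the end of the
    // maxWordCount-th word, treating any run of whitespace as one separator.
    public static string FirstWords(string text, int maxWordCount)
    {
        if (string.IsNullOrEmpty(text) || maxWordCount <= 0)
            return string.Empty;

        int wordCounter = 0;
        bool insideWord = false;

        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsWhiteSpace(text[i]))
            {
                // Leaving a word: this is the point where it gets counted.
                if (insideWord && ++wordCounter == maxWordCount)
                    return text.Substring(0, i);
                insideWord = false;
            }
            else
            {
                insideWord = true;
            }
        }

        // Fewer than maxWordCount words: return everything.
        return text;
    }
}
```

Punctuation stays attached to the word it follows, so "one, two" counts as two words. Leading whitespace is kept in the result; call Trim() on the input first if that is unwanted.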
If you want just the first paragraph, search for "\r\n\r\n" (two consecutive line breaks):
    public string GetFirstParagraph(string text)
    {
        int pos = text.IndexOf("\r\n\r\n");
        return pos == -1 ? text : text.Substring(0, pos);
    }
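Searching for "\r\n\r\n" only matches Windows-style line endings. A sketch that also accepts "\n" could look like the following; the names ParagraphHelper and FirstParagraph are hypothetical, and it assumes the text still contains a blank line between paragraphs after any HTML has been removed.

```csharp
using System;

public static class ParagraphHelper
{
    // Hypothetical variant: accepts both Windows ("\r\n") and Unix ("\n") line endings.
    public static string FirstParagraph(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        // Normalize line endings so a single search covers both conventions,
        // and drop leading/trailing newlines so they are not mistaken for a break.
        string normalized = text.Replace("\r\n", "\n").Trim('\n');

        int pos = normalized.IndexOf("\n\n", StringComparison.Ordinal);
        return pos == -1 ? normalized : normalized.Substring(0, pos);
    }
}
```

Note that the returned string uses "\n" line endings even if the input used "\r\n".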
Edit:
A very simplistic way to strip HTML:
return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
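For context, that one-liner could be wrapped as below. The class and method names (HtmlStripper, StripTags) are made up for this sketch. Treat it as a rough approximation only: a regex like this can misfire on comments, script blocks, or attribute values containing ">", and it does not decode entities such as &amp;amp;.

```csharp
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Hypothetical wrapper: removes anything that looks like a tag,
    // i.e. the shortest run of characters between '<' and '>'.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        return Regex.Replace(html, @"<(.|\n)*?>", string.Empty);
    }
}
```

For example, StripTags("<p>Hello <b>world</b></p>") returns "Hello world".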
 
I had the same problem and combined a few Stack Overflow answers into this class. It uses the HtmlAgilityPack, which is a better tool for the job than a regex. Call:
 Words(string html, int n)
to get the first n words of the text with the HTML removed.
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace UmbracoUtilities
{
    public class Text
    {
      /// <summary>
      /// Return the first n words in the html
      /// </summary>
      /// <param name="html"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string Words(string html, int n)
      {
        string words = html, n_words;
        words = StripHtml(html);
        n_words = GetNWords(words, n);
        return n_words;
      }
      /// <summary>
      /// Returns the first n words in text
      /// Assumes text is not a html string
      /// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
      /// </summary>
      /// <param name="text"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string GetNWords(string text, int n)
      {
        // Split on any whitespace and drop empty entries, so leading or repeated
        // whitespace does not produce empty "words" that count towards n.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        // Take exactly n words and join them with single spaces.
        return string.Join(" ", words.Take(n));
      }
      /// <summary>
      /// Returns a string of html with tags removed
      /// </summary>
      /// <param name="html"></param>
      /// <returns></returns>
      public static string StripHtml(string html)
      {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);
        var root = document.DocumentNode;
        var stringBuilder = new StringBuilder();
        foreach (var node in root.DescendantsAndSelf())
        {
          if (!node.HasChildNodes)
          {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
              stringBuilder.Append(" " + text.Trim());
          }
        }
        return stringBuilder.ToString();
      }
    }
}
Merry Christmas!
 
Once you have your string you would have to count your words. I assume space is a delimiter for words, so the following code should find the first 2000 words in a string (or break out if there are fewer words).
string myString = "la la la";
int lastPosition = 0;
for (int i = 0; i < 2000; i++)
{
    int position = myString.IndexOf(' ', lastPosition + 1);
    if (position == -1)
    {
        // Fewer than 2000 words: keep the whole string, including the last word.
        lastPosition = myString.Length;
        break;
    }
    lastPosition = position;
}
string firstWords = myString.Substring(0, lastPosition);
You can change IndexOf to IndexOfAny to support more characters as delimiters.