I am looking for a way to extract text from a web page (originally HTML) using the JDK or another library. Please help.
Thanks.
Use jsoup. This is currently the most elegant library for screen scraping.
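// Requires java.net.URL, org.jsoup.Jsoup and org.jsoup.nodes.Document.
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3 * 1000); // fetch and parse, with a 3-second timeout
String title = doc.title();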
I just love its CSS selector syntax.
 
Use an HTML parser if at all possible; there are many available for Java.
Or you can use regex like many people do. This is generally not advisable, however, unless you're doing very simplistic processing.
Text extraction:
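As a rough illustration of parser-based text extraction using only the JDK, here is a minimal sketch built on the Swing HTML parser (javax.swing.text.html). The class name HtmlTextExtractor and the method name extractText are made up for this example, and the sample HTML is invented. Treat it as one possible approach, not a full-featured extractor.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative only.
final class HtmlTextExtractor {

    // Collects the text nodes reported by the JDK's Swing HTML parser.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate text nodes with a space so adjacent words do not run together.
                text.append(data).append(' ');
            }
        };
        try {
            // The third argument (true) tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse runs of whitespace into single spaces.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The Swing parser dates from the HTML 3.2 era, so a dedicated library such as jsoup will likely cope better with modern pages.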
Tag stripping:
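For very simple input, regex-based tag stripping might look like the sketch below. The class name TagStripper and the method name strip are made up for this example. It assumes reasonably well-formed markup, does not decode entities such as &amp;amp;, and can be fooled by a ">" inside an attribute value, which is part of why a real parser is usually the better choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class TagStripper {

    // <script> and <style> elements, including their contents.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // HTML comments, possibly spanning several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");

    // Any remaining tag: "<", then everything up to the next ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = COMMENT.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Collapse the whitespace left behind by the removed markup.
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello, <b>world</b>!</p>"));
    }
}
```

Each tag is replaced with a space rather than an empty string so that text from neighbouring elements does not run together. The cost is an occasional stray space before punctuation.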
 
Here's a short method that downloads a page's raw HTML as a string (based on java.util.Scanner):
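// Requires java.net.URL and java.util.Scanner.
public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the Scanner and the underlying stream.
   // UTF-8 is an assumption; pages in other encodings need a different charset.
   try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
      while (sc.hasNextLine())
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}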
And this is how it is used:
public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}
