I want to download all the text from a particular website, save the results in MS Word, Excel, or Notepad, and find out which words are repeated most often and how many times.
2 Answers
This can be tricky, because you have to download the HTML before you can get at the text. Luckily, the problem is already solved: use Wget. Download it (including Windows binaries) here and read the manual here.
I've given you the manual anchor for the "--accept" option, which limits the types of files saved. You'll need to combine it with --mirror, and maybe one of the maximum-depth options (such as --level). Look at "--span-hosts" if you get less information than you need.
I think that answers the question as posed. If you want help counting words (or converting Word/Excel files to text programmatically), that's probably a new question.
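As a rough illustration of how those options fit together, here is a small Python sketch that only assembles a Wget command line and prints it, so you can inspect it before running anything. The helper name build_wget_command, the example URL, the depth of 2, and the choice of "html,htm" as accepted suffixes are all assumptions made for the example, not recommendations from the Wget manual. It also assumes that a --level given after --mirror caps the recursion depth, which is worth checking against the manual for your Wget version.

```python
import shlex


def build_wget_command(url, accept=("html", "htm"), depth=2, span_hosts=False):
    """Assemble (but do not run) a wget command for fetching a site's pages.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    cmd = [
        "wget",
        "--mirror",                       # recursive download with timestamping
        f"--level={depth}",               # intended to cap the recursion depth
        f"--accept={','.join(accept)}",   # keep only files with these suffixes
    ]
    if span_hosts:
        # Follow links that lead to other host names as well.
        cmd.append("--span-hosts")
    cmd.append(url)
    return cmd


command = build_wget_command("https://example.com/", depth=2)
# shlex.join produces a string that can be pasted into a POSIX shell.
print(shlex.join(command))
```

Once the printed command looks right, it could be run by hand or passed to subprocess.run(command, check=True), provided wget is installed and on the PATH.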
You can use PowerShell to download the page, then use an HTML parser to extract the text. The PowerShell command to download a web page is:
Invoke-WebRequest https://google.com -OutFile C:/Users/JohnDoe/Desktop/google.html
That saves an HTML file named "google.html" on your desktop (after you change JohnDoe to your Windows user name). You can then run an HTML parser on it. Wikipedia has a comparison of HTML parsers: http://en.wikipedia.org/wiki/Comparison_of_HTML_parsers
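For the parsing and counting step, one possible approach is sketched below in Python, using only the standard library's html.parser and collections.Counter. This is a minimal sketch rather than a robust solution: the class name TextExtractor, the function count_words, and the inline sample page are made up for illustration, and the word pattern assumes plain English text (ASCII letters, with an optional apostrophe as in "don't").

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from a page, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def count_words(html):
    """Return a Counter mapping each lowercased word to its frequency."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Assumption: words are runs of a-z, optionally with one apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return Counter(words)


# Made-up sample page standing in for a downloaded file.
sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<p>The cat and the dog.</p><script>var the = 1;</script>"
    "<p>The end</p></body></html>"
)
counts = count_words(sample)
for word, n in counts.most_common(3):
    print(word, n)
```

To run it on the saved file instead of the sample, read the file first, for example with open(path, encoding="utf-8", errors="replace").read(), where path points at the downloaded google.html, and pass the result to count_words. The output of most_common can be written to a text or CSV file, which Notepad or Excel will open.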