I've used spaCy to tokenize the text.
First install spaCy and the spaCy model we will use:
pip install spacy
python -m spacy download en_core_web_sm
It's quite straightforward. We get the web page, concatenate all the text within the <p> elements (ignoring the header and footer), let spaCy do its thang, then remove the non-word tokens before finally giving it to Counter to count the words.
The word counts end up in counts, a collections.Counter. The print calls at the end of the script show the different ways to read from it.
import requests
import bs4
import spacy
from collections import Counter
url = "https://www.marxists.org/subject/art/literature/children/texts/orwell/animal-farm/ch01.htm"
page_content = requests.get(url).content
soup = bs4.BeautifulSoup(page_content, "lxml")
text = ""
for paragraph in soup.find_all("p"):
    # We probably don't want text within the header and footer paragraphs
    if paragraph.attrs.get("class", (None,))[0] in ("title", "footer"):
        continue
    text += paragraph.get_text().lower()  # It's best to keep things in one case
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# Not all tokens are words, so we exclude some
words = tuple(
    token.text
    for token in doc
    if not (token.is_punct or token.is_space or token.is_quote or token.is_bracket)
)
counts = Counter(words)
print("Word count:", len(words)) # Or sum(counts.values())
print("Unique word count:", len(counts))
print("15 most common words:")
for i, (word, count) in enumerate(counts.most_common(15), start=1):
    print(f"{i: >2}. {count: >3} - {word}")
print("The word 'animal' occurs:", counts["animal"])
print("The word 'python' occurs:", counts["python"])
print("All words and their count:")
for word, count in counts.items():
    print(f"{count}, {word}")
Output:
Word count: 2704
Unique word count: 849
15 most common words:
 1. 169 - the
 2.  98 - and
 3.  93 - of
 4.  59 - to
 5.  51 - a
 6.  44 - in
 7.  44 - that
 8.  42 - it
 9.  34 - i
10.  34 - is
11.  33 - was
12.  31 - had
13.  31 - he
14.  27 - you
15.  24 - all
The word 'animal' occurs: 11
The word 'python' occurs: 0
All words and their count:
4, mr
8, jones
93, of
[...]
1, birds
1, jumped
1, perches
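
If you want to try the Counter access patterns without downloading the page or the spaCy model, here is a minimal sketch. The words tuple is a made-up sample standing in for the filtered spaCy tokens, not data from the chapter, and the variable names total, unique, top_two and missing are only for illustration.

```python
from collections import Counter

# Hypothetical sample tokens standing in for the filtered spaCy output
words = ("the", "animal", "the", "farm", "animal", "the")
counts = Counter(words)

total = len(words)               # same value as sum(counts.values())
unique = len(counts)             # number of distinct words
top_two = counts.most_common(2)  # list of (word, count) pairs, highest count first
missing = counts["python"]       # a word that never occurs gives 0, not a KeyError

print(total, unique, top_two, missing)
```

Looking up a missing word returns 0 and does not add the word to the Counter, which is why counts["python"] is safe to call in the script above.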