How to use newspaper3k python with offline files

Question

I need to extract articles/news from an HTML file, and the best solution I found is newspaper3k in Python. I am getting a blank result. I've tried a lot of solutions, but I am kind of stuck here.

from newspaper import Article
with open("index.html", 'r', encoding='utf-8') as f:
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    print(article.title)

Results: ''

It should print the text from an article tag inside the HTML file.

score 1 · Accepted Answer · answered Oct 26 '22 at 11:56

Your code looks right.

I'm going to assume the problem is your source. What is in index.html? Can you provide this file or the URL it was extracted from?
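One way to check what is in the file before involving newspaper3k is a quick probe with Python's standard library. The sketch below is only a diagnostic, not newspaper3k's own extraction logic: `TagProbe` and `probe_html` are hypothetical names introduced here, and the sample string is made up for illustration. It reports the length of the HTML, the text of any `<title>` element, and how many `<article>` and `<p>` tags it sees.

```python
from html.parser import HTMLParser


class TagProbe(HTMLParser):
    """Collects <title> text and counts <article> and <p> start tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.counts = {"article": 0, "p": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "title":
            self.in_title = True
        elif tag in self.counts:
            self.counts[tag] += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def probe_html(html):
    """Return a small summary dict describing the given HTML string."""
    probe = TagProbe()
    probe.feed(html)
    probe.close()
    return {"length": len(html), "title": probe.title.strip(), **probe.counts}


# Made-up sample input, for illustration only.
sample = "<html><head><title>Example</title></head><body><article><p>Text</p></article></body></html>"
print(probe_html(sample))

# To probe the real file instead (assumes index.html is in the working directory):
# with open("index.html", "r", encoding="utf-8") as f:
#     print(probe_html(f.read()))
```

If the reported length is 0, or the title is empty and no `<article>` or `<p>` tags are counted, the file itself is the more likely culprit than the parsing code.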

By the way, here is a code sample for reading offline content with newspaper3k. It is taken from my overview document on using newspaper3k.

from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.cnn.com/2020/10/12/health/johnson-coronavirus-vaccine-pause-bn/index.html'
article = Article(base_url, config=config)
article.download()
article.parse()
# Save the downloaded HTML; an explicit encoding avoids platform-dependent defaults
with open('cnn.html', 'w', encoding='utf-8') as fileout:
    fileout.write(article.html)


# Read the HTML file created above
with open("cnn.html", 'r', encoding='utf-8') as f:
    # note the empty URL string
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()

    print(article.title)
    # Output: Johnson & Johnson pauses Covid-19 vaccine trial after 'unexplained illness'

    article_meta_data = article.meta_data

    article_published_date = {value for (key, value) in article_meta_data.items() if key == 'pubdate'}
    print(article_published_date)
    # Output: {'2020-10-13T01:31:25Z'}

    article_author = {value for (key, value) in article_meta_data.items() if key == 'author'}
    print(article_author)
    # Output: {'Maggie Fox, CNN'}

    article_summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
    print(article_summary)
    # Output: {'Johnson&Johnson said its Janssen arm had paused its coronavirus vaccine trial  after an "unexplained illness" in one of the volunteers testing its experimental Covid-19 shot.'}

    article_keywords = {value for (key, value) in article_meta_data.items() if key == 'keywords'}
    print(article_keywords)
    # Output: {"health, Johnson & Johnson pauses Covid-19 vaccine trial after 'unexplained illness' - CNN"}
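The metadata lookups above scan the whole dictionary for one key at a time. Since meta tag keys can differ between sites or change over time, a small helper that tries several candidate keys in order may be more forgiving. This is a minimal sketch, assuming `article.meta_data` behaves like a plain dictionary, as the code above treats it. `first_meta_value` is a hypothetical helper, and the fallback key `'date'` and the sample dictionary are illustrative assumptions, not keys or output confirmed for any particular site.

```python
def first_meta_value(meta_data, candidate_keys):
    """Return the first non-empty value stored under any candidate key, else None."""
    for key in candidate_keys:
        value = meta_data.get(key)
        if value:
            return value
    return None


# Illustrative dictionary only; a real one would come from article.meta_data.
sample_meta = {"author": "Maggie Fox, CNN", "date": "2020-10-13T01:31:25Z"}

# 'pubdate' is the key used above; 'date' is a hypothetical fallback.
print(first_meta_value(sample_meta, ["pubdate", "date"]))
print(first_meta_value(sample_meta, ["author"]))
print(first_meta_value(sample_meta, ["keywords"]))
```

Returning `None` when nothing matches makes a missing key easy to tell apart from an empty set, which is what the set comprehensions produce in that case.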

For testing, I downloaded the cnn.html page from the link you gave me, but the result is the same: an empty string. — Raphael Lima, Oct 26 '22 at 15:36
Interesting. I just tested the code and it worked. The only thing that didn't work was the publish date, which was because the source changed the meta tag key. What is your environment? — Life is complex, Oct 26 '22 at 15:43
I am not a programmer. What do you mean by 'environment'? Windows 7 / VS Code / Python 3.9 — Raphael Lima, Oct 26 '22 at 15:45
Yes, these items are considered your environment. How did you download the cnn.html source? — Life is complex, Oct 26 '22 at 16:26
I used the Chrome browser to save the HTML document inside the Python path. — Raphael Lima, Oct 26 '22 at 17:30
How did you name the html document? I named mine `index_cnn.html` — Life is complex, Oct 26 '22 at 17:37
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/249081/discussion-between-raphael-lima-and-life-is-complex). — Raphael Lima, Oct 26 '22 at 17:56
