How ChatGPT’s responses change as top news sites from five countries block it

We analyse how top news brands in Brazil, France, South Africa, the UK and the US approach the chatbot and the possible impact on responses
A smartphone with a displayed ChatGPT logo is placed on a computer motherboard in this illustration taken February 23, 2023. REUTERS/Dado Ruvic/Illustration

3rd November 2023

As the debate around the use of news content to train AI models heats up, many news publishers have started blocking ChatGPT from scanning their sites and potentially using their pages for future content. 

According to an open-source survey run by journalist Ben Welsh for his blog palewire, almost half of the 1,148 news publishers surveyed have blocked at least one AI crawler from their sites. The blocked crawlers include those of Google AI, the non-profit Common Crawl and OpenAI, the company behind ChatGPT.

Several voices have criticised these blocking options for not going far enough. A good example is ‘Google-Extended’, a new option that the company says allows publishers to control how their content is used for Google’s AI products. However, Google told SEO publication Search Engine Land that Google-Extended does not stop publishers’ content from appearing in Google’s Search Generative Experience, a search feature that uses generative AI to add a short text alongside search results to provide extra context.

Despite concerns over the use of news content for generative AI, not all news sites block these crawlers.

Among those that block OpenAI are several high-profile news organisations based in the UK and US, such as the BBC, the New York Times, CNN and Reuters. But blocking is not as widespread in other parts of the world. According to a spreadsheet coordinated by SEO consultant Gary Kirwan, which logs the AI crawler bots blocked by news sites from several countries, only one of Spain’s large media outlets, OK Diario, does so.

I wanted to see which news sites were blocking the OpenAI crawler bot and how the chatbot reacted to being asked for content from a blocked website. There are two OpenAI crawler bots currently active: GPTBot and ChatGPT-User. GPTBot is the most widely known – and the most widely blocked. According to OpenAI, the difference is that ChatGPT-User “will only be used to take direct actions on behalf of ChatGPT users and is not used for crawling the web in any automatic fashion.” 

I asked ChatGPT to give me the latest headlines from five online news sources in each of five countries: the UK, the US, France, Brazil and South Africa. I chose a diverse set of countries and, for each, selected the five online news brands with the highest weekly reach according to our Digital News Report 2023. I then checked whether each brand blocked OpenAI’s crawler bots by looking up its robots.txt file. It's important to stress that this is not a piece of academic research and that this analysis only reflects the situation at the time of this writing.
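A robots.txt check of this kind can also be done programmatically. The following is a minimal sketch, not the method used for this analysis, which relied on reading the files by hand. It uses the robots.txt parser in Python's standard library. The function name and the two sample files are invented for illustration and do not come from any real news site.

```python
from urllib.robotparser import RobotFileParser

# The two OpenAI crawler names discussed in this article.
OPENAI_AGENTS = ("GPTBot", "ChatGPT-User")


def blocked_agents(robots_txt, agents=OPENAI_AGENTS, path="/"):
    """Return the agents that the given robots.txt text bars from `path`.

    Hypothetical helper: it takes the file's text rather than a URL, so the
    example runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Invented sample: only GPTBot is barred from the whole site.
SAMPLE_GPTBOT_ONLY = """\
User-agent: GPTBot
Disallow: /
"""

# Invented sample: both OpenAI crawlers are barred from the whole site.
SAMPLE_BOTH = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""

if __name__ == "__main__":
    print(blocked_agents(SAMPLE_GPTBOT_ONLY))
    print(blocked_agents(SAMPLE_BOTH))
```

To check a live site, the text of its robots.txt file would first have to be downloaded and passed to the function. The result only describes what the file asks crawlers to do, not how a crawler actually behaves.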


Out of 24 news brands (BBC News figures twice, both in South Africa and the UK’s top five online news sources), I received full responses for 11, limited or no responses for 11, and responses that were subject to change for two sources. By limited response, I mean when ChatGPT acknowledged it had trouble accessing the website of the news organisation I had named but provided an answer from other sources, such as search results or news aggregator websites.

Results by country 

When checking the news sites in Brazil, I found that ChatGPT was able to list up-to-date headlines from Globo News, O Globo and Jovem Pan News. These websites do not block OpenAI crawler bots. On the other hand, ChatGPT was unable to access the websites of UOL and Record News, which both block GPTBot according to their robots.txt files.

A screenshot of a chat exchange with ChatGPT. The question is: "Can you give me the headlines from UOL online?" ChatGPT's answer says it cannot directly access the website due to restrictions.

Moving on to the top online news sources in South Africa, I found that ChatGPT could access headlines from News24, SABC News, eNCA and the Daily Sun. As for BBC News, the fifth news brand with the highest reach in the country, the OpenAI crawler seemed unable to reach its landing page on 3 October. However, the chatbot linked to other pages on the BBC website, including the US and Canada news page, an explainer about Donald Trump’s New York fraud trial and the daily UK newspaper headline review.

A screenshot of a chat exchange with ChatGPT. The question is: "Can you give me the headlines from BBC News online?" ChatGPT's answer says it wasn't able to directly fetch the headlines from the website due to restrictions but provided some headlines from other pages.


Looking at the online news sources with the most reach in the UK, I asked about BBC News again. Although I had asked the same question, “Can you give me the headlines from BBC News online?”, in a different conversation only minutes earlier, this time the answer was different.

A screenshot of a chat exchange with ChatGPT. The question is: "Can you give me the headlines from BBC News online?" ChatGPT's answer says it wasn't able to directly fetch the headlines from the website due to restrictions but suggested I visit other pages to look up the news.

ChatGPT told me once again that it couldn’t directly access the site due to restrictions. This time, it didn’t give me any headlines sourced from other pages but redirected me to the broadcaster’s hourly radio headlines roundup and to Ground News, a website that collates and compares news from different outlets around the world. The BBC blocks both OpenAI crawler bots.

I received a similar response when I asked about the headlines from the Guardian, which has also said it’s blocking GPTBot. I was told ChatGPT couldn’t retrieve the headlines due to “website access restrictions” and it suggested I should look at Ground News and Front Pages, a website collating images of the front pages of popular newspapers from around the world. 

A screenshot of a chat exchange with ChatGPT. The question is: "Can you give me the headlines from MailOnline?" ChatGPT's answer says it cannot directly access the website but has found 'snippets' on other pages

I then asked for the headlines from MailOnline, which also blocks GPTBot. ChatGPT wasn’t able to directly access the headlines but did suggest and link to some pages I could access on the MailOnline website and to some of the news stories available there. For the remaining two top online news sources in the UK, Sky News and the Telegraph, the chatbot was able to summarise the headlines for me without any problems. Neither of those news sites blocks ChatGPT.

A screenshot of a chat exchange with ChatGPT. The question is: "Can you give me the headlines from 20 Minutes?" The reply says it wasn't able to access the website but offers me 'snippets from the search results'.

 

Unexpected responses

For most news organisations allowing their sites to be crawled by OpenAI’s bots, I could easily obtain up-to-date headlines through GPT-4’s ‘browsing with Bing’ option. It’s unclear why ChatGPT’s responses differed so much when it could not access a news site directly. They ranged from giving no information at all to sourcing snippets of news from other pages, other websites or the search results.

A screenshot of a chat exchange with ChatGPT. The question is: "Can you give me the headlines from MSN online?" ChatGPT's answer says it was unable to retrieve the headlines as the page didn't load properly. The user asks ChatGPT to try again and receives the same response.

 

For MSN News, which does not block ChatGPT’s crawler bots, the chatbot was unable to give me any headlines, saying “The page did not load properly.” I asked the chatbot to try again, with the same result. 

A few days after my first attempts, I tried once again and was once again unsuccessful. This time ChatGPT told me it was “unable to directly retrieve the headlines from MSN News due to a technical limitation.” It was unclear what this limitation was. MSN is a service provided by Microsoft, which is in a long-term partnership with OpenAI.

For three of the news organisations I asked about, ChatGPT was able to give me the headlines and link to the web pages despite their robots.txt files blocking both OpenAI crawler bots. These were France’s BFM TV and South Africa’s News24 and the Daily Sun. 

A screenshot of a chat exchange with ChatGPT. The question is: "Can you give me the headlines from News24?" ChatGPT lists the headlines.

A comparison of archived versions of News24’s robots.txt file on the Internet Archive’s Wayback Machine suggests the news outlet started blocking GPTBot on 21 August. When I asked ChatGPT for the headlines on 18 October, as well as multiple other times in the preceding days, it gave them to me without acknowledging any issues accessing the website, as it had done for other sites that blocked its bots.

On 2 November, however, the answer to the same question included this caveat: “I was unable to retrieve the headlines from News24 due to restrictions on the website,” and the answer was “based on the search results” instead of the website itself. 

According to palewire’s data captures, GPTBot was first listed as not allowed to crawl BFM TV’s site on 1 September. As with News24, though, I received a full response to multiple queries for the headlines over a month after GPTBot’s appearance in the site’s robots.txt file. Similarly, the response has now changed: at the time of this writing, on 2 November 2023, ChatGPT could no longer report the headlines.

As for the Daily Sun, its URL seems to have been recently changed, making it more difficult to track when its blocking of OpenAI’s bots began. At the time of this writing, however, I am still able to get the headlines from ChatGPT, which linked to the new website’s home page.

A screenshot of a chat exchange with ChatGPT. The question is: "Can you give me the headlines from the Daily Sun in South Africa?" The answer lists the headlines.
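The comparison of archived robots.txt files described above for News24 and BFM TV can be sketched in code. This is an illustrative example only, not the process used for this article. It assumes the archived files have already been collected by hand into a dictionary keyed by capture date. The function names, dates and file contents are all invented.

```python
from datetime import date
from urllib.robotparser import RobotFileParser


def blocks(robots_txt, agent="GPTBot"):
    """Return True if this robots.txt text bars `agent` from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


def first_blocked(snapshots, agent="GPTBot"):
    """Return the earliest capture date on which `agent` is blocked, or None.

    `snapshots` maps a capture date to the robots.txt text archived that day.
    """
    for day in sorted(snapshots):
        if blocks(snapshots[day], agent):
            return day
    return None


# Invented captures for a hypothetical site. The dates are deliberately
# unrelated to any outlet mentioned in this article.
SNAPSHOTS = {
    date(2020, 1, 1): "User-agent: *\nAllow: /\n",
    date(2020, 2, 1): "User-agent: *\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n",
    date(2020, 3, 1): "User-agent: *\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n",
}

if __name__ == "__main__":
    print(first_blocked(SNAPSHOTS))
    print(first_blocked(SNAPSHOTS, agent="ChatGPT-User"))
```

The date returned is only the first capture in which the block appears. The real change could have happened at any point since the previous capture, so sparse archives give only a rough estimate.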

It’s important to stress these are only observations based on a limited test and we’re not in a position to draw any general conclusions. As I said, this is not a piece of academic research.

Overall, there are some differences in the extent to which news organisations block ChatGPT. For example, MailOnline, the Guardian, CNN and the New York Times block GPTBot but not ChatGPT-User. Other outlets such as the BBC block both bots. 

Another factor that could have affected the variety of responses I received is that the content of global news organisations is shared and reproduced on news aggregators like the ones linked above, giving ChatGPT another way to access their content. 

As with most things related to generative AI, we don’t know exactly how the process works right now, or how it will evolve in the near future as further developments are implemented. Many news organisations will be watching how this technology changes and what benefits (if any) it provides for them before taking action.