How ChatGPT’s responses change as top news sites from five countries block it
As the debate around the use of news content to train AI models heats up, many news publishers have started blocking ChatGPT from scanning their sites and potentially using their pages for future content.
According to an open-source survey run by journalist Ben Welsh for his blog palewire, almost half of 1,148 news publishers surveyed have blocked at least one AI crawler from their sites. The crawlers in question include those run by Google AI, the non-profit Common Crawl and OpenAI, the company behind ChatGPT.
Some critics say these blocking options do not go far enough. One example is ‘Google-Extended’, a new control that the company says lets publishers decide how their content is used for Google’s AI products. However, Google told SEO publication Search Engine Land that Google-Extended does not stop publishers’ content from appearing in Google Search Generative Experience, a search feature that uses generative AI to add a short text alongside search results to provide extra context.
Despite concerns over the use of news content for generative AI, not all news sites block these crawlers.
Among those who block OpenAI are several high-profile news organisations based in the UK and US, such as the BBC, the New York Times, CNN and Reuters. But blocking is not as widespread in other parts of the world. According to a spreadsheet coordinated by SEO consultant Gary Kirwan, which logs the AI crawler bots blocked by news sites from several countries, only one among Spain’s large media outlets does – OK Diario.
I wanted to see which news sites were blocking the OpenAI crawler bot and how the chatbot reacted to being asked for content from a blocked website. There are two OpenAI crawler bots currently active: GPTBot and ChatGPT-User. GPTBot is the most widely known – and the most widely blocked. According to OpenAI, the difference is that ChatGPT-User “will only be used to take direct actions on behalf of ChatGPT users and is not used for crawling the web in any automatic fashion.”
I asked ChatGPT to give me the latest headlines from five online news sources in each of five countries: the UK, the US, France, Brazil and South Africa. I chose a diverse set of countries and, for each one, selected the five online news brands with the highest weekly reach according to our Digital News Report 2023. Once I had selected the news brands, I checked whether they blocked OpenAI’s crawler bots by looking up their robots.txt files. It’s important to stress that this is not a piece of academic research and that this analysis only reflects the situation at the time of writing.
Out of 24 news brands (BBC News figures twice, in both South Africa’s and the UK’s top five online news sources), I received full responses for 11, limited or no responses for 11, and responses that were subject to change for two sources. By limited response, I mean cases where ChatGPT acknowledged it had trouble accessing the website of the news organisation I had named but provided an answer from other sources, such as search results or news aggregator websites.
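For readers who want to repeat the robots.txt check, the following is a minimal sketch of one way to do it in Python, using the standard library's robots.txt parser. It is not the method used for this piece, which relied on reading the files by hand. The function name `blocked_agents`, the sample file and the `example.com` address are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two OpenAI user agents named in this article.
OPENAI_AGENTS = ("GPTBot", "ChatGPT-User")


def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Return, for each OpenAI agent, whether robots_txt disallows `url`.

    `blocked_agents` and the default URL are hypothetical names chosen
    for this sketch; only RobotFileParser comes from the standard library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in OPENAI_AGENTS}


# An invented robots.txt, not copied from any real news site:
# it shuts out GPTBot and leaves every other crawler free to visit.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_agents(SAMPLE))
```

To check a live site instead of a pasted file, the same parser can be pointed at a robots.txt address with `set_url()` and `read()`. Bear in mind that robots.txt is a request rather than an enforcement mechanism, so the result shows what a publisher has asked for, not what a crawler actually does.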
Results by country
When checking the news sites in Brazil, I found that ChatGPT was able to list up-to-date headlines from Globo News, O Globo and Jovem Pan News. These websites do not block OpenAI crawler bots. On the other hand, ChatGPT was unable to access the websites of UOL and Record News, which both block GPTBot according to their robots.txt files.
Moving on to the top online news sources in South Africa, I found that ChatGPT could access headlines from News24, SABC News, eNCA and the Daily Sun. As for BBC News, the news brand with the fifth-highest reach in the country, the OpenAI crawler seemed unable to reach its landing page on 3 October. However, the chatbot linked to other pages on the BBC website, including the US and Canada news page, an explainer about Donald Trump’s New York fraud trial and the daily UK newspaper headline review.
Looking at the online news sources with the most reach in the UK, I asked about BBC News again. Despite this being only minutes after asking the same question, “Can you give me the headlines from BBC News online?” in a different conversation, this time the answer was different.
ChatGPT told me once again that it couldn’t directly access the site due to restrictions. This time, it didn’t give me any headlines sourced from other pages but redirected me to the broadcaster’s hourly radio headlines roundup and to Ground News, a website that collates and compares news from different outlets around the world. The BBC blocks both OpenAI crawler bots.
I received a similar response when I asked about the headlines from the Guardian, which has also said it’s blocking GPTBot. I was told ChatGPT couldn’t retrieve the headlines due to “website access restrictions” and it suggested I should look at Ground News and Front Pages, a website collating images of the front pages of popular newspapers from around the world.
I then asked for the headlines from MailOnline, which also blocks GPTBot. ChatGPT wasn’t able to directly access the headlines but did suggest and link to some pages I could access on the MailOnline website and to some of the news stories available there. For the remaining two top online news sources in the UK, Sky News and the Telegraph, the chatbot was able to summarise the headlines for me without any problems. Neither of those news sites blocks ChatGPT.
Unexpected responses
For most news organisations allowing their sites to be crawled by OpenAI’s bots, I could easily obtain up-to-date headlines through GPT-4’s ‘browsing with Bing’ option. It’s unclear why ChatGPT’s responses were so different when explaining why it can’t access a news site directly. Responses varied widely from not giving any information to sourcing snippets of news from other pages, other websites or the search results.
For MSN News, which does not block ChatGPT’s crawler bots, the chatbot was unable to give me any headlines, saying “The page did not load properly.” I asked the chatbot to try again, with the same result.
A few days after my first attempts, I tried once again and was once again unsuccessful. This time ChatGPT told me it was “unable to directly retrieve the headlines from MSN News due to a technical limitation.” It was unclear what this limitation was. MSN is a service provided by Microsoft, which is in a long-term partnership with OpenAI.
For three of the news organisations I asked about, ChatGPT was able to give me the headlines and link to the web pages despite their robots.txt files blocking both OpenAI crawler bots. These were France’s BFM TV and South Africa’s News24 and the Daily Sun.
Comparing archived versions of News24’s robots.txt page found on the Internet Archive’s Wayback Machine, it appears the news outlet started blocking GPTBot on 21 August. When I asked ChatGPT for the headlines on 18 October, as well as multiple other times in the preceding days, it was able to give them to me and did not acknowledge any issues accessing the website, as it had for other sites that blocked its bots.
On 2 November, however, the answer to the same question included this caveat: “I was unable to retrieve the headlines from News24 due to restrictions on the website,” and the answer was “based on the search results” instead of the website itself.
According to palewire’s data captures, GPTBot was first listed as not allowed to crawl BFM TV’s site on 1 September. As with News24, though, I received a full response to multiple queries for the headlines more than a month after GPTBot first appeared in the site’s robots.txt file. Similarly, the response has now changed: at the time of writing, on 2 November 2023, ChatGPT can no longer report the headlines.
As for the Daily Sun, its URL seems to have been recently changed, making it more difficult to track when its blocking of OpenAI’s bots began. At the time of this writing, however, I am still able to get the headlines from ChatGPT, which linked to the new website’s home page.
It’s important to stress these are only observations based on a limited test and we’re not in a position to draw any general conclusions. As I said, this is not a piece of academic research.
Overall, there are some differences in the extent to which news organisations block ChatGPT. For example, MailOnline, the Guardian, CNN and the New York Times block GPTBot but not ChatGPT-User. Other outlets such as the BBC block both bots.
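The difference between these two levels of blocking comes down to which user agents a robots.txt file names. The sketch below, again in Python, sorts a file into one of four groups. The two sample files are invented to illustrate each pattern and are not the actual files of any outlet mentioned here, and `block_level` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def block_level(robots_txt: str) -> str:
    """Describe which OpenAI bots a robots.txt file shuts out of the whole site.

    `block_level` is a hypothetical helper written for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Collect the agents that may not fetch the site root.
    blocked = [
        agent
        for agent in ("GPTBot", "ChatGPT-User")
        if not parser.can_fetch(agent, "/")
    ]
    if len(blocked) == 2:
        return "both"
    if blocked:
        return blocked[0] + " only"
    return "neither"


# Invented example: only the training crawler is turned away.
GPTBOT_ONLY = """\
User-agent: GPTBot
Disallow: /
"""

# Invented example: both OpenAI agents share one Disallow rule.
BOTH_BOTS = """\
User-agent: GPTBot
User-agent: ChatGPT-User
Disallow: /
"""

print(block_level(GPTBOT_ONLY))
print(block_level(BOTH_BOTS))
```

A site in the first group has asked OpenAI not to crawl it automatically but has left the door open for visits made on behalf of a ChatGPT user. That may be one reason why answers differed between outlets, although this limited test cannot confirm it.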
Another factor that could have affected the variety of responses I received is that the content of global news organisations is shared and reproduced on news aggregators like the ones linked above, giving ChatGPT another way to access their content.
As with most things related to generative AI, we don’t know exactly how the process works right now or how it will evolve in the near future as further developments are implemented. Many news organisations will be watching how this technology changes and what benefits (if any) it provides for them before taking action.