The next phase in ChatGPT’s meteoric rise is the arrival of GPTBot, OpenAI’s web crawler. GPTBot crawls webpages to gather training data that deepens the output ChatGPT can provide.
AI improvement seems positive, but it’s not so clear-cut. Legal and ethical issues surround the technology.
GPTBot’s arrival has highlighted these concerns, as many major brands are blocking it instead of leveraging its potential.
But I truly believe there’s much more to gain than lose by fully (and responsibly) embracing GPTBot.
Why do AI bots like GPTBot crawl websites?
Understanding why bots like GPTBot do what they do is the first step to embracing this technology and leveraging its potential.
Simply put, bots like GPTBot crawl websites to gather information. The main difference is that rather than an AI platform passively being fed data to learn from (the “training set,” if you will), a bot can actively pursue information on the web by crawling various pages.
Large language models (LLMs) scour these websites in an attempt to understand the world around us. Google’s C4 dataset, drawn from some 15.7 million sites, makes up a large portion of the training corpus for these LLMs. They also crawl other authoritative, informative sites like Wikipedia and Reddit.
The more sites these bots can crawl, the more they learn and the better they can become. Why, then, are companies blocking GPTBot from crawling?
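For context, blocking happens through the standard robots.txt mechanism: OpenAI documents a “GPTBot” user-agent token that site owners can target with ordinary crawl directives. A site opting out entirely would serve something like this (a minimal sketch of the documented approach):

```text
# robots.txt — block OpenAI's GPTBot crawler site-wide
User-agent: GPTBot
Disallow: /
```

The same file can just as easily grant access, which is the choice I’m arguing for below.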
Do brands that block GPTBot have valid fears?
When I first read about companies blocking GPTBot from crawling their websites, I was confused and surprised.
To me, it seemed incredibly short-sighted. But I figured there must be a lot to consider that I wasn’t thinking deeply enough about.
After researching and talking to agency professionals with legal backgrounds, I found the biggest reasons.
Lack of compensation for their proprietary training data
Many brands block GPTBot from crawling their sites because they don’t want their data used to train its models without compensation. While I understand wanting a piece of OpenAI’s $1 billion pie, I think this is a short-sighted view.
ChatGPT, much like Google and YouTube, is an answer engine for the world. Preventing your content from being crawled by GPTBot might limit your brand’s reach to a smaller set of internet users in the future.
Security concerns
Another reason behind the anti-GPTBot sentiment is security. While more valid than a reluctance to share data, this concern is still largely unfounded from my perspective.
By now, every website should already be well secured. Besides, the content GPTBot is trying to access is public, non-sensitive content: the same material that Google, Bing, and other search engines crawl every day.
What caches of sensitive information do CIOs, CEOs, and other company leaders think GPTBot will access during its crawl? And with the right security measures, shouldn’t this be a non-issue?
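In fact, site owners don’t have to treat this as all-or-nothing: robots.txt rules can keep GPTBot out of specific areas while leaving public pages crawlable, and you can verify a policy with Python’s standard `urllib.robotparser` before deploying it. A minimal sketch (the `/account/` path and `example.com` URLs are illustrative, not from any real site):

```python
# Sketch: check what a robots.txt policy lets GPTBot fetch,
# using Python's standard urllib.robotparser.
from urllib.robotparser import RobotFileParser

# Illustrative policy: GPTBot may crawl everything except /account/
robots_txt = """\
User-agent: GPTBot
Disallow: /account/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Public content stays reachable...
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))   # True
# ...while the restricted path is off-limits.
print(parser.can_fetch("GPTBot", "https://example.com/account/me"))  # False
```

With a policy like this in place, the “what might it access?” worry reduces to the same access-control hygiene sites already apply to Googlebot and Bingbot.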
The looming threat of legal implications
From a legal standpoint, the argument is that any crawls done on a brand’s site must be covered by their privacy disclaimer. All websites should have a privacy disclaimer outlining how they use the data collected by their services. Attorneys say this language must also state that a generative AI third-party platform could crawl the data collected.
If not, any personally identifiable information (PII) or customer data could still be “public” and expose brands to a Section 5 Federal Trade Commission (FTC) claim for unfair and deceptive trade practices.
I get this concern to some degree. If you’re the legal department of a big-name brand, one of your primary objectives is to keep your company out of hot water. But this legal concern applies more to what’s input into ChatGPT rather than what GPTBot crawls.
Anything input into OpenAI’s platform becomes part of its data bank and has the potential to be shared with other users, leading to data leakage. However, this would likely only happen if users asked questions related to the stored information.
This is another unwarranted concern to me because it can all be resolved by responsible internet usage. The same data principles we’ve used since the dawn of the web still ring true – don’t input any information you don’t want shared.
An impulse to save humanity from AI advancement
I can’t help but think that leaders at some of these brands blocking GPTBot have a bias against the advancement of AI technology.
We often fear what we don’t understand, and some are frightened by the idea of artificial intelligence gaining too much knowledge and becoming too powerful.
While AI is evolving rapidly and beginning to “think” more deeply, humans are still largely in control. Additionally, legislation governing AI will grow alongside the technology.
When we finally reach a world of “autonomous” AI platforms, their functionality will be guided by years of human innovation and legislation.
3 reasons not to block ChatGPT’s GPTBot
So why should you allow GPTBot to crawl your site? Let’s look on the bright side with these three primary benefits of embracing OpenAI’s bot technology.
1. 100 million people use ChatGPT each week
By not allowing GPTBot to crawl your site, you’re missing out on maximizing brand visibility with an audience of 100 million weekly users.
Sharing access to your website content can help ensure your brand is both factually and positively represented to ChatGPT users.
This means there’s a higher chance that your brand will actually be recommended by ChatGPT, leading to more traffic and potential customers.
Some brands report getting 5% of their overall leads, or $100,000 in monthly subscription revenue from ChatGPT. I know our agency has already gotten some leads from ChatGPT, too.
Another way to consider this is as a positive digital PR (DPR) play. You should leverage DPR strategies like brand mention campaigns in today’s landscape.
Permitting GPTBot to crawl your site only adds to these efforts by allowing ChatGPT to access your brand information directly from the source and distribute it to 100 million users positively.
2. Generative engine optimization (GEO)
Whether or not you have fears about AI, we can all agree that it’s changing the marketing landscape. Like all new technologies and trends in our industry, those slow to embrace AI as a conduit for new business and brand exposure will miss the proverbial boat.
GEO is picking up steam as a sub-practice of SEO. You’ll miss a significant opportunity if you don’t direct some of your marketing efforts toward this channel, and competitors may pick up what you let slip through the cracks.
We know it’s easy for brands to fall behind in today’s fragmented and ever-growing marketing landscape. If your competitors spend years working on GEO, maximizing LLM visibility and developing skills and expertise in this area, they’ll be years ahead of you.
Now, GEO reporting capabilities haven’t caught up with the channel’s value yet, so ROI will be tough to measure, but that doesn’t mean it’s something to ignore and fall behind on.
Brands and marketers must start embracing LLMs like ChatGPT as an emerging acquisition channel.
3. OpenAI’s pledge to minimize harm
A healthy distrust of AI technologies is important to their legal and ethical growth. But we also need to be open-minded: we can’t be effective as marketers if we resist change and choose not to grow and innovate alongside the technology.
OpenAI clearly states “minimize harm” as one of the guiding principles of its platform. It also has policies to respect copyright and intellectual property and has stated that GPTBot filters out sources that violate those policies.
By allowing GPTBot to crawl your site’s content, you’re contributing to the clean and accurate training data OpenAI uses to enhance and improve its information accuracy.
As AI technology marches on, it can be easy to get caught up in skepticism, fear, and noise. Those struggling to embrace and maximize it will get left behind.