AI Crawlers and Your Contractor Website: The robots.txt Fix
When a homeowner asks ChatGPT “find me an HVAC company in [city],” the answer does not come from your Google Business Profile. ChatGPT sends a live web crawler to fetch real-time content from across the internet, then assembles its answer from those retrieved pages. If that crawler reaches your website and finds a robots.txt file that blocks it, your business does not appear in the answer. A competitor with an open site does.
Research published in 2026 shows approximately 27 percent of small business websites are accidentally blocking major AI search crawlers, often through default CDN configurations, blanket disallow-all rules in robots.txt, or security settings added years ago by a developer who was not thinking about AI search. Most contractors have never looked at their robots.txt file. If you have not, you may be invisible to ChatGPT Search, Perplexity, and Claude web search, regardless of how strong your service pages are.
Two Types of AI Bots: Training Crawlers vs. Search Crawlers
The confusion starts here. Most contractors who have heard anything about AI bots have been told to block GPTBot to protect their content. That advice is not wrong, but it is incomplete, and acting on half of it creates a visibility problem. There are two fundamentally different types of AI bots visiting your website, and they are operated by the same companies but do entirely different things.
Training crawlers collect content from the web to train or update AI language models. GPTBot (OpenAI), ClaudeBot (Anthropic), and CCBot (Common Crawl) fall into this category. Blocking these prevents your content from contributing to model training datasets. Whether you block them is your decision. It has no effect on whether you appear in AI search results.
Search and retrieval crawlers fetch live web content to answer user queries in real time. OAI-SearchBot (ChatGPT Search), Claude-SearchBot (Claude’s web search feature), and PerplexityBot (Perplexity AI) are retrieval crawlers. Blocking these removes you from the AI search results that homeowners see when they ask these tools for contractor recommendations. These are the bots that directly affect your GEO visibility.
You can block training crawlers and allow search crawlers at the same time. They are separate bots with different user-agent strings, even when the same company runs both. A contractor who blocked GPTBot to protect their content has not necessarily addressed OAI-SearchBot, Claude-SearchBot, or PerplexityBot at all. Their ChatGPT Search visibility may be fine, or it may be broken, and they have no way to know without checking.
Which Bots Matter and What Blocking Each One Does
| Bot Name | Operated By | Type | Impact of Blocking |
|---|---|---|---|
| GPTBot | OpenAI | Training | Removes content from ChatGPT model training, not from search |
| OAI-SearchBot | OpenAI | Search / Retrieval | Removes you from ChatGPT Search results |
| ClaudeBot | Anthropic | Training | Removes content from Claude model training, not from search |
| Claude-SearchBot | Anthropic | Search / Retrieval | Removes you from Claude web search results |
| PerplexityBot | Perplexity AI | Search / Retrieval | Removes you from Perplexity citations |
| Google-Extended | Google | Training | Removes content from Google AI training, not from AI Overviews |
| Googlebot | Google | Search | Removes you from all Google results. Never block this. |
For home service contractors, the correct approach is: allow all search and retrieval crawlers, make your own decision about training crawlers based on content preferences, and never block Googlebot under any circumstances. If you want to keep AI companies from using your service pages to train their models while still appearing in AI search results, you can have both. The configurations are independent.
How to Check Your Current robots.txt
Go to your domain and add /robots.txt to the end of the URL. For example: yourdomain.com/robots.txt. A few things you might find:
- 404 error or blank page: You have no robots.txt file. Your site is open to all crawlers by default. Create one to make your permissions explicit, but you are likely not accidentally blocking anyone.
- “Disallow: /” under User-agent: *: Everything is blocked. This is a development configuration that should never reach a live production site. It removes you from every search engine and AI crawler at once. Fix this immediately.
- GPTBot blocked, no mention of OAI-SearchBot: You may still be accessible to ChatGPT Search, since a bot with no entry of its own falls back to the wildcard User-agent: * rule. Adding an explicit allow for OAI-SearchBot removes that ambiguity. It does not override CDN-level blocks, which have to be checked separately.
- No mention of PerplexityBot: You are likely accessible, but an explicit allow rule makes the permission unambiguous. CDN bot-management categories can still block the crawler regardless of what robots.txt says.
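The manual check above can also be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is a minimal illustration, not an official tool: the function name `diagnose` and the `SEARCH_BOTS` list are assumptions made for this example, and Python's parser may differ in edge cases from how each crawler actually reads your file.

```python
from urllib.robotparser import RobotFileParser

# Search/retrieval crawler tokens named in this article.
SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot"]


def diagnose(robots_txt: str) -> dict:
    """Report, per search crawler, whether the rules allow the home page
    and whether the file names that crawler explicitly."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Collect every token that appears on a User-agent line.
    named = {
        line.split(":", 1)[1].split("#", 1)[0].strip().lower()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }

    return {
        bot: {
            "allowed": parser.can_fetch(bot, "/"),
            "explicit": bot.lower() in named,
        }
        for bot in SEARCH_BOTS
    }


# A file that only blocks GPTBot: search crawlers are allowed, but only implicitly.
for bot, result in diagnose("User-agent: GPTBot\nDisallow: /\n").items():
    print(bot, result)
```

Paste the contents of your own robots.txt into the call to see which of the four situations applies. An empty string stands in for a missing file, which this parser treats as open to all crawlers.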
The robots.txt Configuration for Home Service Contractors
The following configuration allows all search and retrieval crawlers while optionally blocking training crawlers. Replace the sitemap URL with your actual sitemap address before publishing.
# Allow all search engines by default
User-agent: *
Allow: /
# Explicitly confirm AI search crawlers are allowed
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Optional: Block training-only crawlers if you prefer
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
Sitemap: https://yourdomain.com/sitemap.xml
Specific rules for a named user-agent override the wildcard User-agent: * rule. A crawler follows the group that names it and ignores the wildcard group, so the OAI-SearchBot Allow: / entry applies to that bot even if the User-agent: * rules are restrictive. This structure lets you write restrictive general rules while carving out explicit permissions for the crawlers that matter for AI search visibility.
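To see that precedence in action, here is a small sketch using Python's standard `urllib.robotparser`. The `RESTRICTIVE` file is hypothetical and exists only to demonstrate the carve-out. It would also block Googlebot, so it is not a configuration to publish. The helper name `allowed` is likewise an assumption for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical restrictive file: block everything by default,
# then carve out one AI search crawler by name.
RESTRICTIVE = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if the rules let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allowed(RESTRICTIVE, "OAI-SearchBot"))  # True: the named group applies
print(allowed(RESTRICTIVE, "PerplexityBot"))  # False: falls back to the wildcard
```

The named group wins for OAI-SearchBot, while any crawler without its own group inherits the wildcard block.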
If your site runs behind a CDN like Cloudflare with bot management enabled, check those settings independently. CDN-level bot rules can block bots by category before robots.txt is ever read. Some CDN configurations group AI crawlers with scraper bots and block them at the edge layer. If you update robots.txt and still cannot confirm crawler access using the verification steps below, the block is likely at the CDN layer, not the robots.txt layer.
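One rough way to look for an edge-level block is to request a page while sending a crawler's name as the User-Agent header and compare the response with a normal request. The sketch below uses only the Python standard library. The function names `probe` and `interpret`, and the list of status codes treated as suspicious, are assumptions made for this example. Sending the bare bot token is a simplification of the real crawler's full user-agent string, and many CDNs also verify crawlers by IP address, so treat the result as a hint, not proof.

```python
import urllib.error
import urllib.request


def probe(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions


def interpret(status: int) -> str:
    """Rough reading of a status code; the groupings are assumptions."""
    if 200 <= status < 300:
        return "reachable"
    if status in (401, 403, 429, 503):
        return "possibly blocked at the edge"
    return "inconclusive"


# Example usage (replace the URL with your own site before running):
# for token in ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"):
#     print(token, interpret(probe("https://yourdomain.com/", token)))
```

If a regular browser loads the page but the probe returns 403 for a crawler token, the CDN's bot settings are the first place to look.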
The llms.txt Question
There is significant discussion about llms.txt in 2026: a file placed at yoursite.com/llms.txt that describes your business in plain text and links to your most important pages. The idea is that AI systems use it to understand your site faster than crawling every page individually.
The honest reality for local contractors: no major AI search platform has confirmed using llms.txt as a citation or ranking signal as of mid-2026. The file has genuine potential for agentic AI applications where an AI agent actively researches businesses to complete tasks on behalf of a user. For real-time search citations in ChatGPT, Perplexity, or Google AI Overviews, it has not produced measurable improvements for local businesses.
If you have 20 minutes after confirming your robots.txt is correct, creating a basic llms.txt costs nothing. Do not skip the robots.txt fix in favor of llms.txt. The crawlers need access to your site before they can read any file on it.
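For a sense of what a basic llms.txt might contain, the sketch below builds one in Python. The layout (a title line, a short summary, then a list of links) follows the community llms.txt proposal as commonly described, and no AI platform is known to require it. The business name, URLs, and the function name `build_llms_txt` are all hypothetical placeholders.

```python
def build_llms_txt(name: str, summary: str, pages: list) -> str:
    """Build a minimal llms.txt body: title, summary, and key page links.

    `pages` is a list of (title, url, note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", "", "## Key pages", ""]
    for title, url, note in pages:
        lines.append(f"- [{title}]({url}): {note}")
    return "\n".join(lines) + "\n"


# Hypothetical contractor details, for illustration only.
text = build_llms_txt(
    "Example Heating & Air",
    "Residential HVAC repair and installation in [city].",
    [
        ("AC Repair", "https://yourdomain.com/ac-repair", "Service details and service area"),
        ("Contact", "https://yourdomain.com/contact", "Phone, hours, and scheduling"),
    ],
)
print(text)
```

Save the output as llms.txt at the root of your site, alongside robots.txt.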
How to Confirm the Fix Worked
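Google Search Console robots.txt report. In Search Console, go to Settings and open the robots.txt report. It shows whether Google could fetch your robots.txt file, when it was last crawled, and any parsing warnings or errors. The report reflects Google's own fetch and does not test third-party user-agents, so after your update, check OAI-SearchBot, Claude-SearchBot, and PerplexityBot against your rules with a robots.txt parser or validator that accepts custom user-agent strings.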
Perplexity search verification. Search Perplexity for your business name, then search for a key service in your city: “HVAC repair [city]” or “plumber near [neighborhood].” If your website appears as a cited source in the results, PerplexityBot can reach and read your pages. Perplexity is the fastest confirmation method because it surfaces its sources directly. A citation from Perplexity is live proof that the crawler has access to your content. The reverse does not hold: a missing citation can also mean your page was simply not chosen as a source, so it is not proof of a block on its own.
Three Actions for This Week
- Check your robots.txt file today. Go to yourdomain.com/robots.txt and read what is there. Look for any Disallow rules that could restrict AI search crawlers, either by user-agent name or through a broad Disallow: / block. If you see Disallow: / under User-agent: *, fix that immediately. If you have no robots.txt file, create one using the configuration above. The check takes five minutes and the fix takes ten. The change takes effect the next time each crawler fetches your robots.txt and pages.
- Add explicit Allow rules for AI search crawlers. Even if your current file appears to allow everything, adding explicit Allow: / rules for OAI-SearchBot, Claude-SearchBot, and PerplexityBot eliminates ambiguity about your intent. These rules cannot override CDN-level bot management, so if you use Cloudflare or another CDN with bot protection enabled, check those settings separately and confirm AI search crawlers are not grouped with malicious bots in a blocked category.
- Run the Perplexity verification search. After updating your robots.txt, wait 24 to 48 hours and search Perplexity for your business name and for a key service in your city. If your site appears as a cited source, the fix worked. If it does not appear within one week, check your CDN settings for edge-level blocks that robots.txt cannot override. Note the date of your search so you have a baseline for tracking AI search visibility going forward.
AI search crawlers follow the same access rules as traditional search crawlers: they go where they are allowed and stop where they are not. A contractor who has never reviewed their robots.txt for AI access has no way of knowing whether ChatGPT Search and Perplexity can reach their service pages. The check takes five minutes. The fix takes ten. The result is visibility across every AI platform that uses retrieval crawlers to answer homeowner queries in real time, whether or not those homeowners ever open a Google search tab.