
Check your crawl directives before they block the wrong pages

Robots.txt rules and noindex directives tell crawlers which URLs they may fetch and which pages they may index. SiteCurl checks the file, validates basic rules, and surfaces indexing blockers.

What this check does

SiteCurl fetches /robots.txt and checks if the file exists and looks valid. It also looks for robots rules on each page, such as noindex values that keep pages out of search.

The check reads the syntax of the robots.txt file, looking for malformed rules that crawlers may ignore or misread. It also checks whether the file blocks CSS, JS, or image files that Google needs to render the page correctly.
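To make the syntax check concrete, here is a minimal Python sketch of a line-by-line lint. It is not SiteCurl's implementation: the lint_robots_txt function, the KNOWN_DIRECTIVES list, and the sample file are all assumptions made for illustration.

```python
# Directives this sketch treats as known. The list is an assumption for
# illustration, not an official registry.
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots_txt(text):
    """Return (line_number, message) pairs for lines that look malformed."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, _, _value = line.partition(":")
        field = field.strip().lower()
        if field not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{field}'"))
        elif field == "user-agent":
            seen_user_agent = True
        elif field in {"allow", "disallow"} and not seen_user_agent:
            problems.append((number, f"'{field}' appears before any User-agent line"))
    return problems


# Hypothetical file with a misspelled directive and a missing colon.
sample = """User-agent: *
Disalow: /private/
Disallow /tmp/
Allow: /public/
"""

for number, message in lint_robots_txt(sample):
    print(f"line {number}: {message}")
```

On the sample, the sketch reports the misspelled directive on line 2 and the missing separator on line 3. Real crawlers differ in how they treat such lines, so a lint like this is a prompt for review rather than a verdict.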

On each scanned page, SiteCurl reads both the HTML <meta name='robots'> tag and the HTTP X-Robots-Tag header for noindex rules. Pages marked noindex are flagged so you can confirm the rule is on purpose.
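A page-level check along these lines could look like the Python sketch below. It is an illustration, not SiteCurl's code: the find_noindex function and RobotsMetaParser class are hypothetical names, and the header parsing is simplified (it strips an optional user-agent prefix and ignores everything else). The sketch treats the value none as equivalent to noindex, since search engines read it as noindex plus nofollow.

```python
from html.parser import HTMLParser

# "none" is shorthand for "noindex, nofollow".
NOINDEX_VALUES = {"noindex", "none"}


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            content = attr.get("content", "")
            self.directives.extend(d.strip().lower() for d in content.split(","))


def find_noindex(html, headers):
    """Return where a noindex rule was found: 'meta', 'header', both, or neither."""
    sources = []
    parser = RobotsMetaParser()
    parser.feed(html)
    if NOINDEX_VALUES & set(parser.directives):
        sources.append("meta")
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        for token in value.split(","):
            # Strip an optional "googlebot:" style prefix (simplified handling).
            directive = token.split(":")[-1].strip().lower()
            if directive in NOINDEX_VALUES and "header" not in sources:
                sources.append("header")
    return sources


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(find_noindex(page, {"X-Robots-Tag": "googlebot: noindex"}))
```

Checking both places matters because a clean page source says nothing about what the server or CDN adds to the response.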

How this shows up in the real world

Robots.txt controls crawling, not indexing. A Disallow rule tells crawlers not to fetch a URL, but it does not remove the URL from search results if Google already knows about it from outside links or past crawls. To stop indexing, you need a noindex rule. Mixing up these two tools is one of the most common technical SEO mistakes.

The robots.txt file is read before the crawler requests any page. If the file is missing, most crawlers assume the whole site may be crawled. If the file has a syntax error, the result depends on the crawler. Google tends to forgive small errors, but Bing, AI crawlers, and SEO tools may each interpret the file differently.

Large sites often pile up robots.txt rules over years of dev work. A rule added during a migration may still be live long after the migration is done. Reviewing the file periodically keeps stale rules from blocking pages that should now be open.

The link between robots.txt and noindex needs extra care. If robots.txt blocks a page that also has a noindex tag, the crawler cannot see the noindex tag because it never loads the page. The page may stay in the index indefinitely, typically shown with text drawn from links pointing to it instead of page content.
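The conflict can be detected mechanically. The sketch below uses Python's standard urllib.robotparser to list pages that carry noindex but are also disallowed, which means the noindex can never be read. The hidden_noindex_pages function, the example.com URLs, and the input list of noindex pages are assumptions for illustration. Note that the standard-library parser applies the first matching rule and does not support wildcards, so its verdicts can differ from Google's on complex files.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex_pages(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the noindex URLs that robots.txt stops the crawler from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules and pages.
robots_txt = "User-agent: *\nDisallow: /old/\n"
noindex_urls = [
    "https://example.com/old/pricing",  # blocked, so its noindex is invisible
    "https://example.com/thanks",       # crawlable, so its noindex works
]

print(hidden_noindex_pages(robots_txt, noindex_urls))
# ['https://example.com/old/pricing']
```

For any URL this returns, the usual remedy is to lift the Disallow rule long enough for crawlers to fetch the page and see the noindex.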

Why it matters

Crawl-control mistakes are costly because they can hit a whole site at once. A single bad rule can hide key pages from Google or block files that Google needs to render the page correctly.

A wrong Disallow rule can drop full sections of a site from search results. Unlike a content change that hits one page at a time, robots.txt rules match by path prefix. Blocking /services/ stops crawlers from fetching every page under that path. That can mean dozens or hundreds of pages lose their snippets and rankings once search engines can no longer refresh them.
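One way to see the reach of a rule before shipping it is to count how many known URLs each Disallow path would match. The Python sketch below does that with plain prefix matching. The pages_hit_by_rule function and the example.com URLs are hypothetical, and the sketch deliberately ignores Allow rules and wildcard patterns, so treat the counts as a rough estimate.

```python
from collections import Counter
from urllib.parse import urlparse


def pages_hit_by_rule(disallow_paths, urls):
    """Count how many URLs each Disallow path prefix would block."""
    hits = Counter()
    for url in urls:
        path = urlparse(url).path or "/"
        for rule in disallow_paths:
            # An empty Disallow value blocks nothing, so skip it.
            if rule and path.startswith(rule):
                hits[rule] += 1
    return dict(hits)


# Hypothetical URL list, for example taken from a sitemap.
urls = [
    "https://example.com/services/seo",
    "https://example.com/services/audit",
    "https://example.com/about",
]

print(pages_hit_by_rule(["/services/", "/admin/"], urls))
# {'/services/': 2}
```

A rule that matches far more URLs than expected is worth a second look before it goes live.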

Noindex rules on single pages are just as powerful. One meta tag or HTTP header keeps a page out of the index for as long as it stays in place. When these rules are added by mistake (by a CMS plugin, a deploy script, or a CDN setup), the damage is silent until someone spots the traffic drop.

Who this impacts most

Sites that just launched or moved are most at risk. Dev teams often add broad Disallow rules during staging to keep Google from indexing test content. If those rules are not removed before launch, the live site goes out with crawling blocked. This is one of the most common post-launch SEO failures.

Sites behind CDNs or reverse proxies should watch for X-Robots-Tag headers. These headers can be added at the server or edge level with no change to the app code. A misconfigured Cloudflare rule or Nginx directive can add noindex headers to every response without anyone on the dev team knowing.

WordPress and other CMS sites may have a 'Discourage search engines' setting that adds noindex rules site-wide. If this setting is turned on in production by mistake, every page on the site tells search engines not to index it. SiteCurl catches this by checking for noindex on each scanned page.

How to fix it

Step 1: Check that the file exists at the root of the domain. Visit https://yourdomain.com/robots.txt in your browser. It should return a 200 status with plain text. If it returns a 404 or an HTML page, the file is not set up correctly.

Step 2: Read each Disallow rule. Go through each rule and confirm it only blocks areas you want hidden: admin panels, site search results, staging paths, and user account pages. Remove any rules that block public content, CSS files, or JS files.

Step 3: Check for noindex rules on key pages. Open the page source and search for noindex. Also check the HTTP response headers using dev tools or a curl -I command. Remove any noindex rule from pages you want indexed.

Step 4: Keep staging and production apart. Use environment variables or deploy scripts to make sure the staging robots.txt (which may block all crawlers) is never shipped to production. Add a deploy check that confirms the production robots.txt does not contain Disallow: / for all user-agents.
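Steps 1 and 4 lend themselves to automation. The Python sketch below shows one possible shape for both checks, working on values you have already fetched (for instance with urllib.request or curl). The function names response_problems and blocks_whole_site and the sample files are assumptions, not part of any existing tool. The whole-site check relies on the standard-library urllib.robotparser, which is simpler than Google's parser.

```python
from urllib.robotparser import RobotFileParser


def response_problems(status, content_type, body):
    """Step 1: flag a robots.txt response that is missing or served as HTML."""
    problems = []
    if status != 200:
        problems.append(f"expected status 200, got {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type!r}")
    if body.lstrip().lower().startswith(("<!doctype", "<html")):
        problems.append("body looks like an HTML page")
    return problems


def blocks_whole_site(robots_txt):
    """Step 4: True if the rules for all user-agents (*) disallow the root path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", "/")


# Hypothetical staging and production files.
staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"

print(blocks_whole_site(staging))     # True
print(blocks_whole_site(production))  # False
print(response_problems(404, "text/html", "<html>Not found</html>"))
```

In a deploy pipeline, a script like this could exit with a non-zero status when blocks_whole_site returns True for the file about to ship, which stops the release before the rule reaches production.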

Common mistakes when fixing this

Copying staging robots.txt into production. This is one of the fastest ways to vanish from search. A blanket Disallow: / rule tells each crawler to stay away from the full site.

Blocking a folder that holds key pages. Broad rules like Disallow: /blog/ are easy to add and easy to forget. If the blog has pages that drive organic traffic, the rule silently pulls them from the crawl.

Missing header-level X-Robots-Tag rules. CDNs, proxies, and server setups can add noindex headers with no visible change in the page source. Crawlers follow the most restrictive rule they find, so a noindex header applies even when the page's meta tag allows indexing, and it stays hidden unless you check the HTTP response.

Blocking CSS and JS files. Google needs these files to render the page. If robots.txt blocks /assets/ or /static/, the crawler may see a blank or broken page and cannot evaluate the content correctly.

How to verify the fix

Run a new SiteCurl scan and confirm the robots warnings are cleared. Then check the live /robots.txt file and the page headers to make sure the rules match your intent.

Use Search Console's URL Inspection tool to check if Google can reach your key pages. It shows if the page was blocked by robots.txt and if a noindex rule was found. For a broader look, review the Page indexing report (formerly Coverage) for pages kept out by robots.txt or noindex rules.

Example findings from a scan

robots.txt file not found

robots.txt contains an invalid directive

Page has noindex directive in HTML or response headers

Frequently asked questions

Does robots.txt remove a page from search results?

Not by itself. Robots.txt controls crawling, while noindex controls indexing. A disallowed URL can still appear in results if other pages link to it, so both need to be reviewed together.

Should I block internal search or account pages?

Often yes, but only if those pages do not need to rank. Crawl control should follow business intent, not habit.

Can X-Robots-Tag override page HTML?

Yes, in effect. Crawlers apply the most restrictive directive they find, so a header-level noindex can block indexing even when the page source looks clean.
