Part of the SEO audit
Check your crawl directives before they block the wrong pages
Robots.txt and noindex directives tell crawlers what to do. SiteCurl checks the file, validates basic rules, and surfaces indexing blockers.
No signup required. Results in under 60 seconds.
What this check does
SiteCurl fetches /robots.txt and checks whether the file exists and appears valid. It also looks for robots directives on the page itself, including noindex values that prevent pages from appearing in search.
The check validates the syntax of the robots.txt file, looking for malformed directives that crawlers may ignore or misinterpret. It also checks whether the file accidentally blocks CSS, JavaScript, or image resources that search engines need to render the page correctly.
On each scanned page, SiteCurl inspects both the HTML <meta name="robots"> tag and the HTTP X-Robots-Tag header for noindex directives. Pages marked noindex are flagged so you can verify the directive is intentional.
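In rough terms, the page-level part of this check can be sketched in Python. The names and structure below are illustrative, not SiteCurl's actual implementation; the sketch assumes the page HTML and response headers have already been fetched:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the content values of <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", "").lower())

def has_noindex(html, headers):
    """True if the page is marked noindex in its HTML or HTTP headers."""
    parser = RobotsMetaParser()
    parser.feed(html)
    in_meta = any("noindex" in d for d in parser.directives)
    # HTTP header names are case-insensitive, so compare them lowercased.
    header_value = next((v for k, v in headers.items()
                         if k.lower() == "x-robots-tag"), "")
    return in_meta or "noindex" in header_value.lower()
```

Checking both locations matters because either one alone is enough to keep a page out of the index.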
How this shows up in the real world
Robots.txt controls crawling, not indexing. A Disallow rule tells crawlers not to request a URL, but it does not remove the URL from search results if Google already knows about it from external links or previous crawls. To prevent indexing, you need a noindex directive. Confusing these two mechanisms is one of the most common technical SEO mistakes.
The robots.txt file is read before the crawler requests any page. If the file is missing, most crawlers assume everything is allowed. If the file contains a syntax error, the behavior depends on the crawler. Googlebot tends to be forgiving with minor errors, but other crawlers (Bing, AI crawlers, SEO tools) may interpret the file differently.
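Because crawlers disagree on how to treat malformed lines, a simple lint pass catches the obvious problems before a crawler has to guess. A minimal sketch, assuming a small set of recognized directives (real crawlers accept more):

```python
# Directive names this sketch recognizes; an assumption, not an exhaustive list.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text):
    """Return (line_number, line) pairs that crawlers may ignore or misread."""
    problems = []
    for n, line in enumerate(text.splitlines(), 1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((n, line))  # no field:value separator
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_DIRECTIVES:
            problems.append((n, line))  # unrecognized directive name
    return problems

print(lint_robots("User-agent: *\nDisalow: /admin/\nNoindex: /old/"))
# [(2, 'Disalow: /admin/'), (3, 'Noindex: /old/')]
```

Both flagged lines are real-world failure modes: "Disalow" is a typo Googlebot silently skips, and "Noindex" is a robots.txt directive Google no longer supports.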
Large sites often accumulate robots.txt rules over years of development. A rule added during a migration may still be active long after the migration is complete. Reviewing the file periodically prevents stale rules from blocking pages that should now be crawlable.
The interaction between robots.txt and noindex deserves special attention. If robots.txt blocks a page that also has a noindex tag, the crawler cannot see the noindex tag because it never requests the page. The page may remain in the index indefinitely, showing a snippet generated from anchor text instead of page content.
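Python's standard-library robots.txt parser makes the trap easy to demonstrate: when the rules block a URL, a well-behaved crawler never requests it, so a noindex tag inside the page is never seen. The crawl function, URLs, and page content here are illustrative:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /old/"])

def crawl(url, fetch_page):
    """Fetch a page only if robots.txt allows it, then look for noindex."""
    if not rp.can_fetch("*", url):
        # The page is never requested, so any noindex tag inside it is
        # invisible; the URL can stay indexed via external links.
        return "blocked: noindex (if any) not seen"
    html = fetch_page(url)
    return "noindex seen" if "noindex" in html else "indexable"

page = '<meta name="robots" content="noindex">'
print(crawl("https://example.com/old/page", lambda u: page))
# blocked: noindex (if any) not seen
```

To deindex a blocked page, the usual sequence is the reverse of intuition: unblock it in robots.txt first, let the crawler see the noindex tag, and only then (optionally) restore the block.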
Why it matters
Crawl-control mistakes are costly because they can affect an entire site. A single bad rule can hide key pages from search engines or block assets that search engines need in order to render the page properly.
A misplaced Disallow rule can take down entire sections of a site in search results. Unlike a content change that affects one page at a time, robots.txt rules match entire URL path prefixes. Blocking /services/ removes every page under that path from the crawl, which can mean dozens or hundreds of pages disappearing from search within days.
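The blast radius is easy to confirm with Python's standard-library parser: one Disallow line covers every URL under the path (the domain and paths here are examples):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /services/"])

# A single rule removes every URL under the path from the crawl.
urls = [
    "https://example.com/services/",
    "https://example.com/services/consulting",
    "https://example.com/services/audits/technical",
    "https://example.com/about",
]
for url in urls:
    print(url, rp.can_fetch("*", url))
# Only /about remains crawlable.
```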
Noindex directives on individual pages are equally powerful. A single meta tag or HTTP header can keep a page out of the index for as long as the directive stays in place. When these directives are added accidentally (by a CMS plugin, a deployment script, or a CDN configuration), the damage is silent until someone notices the traffic drop.
Who this impacts most
Sites that recently launched or migrated are most at risk. Development teams often add blanket Disallow rules during staging to prevent search engines from indexing test content. If those rules are not removed before launch, the production site launches with crawling blocked. This is one of the most common post-launch SEO failures.
Sites behind CDNs or reverse proxies should pay special attention to X-Robots-Tag headers. These headers can be added at the infrastructure level without any change to the application code. A misconfigured Cloudflare rule or Nginx directive can add noindex headers to every response without the development team knowing.
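A single infrastructure-level line is all it takes. This hypothetical Nginx snippet, perhaps written for a staging host, attaches a noindex header to every response without touching application code (the app_backend upstream name is made up for illustration):

```nginx
# Intended for a staging vhost; disastrous if copied to production.
location / {
    add_header X-Robots-Tag "noindex, nofollow";
    proxy_pass http://app_backend;
}
```

Because the header is added by the proxy, nothing in the application repository or the page source hints that it exists.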
WordPress and CMS-based sites sometimes have a 'Discourage search engines' setting that adds noindex directives site-wide. If this setting is accidentally enabled in production, every page on the site is told not to index. SiteCurl catches this by checking for noindex on every scanned page.
How to fix it
Step 1: Verify the file exists at the root domain. Visit https://yourdomain.com/robots.txt directly. It should return a 200 status with plain text content. If it returns a 404 or an HTML page, the file is not configured correctly.
Step 2: Review each Disallow rule. Read every rule in the file and confirm it only blocks areas you intend to hide: admin panels, internal search results, staging paths, and user account pages. Remove any rules that block public content, CSS files, or JavaScript files.
Step 3: Check for noindex directives on important pages. Open the page source and search for noindex. Also check the HTTP response headers using your browser's developer tools or a curl -I command. Remove any noindex directive from pages you want indexed.
Step 4: Separate staging and production setups. Use environment variables or deployment scripts to ensure the staging robots.txt (which may block all crawlers) is never deployed to production. Add a deployment check that verifies the production robots.txt does not contain Disallow: / for all user-agents.
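The deployment check from step 4 can be a few lines in CI. A sketch, assuming the production robots.txt is available as text; it only looks for the blanket-block case, so treat it as a starting point rather than a complete validator:

```python
def blocks_everything(robots_txt):
    """True if robots.txt contains a blanket Disallow: / under User-agent: *."""
    agent_is_wildcard = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blanks
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            agent_is_wildcard = value == "*"
        elif field == "disallow" and agent_is_wildcard and value == "/":
            return True
    return False

# Fail the deploy if production would block all crawlers.
production_rules = "User-agent: *\nDisallow: /admin/\n"
assert not blocks_everything(production_rules), "robots.txt blocks the whole site"
```

Wiring this into the pipeline turns a silent post-launch disaster into a failed build.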
Common mistakes when fixing this
Copying staging robots.txt into production. This is one of the fastest ways to disappear from search. A blanket Disallow: / rule tells every crawler to stay away from the entire site.
Blocking a folder that contains important pages. Broad rules like Disallow: /blog/ are easy to add and easy to forget. If the blog contains pages that drive organic traffic, the rule silently removes them from the crawl.
Forgetting header-level X-Robots-Tag directives. CDNs, proxies, and server setups can add noindex headers without any visible change in the page source. These headers override page-level settings and are invisible unless you inspect the HTTP response.
Blocking CSS and JavaScript resources. Search engines need these files to render the page. If robots.txt blocks /assets/ or /static/, the crawler sees a blank or broken page and cannot evaluate the content properly.
How to verify the fix
Run another SiteCurl scan and confirm the robots-related warnings are cleared. Then inspect the live /robots.txt file and the page headers to make sure the directives match your intent.
Use Google Search Console's URL Inspection tool to check whether Google can access your key pages. The tool shows whether the page was blocked by robots.txt and whether a noindex directive was detected. For a broader check, review the Coverage report for pages excluded due to robots.txt or noindex rules.
Example findings from a scan
robots.txt file not found
robots.txt contains an invalid directive
Page has noindex directive in HTML or response headers
Related checks
Frequently asked questions
Does robots.txt remove a page from search results?
Not by itself. Robots.txt controls crawling, while noindex controls indexability. In practice, both need to be reviewed together.
Should I block internal search or account pages?
Often yes, but only if those pages do not need to rank. Crawl control should follow business intent, not habit.
Can X-Robots-Tag override page HTML?
Yes. Header-level directives can block indexing even when the page source looks clean.
Check your robots directives now