Why Technical SEO Needs an AI Layer

By Tom
9 min read
Over the past few months I've been busy rebuilding a significant part of the SiteVitals SEO checker. What started as a fairly standard technical SEO audit - title tags, canonical checks, robots directives - has grown into something considerably more interesting: a tool that can tell you not just whether Google can read your page, but whether AI tools can understand it well enough to recommend you.

This post covers what we built, the technical decisions behind it, and a few things we got wrong the first time.

Why Technical SEO Needs an AI Layer

The existing SEO checker was solid at catching the classic failure modes. Noindex tags in production, missing canonical tags, broken links, malformed Open Graph data. The kind of thing that costs you rankings while the site looks fine to a real user.

But over the last year it has become increasingly obvious that there is a second audience for your pages that conventional SEO tooling does not really account for: AI crawlers. GPTBot, ClaudeBot, PerplexityBot, Google's own AI crawler - these are all making decisions about your content that have nothing to do with traditional search ranking signals. They are reading your structured data to understand what kind of entity you are. They are checking your robots.txt to see if they are even allowed in. And increasingly, they are looking for llms.txt - a relatively new convention that gives AI tools a structured summary of what your site is and what is worth reading.

This matters more than most site owners currently realise. If your structured data is broken, an AI tool cannot reliably identify what your business does or when to recommend it. If your robots.txt is blocking AI crawlers - often unintentionally, through an overly broad rule added by a plugin - your content will not appear in AI-assisted search results regardless of how good it is. These are not edge cases. We see them on well-maintained, professionally built sites every week.

None of that was in the original checker. So we added it.

The Schema Validator: Getting It Right Took Three Attempts

The most technically involved piece was the schema validator. Schema.org markup has always been part of the SEO check - we flagged whether JSON-LD was present - but presence alone tells you almost nothing useful. A page can have structured data that is technically there but structurally broken, and Google's Rich Results Test will tell you exactly what is wrong while our checker says "pass".

The first version tried to be clever about nested schemas. JSON-LD can be structured as a @graph containing multiple nodes, as a root array, or as a single root object - and nested typed objects can appear as property values within any of those. The problem was that validator.schema.org and our checker were reporting completely different block counts for the same page.

The root cause was that we were trying to detect nested types after decoding and filter them out. Correct in theory; in practice it produced false positives - surfacing schemas like BreadcrumbList and Organization as standalone blocks when they were actually just property values within a WebPage.

The fix was simpler than the original approach: stop trying to detect nesting after the fact, and instead never recurse into property values at all. A root object is one block. A root array is N blocks. A @graph is N blocks. A BreadcrumbList that lives inside WebPage.breadcrumb is never touched. That matches what validator.schema.org actually validates, and the block counts now agree.
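The counting rule is easy to state in code. Here is a minimal sketch in Python of that rule - the function name and shape are illustrative, not SiteVitals' actual implementation:

```python
import json

def count_schema_blocks(raw_jsonld: str) -> list[dict]:
    """Split a JSON-LD payload into top-level schema blocks.

    Root object -> 1 block, root array -> N blocks, @graph -> N blocks.
    Nested typed objects (e.g. WebPage.breadcrumb) are never recursed into.
    """
    data = json.loads(raw_jsonld)
    if isinstance(data, list):            # root array: N blocks
        return [node for node in data if isinstance(node, dict)]
    if isinstance(data, dict):
        graph = data.get("@graph")
        if isinstance(graph, list):       # @graph: N blocks
            return [node for node in graph if isinstance(node, dict)]
        return [data]                     # single root object: 1 block
    return []
```

Because the function never walks into property values, a BreadcrumbList sitting inside `WebPage.breadcrumb` simply travels along with its parent block and is never counted separately.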

Three Levels of Validation Per Block

Once extraction was reliable, the actual validation runs in three passes per block.

Level 1 - Structural. Does the block have a known @type? Are all required properties present? Required failures produce a fail status. Missing recommended properties produce a warning. For types like FAQPage and BreadcrumbList, there are additional structural checks that go deeper than just property presence. The FAQPage validator walks the mainEntity array and checks that each entry is a Question type with a name and an acceptedAnswer with text. The BreadcrumbList validator checks that each item has a numeric position and flags single-item breadcrumbs, since Google recommends at least two crumbs to display them in search results.
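The FAQPage walk described above can be sketched in a few lines of Python. This is an illustrative sketch of the structural pass, not the production validator - names and message strings are assumptions:

```python
def validate_faqpage(block: dict) -> list[str]:
    """Structural checks for a FAQPage block: each mainEntity entry must be
    a Question with a name and an acceptedAnswer that has text."""
    issues = []
    entries = block.get("mainEntity")
    if not isinstance(entries, list) or not entries:
        return ["FAQPage: mainEntity missing or empty"]
    for i, question in enumerate(entries):
        if not isinstance(question, dict) or question.get("@type") != "Question":
            issues.append(f"mainEntity[{i}]: not a Question")
            continue
        if not question.get("name"):
            issues.append(f"mainEntity[{i}]: Question missing name")
        answer = question.get("acceptedAnswer")
        if not isinstance(answer, dict) or not answer.get("text"):
            issues.append(f"mainEntity[{i}]: acceptedAnswer missing or has no text")
    return issues
```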

Level 2 - Semantic. Are the values in the right format? URL fields are validated, date fields checked, offer prices verified as numeric. A common one we catch is the author property being a plain string rather than a nested Person or Organization object - technically present, but semantically weaker than it should be for AI parsing.
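A few of those semantic checks, sketched in Python. This is a simplified illustration of the idea - the real checker covers more fields and formats:

```python
def check_semantic(block: dict) -> list[str]:
    """Level 2 semantic checks (sketch): values must be in the right
    format, not merely present."""
    issues = []
    # author as a bare string is valid JSON-LD but weaker for AI parsing
    author = block.get("author")
    if isinstance(author, str):
        issues.append("author is a plain string; use a nested Person or Organization")
    # offer prices must be numeric
    offers = block.get("offers")
    if isinstance(offers, dict) and offers.get("price") is not None:
        try:
            float(offers["price"])
        except (TypeError, ValueError):
            issues.append("offers.price is not numeric")
    # url fields must be absolute http(s) URLs
    url = block.get("url")
    if isinstance(url, str) and not url.startswith(("http://", "https://")):
        issues.append("url is not an absolute http(s) URL")
    return issues
```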

Level 3 - AI readiness. This is the new layer. For named entity types - Organization, Product, Article, LocalBusiness and so on - we assess whether the block gives an AI agent enough to unambiguously identify the entity. That means checking for name, url, and description together. A block with just a name and URL is findable, but without a description an AI agent has little context to describe the entity accurately.

We also check sameAs links against a list of authority domains - Wikipedia, Wikidata, LinkedIn, Crunchbase, Companies House. These are how AI knowledge graphs resolve an entity to something they already know. An Organization block without any authority sameAs links is valid schema but a missed opportunity for AI discoverability. We surface that as a recommendation rather than an error, but it is worth acting on.
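The sameAs check reduces to matching link hostnames against a domain list. A minimal sketch in Python - the domain list here is illustrative (the Companies House hostname in particular is an assumption), and the real checker's list may differ:

```python
from urllib.parse import urlparse

# Illustrative authority list; SiteVitals' actual list may differ.
AUTHORITY_DOMAINS = {
    "wikipedia.org", "wikidata.org", "linkedin.com",
    "crunchbase.com", "company-information.service.gov.uk",  # Companies House (assumed host)
}

def has_authority_sameas(block: dict) -> bool:
    """True if any sameAs link points at a known authority domain."""
    links = block.get("sameAs") or []
    if isinstance(links, str):  # sameAs may be a single URL or an array
        links = [links]
    for link in links:
        host = urlparse(link).netloc.lower().removeprefix("www.")
        if any(host == d or host.endswith("." + d) for d in AUTHORITY_DOMAINS):
            return True
    return False
```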

We also flag speakable markup and FAQPage presence, since FAQPage is the highest-value type for AI question-answering.

The practical upshot: a page can have schema that passes every traditional validator and still score poorly on AI readiness. Missing a description, no authority sameAs links, no FAQ content - these are not errors that break your rich results, but they are signals that determine whether an AI tool can confidently recommend you over a competitor whose schema is more complete.

The AI Crawler Audit

Separate from schema, we added a dedicated check for AI crawler access via robots.txt. The check covers the current list of known AI agents:

  • GPTBot (OpenAI)
  • ClaudeBot and anthropic-ai (Anthropic)
  • PerplexityBot (Perplexity)
  • GoogleOther (Google's AI crawler)
  • Amazonbot
  • cohere-ai
  • Meta-ExternalAgent

The check reports which crawlers are blocked, whether there is a Crawl-delay directive affecting them, and whether robots.txt exists at all.
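A check along these lines can be built entirely on the standard library's robots.txt parser. This is a sketch, not SiteVitals' implementation - it tests access to the site root and reads any per-agent Crawl-delay:

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "anthropic-ai", "PerplexityBot",
               "GoogleOther", "Amazonbot", "cohere-ai", "Meta-ExternalAgent"]

def audit_ai_crawlers(robots_txt: str,
                      site_url: str = "https://example.com/") -> dict:
    """Report, per AI crawler, whether robots.txt allows the site root
    and whether a Crawl-delay applies."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        agent: {
            "allowed": parser.can_fetch(agent, site_url),
            "crawl_delay": parser.crawl_delay(agent),
        }
        for agent in AI_CRAWLERS
    }
```

Note that a missing robots.txt is a separate case: the file's absence means everything is allowed, so the checker has to distinguish "no file" from "file with no blocking rules" before reporting.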

This one catches people out more than you might expect. A site that has blocked GPTBot - often through an overly broad Disallow: / rule, or a security plugin that added the block without making it obvious - will not appear in ChatGPT's browsing results regardless of how good the content is. We have seen this on sites where the owner was actively trying to grow their visibility in AI search, completely unaware that their own robots.txt was blocking the door.

The llms.txt Checker

llms.txt is a relatively new convention that gives AI tools a structured plaintext summary of what a site is and what is worth reading. The format is simple - a markdown file with a title, description, and optional sections linking to key content - but the implementation details matter.

Our checker validates whether the file exists, whether it is parseable, whether the title and description meet minimum length thresholds, whether linked resources are reachable, and whether llms-full.txt is also present.

One edge case worth noting: some sites return a 200 OK for llms.txt requests even when the file does not exist, because their CMS catches the 404 and serves a custom error page with a 200 status code. We handle this by checking the response body for expected markdown structure rather than trusting the status code alone.
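The soft-404 heuristic boils down to asking whether the body looks like llms.txt markdown rather than a rendered HTML error page. A sketch of that idea in Python - the exact heuristics SiteVitals uses may differ:

```python
import re

def looks_like_llms_txt(body: str, content_type: str = "") -> bool:
    """Heuristic soft-404 check: does a 200 response body actually look
    like an llms.txt file (markdown opening with an H1 title)?"""
    # A CMS error page served with a 200 is typically HTML
    if "text/html" in content_type.lower():
        return False
    if body.lstrip().lower().startswith(("<!doctype", "<html")):
        return False
    # llms.txt should open with an H1 title, e.g. "# Site Name"
    return bool(re.match(r"^\s*#\s+\S", body))
```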

Think of llms.txt as the new robots.txt - but instead of telling crawlers where not to go, it tells AI tools how to understand what you do. Sites that have invested in good content but have no llms.txt are leaving AI tools to figure out their context from scratch, which means they are more likely to be summarised inaccurately or overlooked entirely.

What This Looks Like in Practice

Running the full check against a real site - 18aproductions.co.uk in this case, since we use it as a test bed - the output identified:

  • Two schema blocks with a missing address property on Organization and OnlineBusiness types
  • One duplicate OnlineBusiness block injected by a third-party script
  • A WebSite block with an empty description field - present but blank, which Yoast sometimes produces by default
  • No llms.txt found at the domain root
  • All AI crawlers permitted by robots.txt

The missing address on the Organization block is a genuine gap - schema.org requires it, and without it the block cannot be used to generate a Knowledge Panel. The empty description in the WebSite block is worth fixing but lower priority. The duplicate OnlineBusiness is informational - it was being added intentionally by an AIProfiles integration - but worth knowing about.

None of these issues would have been caught by a conventional SEO audit. The page title is fine. The meta description is within length. The canonical is correct. By traditional measures, this page would have passed. By AI readiness measures, there is meaningful work to do.

All of This Is Now in the Free SiteVitals SEO Scan

The schema validator, AI crawler audit, and llms.txt checker all run as part of the standard SEO scan in SiteVitals, with results surfaced in plain language alongside the traditional technical SEO checks.

If you want to see how your own site scores - whether your structured data is complete enough for AI tools to understand you, whether the right crawlers can reach your pages, and whether you have an llms.txt in place - you can run a free check with no account required.

Run a free SEO and AI visibility check on your website at SiteVitals →
