Does heading structure affect AI citation rates?

Yes, measurably. Research correlates question-phrased H2 and H3 headings with higher citation rates in AI-generated responses. The reason is practical: AI retrieval systems match user queries to page content, and a heading phrased as the question a user typed is a near-exact signal. Standard SEO heading rules still apply: one H1 per page, major sections under H2, subtopics under H3.

What does llms.txt actually do, and should I add one to my site?

llms.txt is a Markdown file at your domain root that describes your site's content for AI systems. As of June 2026, Google has confirmed it has no effect on Search rankings or AI Overviews, and no major AI platform has confirmed it affects citation behavior in search-style responses. Its main documented use is in developer tooling: IDEs like Cursor and AI coding assistants like Claude Code fetch it when navigating documentation. For business sites, it is low-effort and harmless to implement, but it is not an AEO priority.

Which AI crawlers should I allow in robots.txt?

The key distinction is between training crawlers and retrieval crawlers. Training crawlers like GPTBot and CCBot collect content for model training. Retrieval crawlers like OAI-SearchBot, PerplexityBot, and Claude-SearchBot index content to serve in AI-generated answers. Blocking retrieval crawlers removes your content from those platforms' responses. A targeted approach blocks training-only crawlers while allowing retrieval crawlers, rather than applying a blanket Disallow to all AI user agents.

Does schema markup help with Google AI Overviews?

Not directly. Google states that structured data is not required for generative AI features and there is no special schema.org markup that improves AI visibility. Schema still matters for traditional search rich results and for the broader AI ecosystem outside Google. Critically, LLMs read JSON-LD as plain tokenized text rather than as a structured graph, which means the entity names and facts in your schema get processed, but the structured relationships are not interpreted as intended. Content quality remains the dominant citation factor.

How to Structure a Page for Both SEO and AEO in 2026

Is FAQPage schema worth adding if Google no longer shows FAQ rich results?

Yes. Google deprecated FAQ rich results for most sites in May 2026, reserving the feature for government and health-focused authoritative sources. However, Bing, ChatGPT, Perplexity, Claude, and Gemini all process FAQPage schema. Question-and-answer pairs in structured data map directly to the query formats these platforms handle, making FAQPage schema more valuable for the broader AI ecosystem than it was when Google rich results were the primary benefit.

Google’s core ranking signals have not changed as dramatically as the headlines suggest. What has changed is that your page now needs to satisfy two parallel reading systems: the traditional crawler that indexes your content for ranked results, and the retrieval systems that pull passages into AI-generated answers across Google AI Overviews, ChatGPT, Perplexity, Claude, and Gemini.

The best practices overlap substantially. A page structured well for human readers is already close to what AI retrieval systems need. The gaps are specific and addressable. What follows covers each element with code examples you can use directly.

What Heading Structure Actually Does for Search and AI

Use one H1 per page. Google does not penalize pages with multiple H1 tags, but a single H1 gives both crawlers and AI systems an unambiguous topic signal. Keep it between 30 and 65 characters and front-load the primary entity or keyword. If a reader could identify the page topic from the H1 alone, it is doing its job.

For H2s, use question phrasing where the section is answering a user query. Research from Search Engine Land found that question-structured H2 and H3 headings correlate with higher citation rates in AI responses. The mechanism is practical, not algorithmic: retrieval systems map user queries to page content, and a heading phrased as the question a user typed is a near-exact match signal. This also benefits human readers who scan rather than read linearly.

H3s handle subtopics within each H2 section. They can be questions or statements depending on which makes the content easier to navigate.

<!-- Page heading hierarchy example -->

<h1>How to Structure a Page for Both SEO and AEO in 2026</h1>

<h2>What heading structure does Google want in 2026?</h2>
  <h3>How many H1 tags should a page have?</h3>
  <h3>Should H2 headings be phrased as questions?</h3>

<h2>Which schema types still matter after Google's May 2026 updates?</h2>
  <h3>Is FAQPage schema worth adding if Google no longer shows FAQ rich results?</h3>

<h2>What does llms.txt do and should I add one?</h2>

Google uses heading hierarchy to understand the topical relationships across a page. AI systems use it as a navigation map when extracting relevant passages for a response. Both reward a clear, logical hierarchy over a flat heading structure where every section sits at H2 regardless of depth.
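One way to check a page against these rules is to pull out its heading outline and look for obvious problems. The following is a minimal Python sketch using only the standard library. The names HeadingOutline and audit_headings are illustrative, and the two checks (exactly one H1, no skipped levels) are a reading of the guidance above, not an official validator.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collects (level, text) for every h1-h6 element, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buf).strip()))
            self._level = None


def audit_headings(html):
    """Return (headings, problems) for an HTML string."""
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()

    problems = []
    h1_count = sum(1 for level, _ in parser.headings if level == 1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")

    previous = 0
    for level, text in parser.headings:
        # Going deeper by more than one level (h1 -> h3) breaks the hierarchy.
        if previous and level > previous + 1:
            problems.append(f"h{level} '{text}' skips a level after h{previous}")
        previous = level
    return parser.headings, problems


sample = """
<h1>How to Structure a Page for Both SEO and AEO in 2026</h1>
<h2>What heading structure does Google want in 2026?</h2>
<h3>How many H1 tags should a page have?</h3>
<h2>What does llms.txt do and should I add one?</h2>
"""
headings, problems = audit_headings(sample)
for level, text in headings:
    print("  " * (level - 1) + f"h{level}: {text}")
print(problems or "no problems found")
```

Run against a real page, this gives a quick view of whether the outline reads as a navigation map or as a flat list of H2s.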

The Schema Types That Still Matter

JSON-LD is the only schema format worth implementing in 2026. Microdata is outdated. RDFa is technically valid but adds markup complexity without practical benefit. JSON-LD sits in a <script type="application/ld+json"> block, usually in your <head>, completely separate from your visible content, which makes it easy to add and maintain without touching your HTML structure.
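If your pages are generated from templates, the script block can be produced from a plain dictionary. This is a small sketch, assuming a Python build step; jsonld_script_tag is a hypothetical helper name, and the escaping of "</" is a defensive choice so that a string value cannot close the script element early.

```python
import json


def jsonld_script_tag(data):
    """Serialize a dict as a JSON-LD <script> block for the page <head>."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # "\/" is a valid JSON escape for "/", so the data is unchanged when parsed,
    # but the literal sequence "</script>" can no longer appear in the payload.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


example = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Why </script> needs escaping",
}
tag = jsonld_script_tag(example)
print(tag)
```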

Article schema

For blog posts, guides, and editorial pages, Article schema is the baseline. It establishes authorship, publication date, and topic for Knowledge Graph entity recognition and E-E-A-T evaluation. The dateModified field matters for freshness-sensitive queries.

{
  "@context": "https://schema.org",
  "@type": "Article",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://example.com/blog/your-post"
  },
  "headline": "Your Article Title",
  "datePublished": "2026-07-01",
  "dateModified": "2026-07-01",
  "author": {
    "@type": "Organization",
    "name": "Your Organization",
    "url": "https://example.com"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Your Organization",
    "url": "https://example.com",
    "logo": {
      "@type": "ImageObject",
      "url": "https://example.com/images/logo.png"
    }
  }
}

BreadcrumbList schema

BreadcrumbList tells Google exactly where this page sits in your site hierarchy. That can produce a breadcrumb trail in search results, and it helps AI systems understand the relationship between your content and the rest of your site.

{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "name": "Home",
      "item": "https://example.com/"
    },
    {
      "@type": "ListItem",
      "position": 2,
      "name": "Blog",
      "item": "https://example.com/blog"
    },
    {
      "@type": "ListItem",
      "position": 3,
      "name": "Your Article Title",
      "item": "https://example.com/blog/your-post"
    }
  ]
}
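Position numbers are easy to get wrong when breadcrumbs are written by hand. A short sketch like the following, assuming you already have the trail as an ordered list of (name, URL) pairs, numbers the items for you. breadcrumb_jsonld is an illustrative name, not part of any library.

```python
import json


def breadcrumb_jsonld(trail):
    """Build BreadcrumbList JSON-LD from an ordered list of (name, url) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {"@type": "ListItem", "position": position, "name": name, "item": url}
            for position, (name, url) in enumerate(trail, start=1)
        ],
    }


crumbs = breadcrumb_jsonld([
    ("Home", "https://example.com/"),
    ("Blog", "https://example.com/blog"),
    ("Your Article Title", "https://example.com/blog/your-post"),
])
print(json.dumps(crumbs, indent=2))
```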

Organization schema on your homepage

A complete Organization schema on your homepage is the single most important entity signal you can send. Include your address, phone, email, and sameAs links pointing to your verified social and directory profiles. This is where AI platforms form their baseline understanding of who your business is and what it does. A correctly nested AggregateRating, if you have reviews, adds credibility signals on top of the identity data.
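As a rough illustration of what "complete" can look like, here is a Python sketch that assembles Organization JSON-LD with the fields mentioned above. Every value is a placeholder, organization_jsonld is a hypothetical helper, and the rating is only attached when you pass real review data.

```python
import json


def organization_jsonld(name, url, telephone, email, address, same_as, rating=None):
    """Build Organization JSON-LD; `rating` is an optional (value, count) pair."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "telephone": telephone,
        "email": email,
        "address": {"@type": "PostalAddress", **address},
        "sameAs": list(same_as),
    }
    if rating is not None:
        rating_value, review_count = rating
        # AggregateRating is nested inside the Organization it describes.
        data["aggregateRating"] = {
            "@type": "AggregateRating",
            "ratingValue": rating_value,
            "reviewCount": review_count,
        }
    return data


org = organization_jsonld(
    name="Your Organization",
    url="https://example.com",
    telephone="+1-555-0100",        # placeholder
    email="hello@example.com",      # placeholder
    address={
        "streetAddress": "123 Example Street",
        "addressLocality": "Example City",
        "addressRegion": "CA",
        "postalCode": "00000",
        "addressCountry": "US",
    },
    same_as=[
        "https://www.linkedin.com/company/example",
        "https://www.facebook.com/example",
    ],
    rating=(4.8, 37),               # placeholder review data
)
print(json.dumps(org, indent=2))
```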

FAQPage Schema: What Google Deprecated and What Still Works

Google deprecated FAQ rich results for most sites in May 2026. They now display only for government and health-focused authoritative sources. If you run a business blog or service site, you will not see the expandable FAQ accordion in Google search results regardless of how well your FAQPage schema is implemented.

FAQPage schema still has real value, just aimed at a different audience than before.

Bing, ChatGPT, Perplexity, Claude, and Gemini all process schema.org markup as part of their content understanding pipelines. Question-and-answer pairs map directly to the query formats these platforms handle every day. An AI system generating a response to a user question can extract an answer from FAQPage markup more cleanly than from dense prose, because the question and its answer are already paired in a machine-readable format.

Google deprecated FAQ rich results for most sites. The rest of the AI ecosystem did not follow. FAQPage schema in 2026 is primarily an investment in non-Google AI visibility, not Google rich result eligibility.

The practical recommendation: include FAQPage schema on any page with genuine FAQ content. The primary audience for that schema is now Bing, ChatGPT, Perplexity, and Claude rather than Google’s search results page. If you have 5 to 12 real questions your customers ask, mark them up.

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Your question here?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Your concise, direct answer here. Write it as a complete sentence that works out of context, because AI systems may extract it independently of the surrounding page."
      }
    },
    {
      "@type": "Question",
      "name": "A second question?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The second answer. Each answer should stand alone without requiring the reader to have seen the question in context."
      }
    }
  ]
}
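If the questions and answers already live in your CMS or a spreadsheet, the markup can be generated instead of hand-edited. The sketch below assumes the FAQ is available as a list of (question, answer) pairs; faq_jsonld is an illustrative name.

```python
import json


def faq_jsonld(pairs):
    """Build FAQPage JSON-LD from (question, answer) pairs, skipping empty ones."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question.strip(),
                "acceptedAnswer": {"@type": "Answer", "text": answer.strip()},
            }
            for question, answer in pairs
            if question.strip() and answer.strip()
        ],
    }


faq = faq_jsonld([
    ("Your question here?", "Your concise, direct answer here."),
    ("A second question?", "The second answer."),
    ("An unanswered question?", "   "),  # dropped: no answer text
])
print(json.dumps(faq, indent=2))
```

Keeping the visible FAQ and the markup generated from the same source also avoids the two drifting apart over time.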

How Headings and FAQs Work Together for AEO

The most effective structure for answer engine optimization pairs question H2 headings with a FAQ section that mirrors those questions. The H2 establishes the topic at the section level with detail in the body text below it; the FAQ provides a direct, concise answer in structured form at the page level. AI retrieval systems benefit from both: the full-context explanation in the body, and the extractable Q&A pair in the FAQ.
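A simple way to keep the two in sync is to compare the question headings against the FAQ entries. This is only a sketch: it assumes you can already list your H2 text and FAQ questions, it matches on normalized wording instead of meaning, and the function names are made up for the example.

```python
def normalize(question):
    """Lowercase, drop the trailing question mark, and collapse whitespace."""
    return " ".join(question.lower().strip().rstrip("?").split())


def unmatched_questions(h2_headings, faq_questions):
    """Return question-style H2 headings that have no mirrored FAQ entry."""
    faq = {normalize(q) for q in faq_questions}
    return [
        heading
        for heading in h2_headings
        if heading.strip().endswith("?") and normalize(heading) not in faq
    ]


h2s = [
    "What does llms.txt actually do?",
    "The Schema Types That Still Matter",
    "Which AI crawlers should I allow in robots.txt?",
]
faqs = ["what does llms.txt actually do?"]
missing = unmatched_questions(h2s, faqs)
print(missing)
```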

Place your core answer in the opening paragraph. Research shows that 55% of AI Overview citations pull from the first 30% of page content. The heading signals the question; the first sentence below it should answer it directly before expanding into detail. This mirrors how featured snippet optimization has always worked, and the principle transfers to AI retrieval because both systems reward content that answers the query before elaborating.

What LLMs Actually Read on Your Page

Schema markup is often discussed as if it provides a structured graph that AI systems navigate, where "@type": "Article" and "author" properties communicate semantic relationships in a machine-readable hierarchy. Language models actually process your page very differently.

LLMs tokenize your entire page as plain text, including JSON-LD blocks. The model does not parse the graph relationships in your schema. It reads "author": { "@type": "Organization", "name": "WebSight Design" } as a sequence of tokens, the same way it reads your body copy. The structured relationships are not interpreted as graph edges.

What this means in practice:

The entity names and facts in your schema do get processed. An Organization schema with your business name, address, and sameAs links adds those facts as tokens in the content the model reads. That is part of why schema has marginal positive value for AI citation behavior even though the semantic structure is invisible to the model. You are not sending a graph. You are embedding additional factual text.

Content quality is the dominant citation signal across every AI platform. A specific, well-sourced answer to a real user question consistently outperforms a thin paragraph with extensive schema decoration.
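To make the "additional factual text" point concrete, the sketch below flattens a JSON-LD object into the string values it contains. It is an illustration of the idea, not a model of how any particular LLM tokenizes a page, and leaf_strings is a hypothetical helper.

```python
def leaf_strings(node):
    """Yield every string value in a nested JSON-LD structure, in order."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_strings(item)
    elif isinstance(node, str):
        yield node


schema = {
    "@type": "Article",
    "author": {"@type": "Organization", "name": "WebSight Design"},
}
facts = list(leaf_strings(schema))
print(facts)
```

Once flattened, nothing in the output says that "WebSight Design" is the author of the article. The nesting carried that relationship, and the nesting is what gets lost when the block is read as plain text.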

This aligns with what our own research into AI citation factors found, and with Google’s official AI search guidance. The structured data provides context. The content earns the citation. To see how AI platforms currently describe your brand, you can run a free audit across ChatGPT, Claude, Gemini, and Perplexity.

llms.txt: What the Evidence Shows as of June 2026

llms.txt is a plain text or Markdown file placed at your domain root that describes your site’s content and links to the pages most useful to AI systems. The format was proposed in 2024 and positioned as a robots.txt equivalent for large language models, giving site owners a way to curate what AI systems see.

Here is the current state of evidence, as of June 2026:

Google confirmed on June 15, 2026 that llms.txt has no effect on Search rankings or AI Overviews. OpenAI, Anthropic, Meta, and Google have not announced formal support for the spec as a citation signal. Independent monitoring found only 408 llms.txt requests out of over 500 million AI bot visits across a 90-day window. Eight out of nine sites saw no measurable change in traffic after implementing one.

Where it does demonstrably work: developer tooling. LangChain ships an open-source MCP server that fetches and exposes llms.txt files from documentation sites. Cursor, Claude Code, Windsurf, and Aider all consume llms.txt when navigating documentation in real time. For software documentation sites, this is genuine utility. The AI coding assistant pointing at your docs will use it. The AI answering your customers’ questions about your business almost certainly will not.

For business sites, llms.txt is a low-effort implementation with no confirmed SEO upside and no confirmed downside. The parallel to robots.txt is instructive: robots.txt was a de facto convention for years before any formal standardization, and adoption happened because the convention was already in widespread use. llms.txt may follow the same trajectory, or it may not. Adding one takes about 30 minutes and does not hurt anything.

# Your Site Name

> One-sentence description of what your site or business does.

## Key Pages

- [Home](https://example.com/): Brief description
- [How It Works](https://example.com/how-it-works): How the product or service works
- [Pricing](https://example.com/pricing): Pricing and plans
- [Blog](https://example.com/blog): Articles and guides

## Optional: Full Content Index

- [Article Title](https://example.com/blog/slug): Brief description of the article
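If you would rather generate the file than maintain it by hand, a few lines of Python are enough. This sketch assumes you can list your key pages as (label, URL, description) tuples; render_llms_txt is an illustrative name, and the layout simply follows the template above.

```python
def render_llms_txt(site_name, summary, sections):
    """Render llms.txt text from a site name, a summary, and titled link sections."""
    lines = [f"# {site_name}", "", f"> {summary}"]
    for title, links in sections:
        lines += ["", f"## {title}", ""]
        lines += [f"- [{label}]({url}): {note}" for label, url, note in links]
    return "\n".join(lines) + "\n"


SECTIONS = [
    ("Key Pages", [
        ("Home", "https://example.com/", "Brief description"),
        ("Pricing", "https://example.com/pricing", "Pricing and plans"),
    ]),
]
text = render_llms_txt(
    "Your Site Name",
    "One-sentence description of what your site or business does.",
    SECTIONS,
)
print(text)
```

Write the result to llms.txt at your domain root as part of your normal build or deploy step.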

robots.txt and the AI Crawler Landscape

robots.txt has been the de facto standard for web crawler access control since 1994. As Wikipedia’s robots.txt article documents, most major web crawlers were complying with it by June 1994, so it became a standard through voluntary adoption long before any formal specification existed. The Robots Exclusion Protocol was formally published as RFC 9309 in 2022, nearly three decades after the convention entered common use.

In 2026, the crawler landscape has shifted considerably. AI crawlers now account for approximately 26.7% of verified bot traffic, based on Cloudflare’s May 2026 network data. The practical question is no longer simply “allow or deny” but “training or retrieval.”

The distinction is commercially significant:

Training crawlers collect your content to train foundation models. GPTBot collects content for OpenAI model training. CCBot collects content for Common Crawl datasets used across multiple AI labs. Blocking these prevents your content from appearing in future training data.

Retrieval crawlers index your content to serve in AI-generated search results. OAI-SearchBot indexes content for ChatGPT Search results. PerplexityBot indexes content for Perplexity answers. Blocking retrieval crawlers removes your content from those platforms’ responses to user queries.

If you apply a blanket Disallow to all AI crawlers without distinguishing between the two types, you may protect your content from training use while inadvertently making your brand invisible in AI search results. Most practitioners in 2026 take a selective approach: block training-only crawlers by preference, and allow retrieval crawlers explicitly.

User-agent: *
Disallow:

# Block training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI retrieval/search crawlers explicitly
# (they default to allowed, but documenting intent is useful)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

Note that RFC 9309 does not define AI-specific directives, and some AI training crawlers have been documented ignoring robots.txt. Compliance among retrieval crawlers is generally higher because those companies have commercial reasons to maintain publisher trust. robots.txt is not an absolute control mechanism; it is a broadly respected convention.
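Before deploying a selective policy, it is worth checking that the rules say what you think they say. The sketch below uses Python's standard urllib.robotparser against an inline copy of a selective policy. The crawler_access helper, the test URL, and the ExampleBot agent are assumptions for illustration, and the result only shows how a compliant crawler should interpret the file, not how any given crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def crawler_access(robots_txt, agents, url="https://example.com/blog/your-post"):
    """Return {agent: True/False} for whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    ROBOTS_TXT,
    ["GPTBot", "CCBot", "OAI-SearchBot", "PerplexityBot", "ExampleBot"],
)
for agent, allowed in access.items():
    print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

Python's parser applies rules in file order and matches agent names by substring, which differs slightly from the longest-match rule in RFC 9309. For simple allow-all and deny-all groups like these the outcome should be the same, but treat it as a sanity check and not a conformance test.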

If AI platforms are not seeing your site at all, robots.txt may not be the only or even primary cause. Cloudflare security rules, JavaScript rendering limitations, noindex tags, paywalls, and other access restrictions all affect whether AI crawlers can read your content. For a full breakdown of common reasons AI systems cannot access site content, see Why can’t AI see my site?

Sources

Mark Up FAQs with Structured Data, Google Search Central
Changes to HowTo and FAQ rich results, Google Search Central Blog
robots.txt, Wikipedia
RFC 9309: Robots.txt Is Now an Official IETF Internet Standard, Search Engine World
Google’s Official AI Search Optimization Guide Pushes Back on Common AEO Tactics, WebSight Design
The Schema Myth: What AI Platforms Actually Use to Decide What to Cite, WebSight Design
llms.txt Explained (May 2026): The Honest Guide to the Spec, Adoption, and How to Ship One, Coders Era
Google Says llms.txt Does Nothing for SEO Rankings, Digital Applied
We Analyzed robots.txt Across Cloudflare’s Network, Technology Checker