The black-box problem
AI providers don’t publish their citation logic in detail. Anyone claiming to know exactly how Perplexity, AI Overviews, or ChatGPT chooses sources is either inside one of those companies or guessing. This post does the second thing carefully.
The reason for the care: AI citation behavior isn’t a single deterministic algorithm. It’s the composition of a retrieval step, a re-ranking step, a synthesis step, and a citation-attribution step — each running on different infrastructure, sometimes calling different sub-models, often gated by user context that’s invisible to the publisher. What’s publicly visible is the output. What’s claimed publicly about the mechanism runs out fast.
So the structure of this post is deliberate:
- What’s actually documented (sourced).
- What’s observable from running queries and watching what happens (labeled as observation, not proof).
- The model that ties both together into something an operator can act on.
Where claims are speculative, I’ll say so. Where they’re sourced, I’ll point at the source. Where they’re observational, the language will be hedged on purpose.
What is publicly known.
Each major AI provider has some public documentation of how their citation behavior works. The detail varies. Here’s a survey, with caveats:
Perplexity publishes guidance about how their search and citation behavior works in their help center, including notes on which sources they tend to favor (sources with strong topical authority, working schema, accessible content). They’ve also published some commentary about preferring sources that pass standard web-search quality signals.
Google AI Overviews documentation lives across Google Search Central. The relevant guidance covers crawler access (Googlebot and Google-Extended for AI training/answers), structured data requirements for surfaces like FAQ and HowTo, and general E-E-A-T signals that apply to all Google surfaces. Google has been explicit that AI Overviews use Search’s underlying ranking systems as a starting point, then apply additional synthesis logic.
OpenAI documents browsing behavior for ChatGPT in their public help articles, including which user-agents they use, how they handle robots.txt (they do respect it, with separate directives for GPTBot, ChatGPT-User, and OAI-SearchBot), and how citation attribution works in browsing-enabled sessions.
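The per-bot directives OpenAI documents can be expressed in robots.txt directly. A minimal sketch, assuming a site that wants to stay visible to search and browsing crawlers while opting out of training crawls — the policy split here is illustrative, not a recommendation:

```
# Allow OpenAI's search index crawler
User-agent: OAI-SearchBot
Allow: /

# Allow user-initiated browsing from ChatGPT sessions
User-agent: ChatGPT-User
Allow: /

# Opt out of training-data crawls
User-agent: GPTBot
Disallow: /
```

Because the three user-agents carry different purposes, treating them as one bot forfeits the ability to make exactly this kind of distinction.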
Anthropic documents Claude’s web access behavior, including the ClaudeBot user-agent, robots.txt respect, and citation behavior when Claude returns sourced answers.
What every provider’s documentation has in common: they describe what their crawlers are, what they respect (robots.txt, rate limits, sitemap signals), and what content patterns they favor in general terms. None of them publishes the specific re-ranking weights, the synthesis prompt, or the exact criteria for inclusion in a final answer. Those remain undocumented and likely change frequently.
The honest read: there’s enough publicly documented to know what AI systems expect from sites that want to be cited, and enough left unsaid that operating on observations is part of the work.
What seems to correlate, based on observable behavior.
The next paragraphs are the ones to read most carefully. Every claim here is observational. AI systems appear to favor certain patterns, based on aggregate behavior across queries. None of this is a published rule. None of it is a deterministic guarantee.
Sources with strong topical authority appear to be preferred. When a query lands in a topic with established expert sources, the cited answers tend to come from those sources rather than from new entrants. This roughly mirrors classical search ranking — backlinks, citations, and topical depth all contribute. The crossover is not surprising; AI systems retrieve from corpora that include classical search signals.
Sources with clear authorship and entity definition appear to be preferred. When a citation attribution is generated, the AI system needs to identify who produced the content. Pages with explicit Person and Organization JSON-LD, with verifiable sameAs links to public profiles, appear to be cited more reliably than pages where authorship is implicit or absent. This is one of the patterns we observe most consistently in field testing.
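The authorship pattern above, sketched as JSON-LD. Every name, URL, and `@id` below is a placeholder; the point is the explicit Person and Organization nodes with verifiable `sameAs` links, not any particular property set:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example article",
  "author": {
    "@type": "Person",
    "@id": "https://example.com/#jane-doe",
    "name": "Jane Doe",
    "sameAs": [
      "https://www.linkedin.com/in/janedoe-placeholder",
      "https://github.com/janedoe-placeholder"
    ]
  },
  "publisher": {
    "@type": "Organization",
    "@id": "https://example.com/#org",
    "name": "Example Co",
    "sameAs": ["https://example.org/profile-placeholder"]
  }
}
```

The `@id` values let other pages on the same site reference the identical Person and Organization nodes instead of redefining them, which is what keeps the entity consistent site-wide.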
Sources with extractable, well-structured answers appear to be preferred for question-style queries. When a user asks “what is X?”, AI systems tend to cite pages with definition-style content that’s lift-safe in isolation. Pages that bury the definition in a long narrative, or whose key answer requires reading three paragraphs of context, get cited less often even when their topical authority is high.
Sources that match the query’s apparent intent shape appear to be preferred. A “best-of” query draws from comparison-style content. A “how-to” query draws from step-list content. A “what-is” query draws from definition content. AI systems appear to retrieve and cite based on shape-matching, not just topic-matching.
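One way to picture shape-matching is a crude classifier over query wording. This is a toy sketch of the idea only — real systems infer intent from far richer signals than surface prefixes, and none of them publishes logic like this:

```python
def query_shape(query: str) -> str:
    """Toy heuristic: map a query to the content shape it
    most plausibly retrieves. Illustrative only."""
    q = query.lower().strip()
    if q.startswith(("what is", "what are", "define")):
        return "definition"
    if q.startswith(("how to", "how do i", "how can i")):
        return "step-list"
    if q.startswith("best") or " vs " in q or "compare" in q:
        return "comparison"
    return "general"

print(query_shape("What is entity disambiguation?"))  # definition
print(query_shape("best schema validators"))          # comparison
```

The operator-side implication is the inverse of the classifier: a page hoping to be cited for "what is X" queries should contain content in definition shape, not just on the definition's topic.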
Sources that pass technical access checks reliably appear to be preferred. This one is mostly mechanical: if the AI system’s crawler can’t access the page, can’t render its content, or can’t parse its structure, the page can’t be cited even when its authority and content are excellent. Working schema, parseable HTML, allowlisted crawlers — these aren’t citation factors per se, but they’re necessary conditions.
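The mechanical checks above can be verified programmatically. A minimal sketch that evaluates the necessary conditions from already-fetched response data — the field names and the `body_in_html` flag are my own shorthand, not any provider's API:

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

@dataclass
class FetchResult:
    status: int          # HTTP status the crawler saw
    body_in_html: bool   # main content present without JS execution?

def crawler_can_cite(robots_txt: str, user_agent: str,
                     url: str, fetch: FetchResult) -> bool:
    """True only if every mechanical precondition for citation holds."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch(user_agent, url):
        return False           # blocked by robots.txt for this bot
    if fetch.status != 200:
        return False           # non-200: the crawler never gets content
    return fetch.body_in_html  # JS-hidden body fails many AI crawlers

robots = "User-agent: GPTBot\nDisallow: /private/\n"
ok = crawler_can_cite(robots, "GPTBot", "https://example.com/post",
                      FetchResult(status=200, body_in_html=True))
print(ok)  # True
```

Note what the function deliberately does not check: authority, authorship, formatting. Those only matter once every line here returns true.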
The honest summary of this section: the correlations are observable, the mechanism is partly opaque, and the patterns are stable enough across queries that they’re useful for operators even without a published rulebook.
The Citation Readiness Chain: Access → Understanding → Extractability.
This is the model that ties the observations together into something actionable.
For a site to be cited at all, it has to clear three sequential layers. A failure at any of them caps everything downstream:
Fail at Access: no citation is possible. The AI system’s crawler can’t reach the content. Robots.txt blocks the relevant user-agent, the page returns a non-200 status, the content is locked behind authentication, or JavaScript rendering hides the body from the crawler. It doesn’t matter how good the content is: if the crawler can’t reach it, the answer can’t reference it.

Fail at Understanding: citation is unreliable. The crawler reaches the page, but can’t parse the entity, structure, or intent reliably. The author is “John Doe” with no schema. The Organization is referenced in the footer but has no JSON-LD. The headings don’t structurally mirror the questions the page answers. The AI system might cite the page, but the citation will be brittle — likely to drop on the next retrieval pass when a better-understood source becomes available.
Fail at Extractability: citation may happen but the lift is unclean. The page can be reached and parsed, but the AI system can’t lift a clean, attributable answer from it. The relevant sentence requires four sentences of prior context to make sense. The definition is split across two paragraphs. The comparison table doesn’t have explicit row labels. Citation may still occur, but the cited snippet will be partial or wrong-context, and the AI system is more likely to switch to a cleaner source on the next query.
Citation readiness is a chain. The weakest layer sets the ceiling on the next two. A site that wins at content but fails at access cannot be cited. A site that wins at access and content but fails at extractability gets cited unreliably.
Operators acting on this should solve in order. Access first, because nothing downstream matters without it. Understanding second, because parseable structure conditions everything that follows. Extractability third, because that’s where craft compounds.
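The sequential-gate structure described above can be made concrete. A sketch, assuming three boolean layer checks; the layer names follow this post's model, not any published specification:

```python
LAYERS = ("access", "understanding", "extractability")

def citation_ceiling(checks: dict) -> str:
    """Return the first failing layer, or 'citable' if all pass.

    The chain is sequential: a failure at one layer makes the
    later layers irrelevant, which is why fixes go in order."""
    for layer in LAYERS:
        if not checks.get(layer, False):
            return f"blocked at {layer}"
    return "citable"

site = {"access": True, "understanding": True, "extractability": False}
print(citation_ceiling(site))  # blocked at extractability
```

The early-return is the point: there is no score to average, because a perfect extractability story contributes nothing while access is failing.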
What you can change vs. what you can’t.
The line between what an operator can control and what they can’t is the most frequently confused part of this conversation.
You cannot make an AI system cite you. You can’t directly change the retrieval logic. You can’t pay for placement. You can’t game the synthesis prompt. You can’t guarantee any specific citation, in any specific query, in any specific AI system, at any specific time.
You can change what your page makes available to those systems. That distinction is the whole game.
Concretely, the levers an operator controls:
- Schema and structured data. Organization, Person, Article, FAQ, HowTo, Product, Review — explicit JSON-LD that an AI system can parse without ambiguity.
- Entity disambiguation. sameAs links to verifiable public profiles, @id-based references, consistent naming across pages.
- Authorship signals. Bylined content with Person schema, biographical depth, verifiable affiliations.
- Crawler access. Per-bot robots.txt directives, render-readiness for AI crawlers that don’t execute JavaScript the same way classical search does, working sitemaps.
- Extractable answer formatting. Definition blocks, FAQ structure, comparison tables, lift-safe sentence patterns.
- Cross-system monitoring. Tracking where the site appears in AI Overviews, Perplexity, ChatGPT, and Claude over time, so the trajectory is visible even when individual queries are noisy.
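The last lever on that list, cross-system monitoring, reduces to a small bookkeeping problem: record presence or absence per system per query over time, then read rates rather than individual results. A minimal sketch — the systems, dates, and observations are illustrative data, not real measurements:

```python
from collections import defaultdict

def citation_rates(observations):
    """observations: iterable of (date, system, query, cited: bool).

    Returns the per-system citation rate, which is a steadier
    signal than any single noisy query result."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for _date, system, _query, cited in observations:
        totals[system] += 1
        hits[system] += int(cited)
    return {s: hits[s] / totals[s] for s in totals}

obs = [
    ("2025-01-01", "perplexity",   "what is aeo", True),
    ("2025-01-01", "ai_overviews", "what is aeo", False),
    ("2025-01-08", "perplexity",   "what is aeo", True),
    ("2025-01-08", "ai_overviews", "what is aeo", True),
]
print(citation_rates(obs))  # {'perplexity': 1.0, 'ai_overviews': 0.5}
```

Bucketing the same rates by week instead of by system would show the trajectory over time, which is the view that makes individual-query noise tolerable.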
None of these guarantees citation. Each of them improves the probability that a given AI system, on a given query, in a given user context, can choose your page when it’s looking for a source.
That’s the honest level of control to design for.
The honest summary.
We don’t know exactly how AI systems choose what to cite. We know what they expect in order to be able to cite. We know that retrieval, ranking, synthesis, and attribution each run on different infrastructure with different criteria. We know that public documentation describes the floor, not the ceiling. And we know that the patterns are stable enough across queries that operating on them is the right work.
For the operator-side origin of why I think this matters, the founder note on AIVZ covers the broader thesis. For SEO professionals working out what changes vs. what stays the same, the AEO-for-SEO-pros bridge is the right next read.
If you want to see where your site actually sits on the Citation Readiness Chain — Access, Understanding, Extractability — the fastest path is to run the scan and read the result.