The Anatomy of an AI-Cited Website: What ChatGPT Actually Pulls

After reverse-engineering hundreds of AI answers over the last two years, I can tell you there is a visible pattern in the pages that get cited. It is not random, and it is not just about domain authority. I am Jake McCluskey, and I want to walk you through a teardown of what a cited page actually looks like under the hood. No fluff, no jargon. Just the specific anatomy of a page AI systems choose, side by side with one they skip. If you can match the pattern on your own key pages, you change your odds dramatically.
What does an AI-cited page look like at a glance?
An AI-cited page has five visible traits: a question in the H2, a direct answer in the first paragraph under it, clean schema in the HTML head, a named author with expertise, and specific details (numbers, tools, timeframes) in every section. In the pages I have reviewed, missing any two of those is usually enough for citations to fall off.
You can spot the pattern just by reading the page the way a model would. Scan the H2s. Are they questions? Look at the first sentence under each H2. Does it answer the question directly? Check the page source. Is there JSON-LD? Is the author a real Person entity? If you say yes to all of those, you have the skeleton of a citation-ready page.
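That read-through can be roughed out in a few lines of Python. This is only a sketch using the standard library's html.parser. The class name H2Audit, the function audit_h2s, and the sample markup are all made up for illustration, and "ends with a question mark" is a crude stand-in for "is a question".

```python
from html.parser import HTMLParser


class H2Audit(HTMLParser):
    """Collect each <h2> and the first <p> that follows it (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sections = []    # list of [h2_text, first_paragraph_text or None]
        self._capture = None  # "h2", "p", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._capture, self._buffer = "h2", []
        elif tag == "p" and self.sections and self.sections[-1][1] is None:
            # Only the first paragraph under each H2 is recorded.
            self._capture, self._buffer = "p", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())
        if tag == "h2":
            self.sections.append([text, None])
        else:
            self.sections[-1][1] = text
        self._capture = None


def audit_h2s(html):
    """Return one row per H2: its text, whether it reads as a question, and the first paragraph."""
    parser = H2Audit()
    parser.feed(html)
    return [
        {"h2": h2, "is_question": h2.endswith("?"), "first_paragraph": para}
        for h2, para in parser.sections
    ]


# Made-up sample markup, not taken from any real page.
sample = """
<h1>Estate planning</h1>
<h2>How does estate planning work?</h2>
<p>Estate planning sets out who receives your assets.</p>
<p>More detail.</p>
<h2>Our services</h2>
<p>We do many things.</p>
"""
report = audit_h2s(sample)
for row in report:
    print(row)
```

The script cannot tell you whether the first paragraph actually answers the H2. That part still needs a human read, but it puts every heading next to its opening paragraph so the read takes minutes.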
This is not the flashy part of SEO. It is plumbing. But the plumbing is exactly what models evaluate when they pick whose paragraph to quote.
What does the HTML source of a cited page actually contain?
The HTML source of a cited page contains clean, semantic structure with readable content delivered on the initial page load, plus structured data in JSON-LD. Many AI crawlers and retrieval systems do not wait for heavy JavaScript to render. They read what comes back from the first request.
When I View Source on a page that consistently gets cited, I see a clear H1 with the topic, H2s written as questions, short paragraphs under each, and lists where they help. Images have real alt text. Links are real anchor tags with descriptive text, not generic 'click here' labels. The JSON-LD sits in the head and covers Organization, Person, Article or BlogPosting, and usually FAQPage.
Contrast that with a page that never gets cited. Same topic, similar word count, but the source is mostly div wrappers and JavaScript placeholders. The H2s are decorative phrases rather than questions. The schema is either missing or wrong. No author. No dates. The model either never sees it, or sees a blob of text with no attribution path.
Here is the honest version of the lesson. If your content is rendered entirely by client-side JavaScript and your JSON-LD depends on that render, you are probably invisible to most AI retrieval systems today. Server-side rendering, or at least a pre-rendered HTML snapshot, is not optional.
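One way to test this on your own pages is to look only at the raw HTML of the first response and list the schema types it already contains. The sketch below does that with the Python standard library. JsonLdExtractor and schema_types are hypothetical names, and the sample strings are made up. In practice you would feed it the unrendered response body from your own server.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld, self._buffer = True, []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                self.blocks.append(None)  # block is present but malformed


def schema_types(raw_html):
    """Return the set of @type values found in the raw, unrendered HTML."""
    parser = JsonLdExtractor()
    parser.feed(raw_html)
    types = set()
    for block in parser.blocks:
        if isinstance(block, dict):
            nodes = block.get("@graph", [block])
        elif isinstance(block, list):
            nodes = block
        else:
            continue
        for node in nodes:
            node_type = node.get("@type") if isinstance(node, dict) else None
            if isinstance(node_type, str):
                types.add(node_type)
            elif isinstance(node_type, list):
                types.update(node_type)
    return types


# Made-up examples: one server-rendered page, one empty client-side shell.
server_rendered = """
<head><script type="application/ld+json">
{"@context": "https://schema.org",
 "@graph": [{"@type": "Organization"}, {"@type": "Person"}]}
</script></head>
"""
client_shell = '<div id="root"></div><script src="/app.js"></script>'

print(schema_types(server_rendered))
print(schema_types(client_shell))
```

If the second case looks like your site, the schema exists only after JavaScript runs, which is the situation described above.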
What does the schema look like on a page that gets quoted?
A quoted page almost always has four schema blocks working together: Organization, Person (the author), Article or BlogPosting (the content), and FAQPage (for the FAQ block). They reference each other through @id fields so the model can trace identity and expertise in one pass.
Organization schema lives on every page, usually in a shared layout template. It declares your legal name, logo, URL, description, and a sameAs array that links to your LinkedIn, Google Business Profile, Wikidata entry, Crunchbase, and any industry profiles. That sameAs array is what proves you are a real business with external corroboration.
Person schema declares the author, with their own name, jobTitle, url, image, and sameAs links to their LinkedIn, personal profiles, and Wikidata if applicable. The Article schema then uses the Person as its author property, referenced by @id. That chain is how the model credits the human behind the content.
FAQPage schema wraps the FAQ block at the bottom of the page. Each question is a Question entity with an acceptedAnswer that contains the exact text. When a model is searching for a direct answer to a user's question, FAQPage entries are some of the most clearly retrievable content on the web. In the answers I have reviewed, they get pulled disproportionately often.
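To make the @id chaining concrete, here is a minimal sketch that builds the four blocks as one JSON-LD graph in Python. Every value is a placeholder (example.com, the firm name, the author name, the dates, the FAQ text). The property names are standard schema.org vocabulary, but the exact set you need depends on your site.

```python
import json

SITE = "https://example.com"  # placeholder domain, not a real client

org = {
    "@type": "Organization",
    "@id": f"{SITE}/#organization",
    "name": "Example Firm LLC",            # placeholder legal name
    "url": SITE,
    "logo": f"{SITE}/logo.png",
    "sameAs": ["https://www.linkedin.com/company/example-firm"],
}

author = {
    "@type": "Person",
    "@id": f"{SITE}/#author",
    "name": "Jane Example",                # placeholder author
    "jobTitle": "Estate Planning Attorney",
    "url": f"{SITE}/about/jane-example",
    "worksFor": {"@id": org["@id"]},       # points back to the Organization
    "sameAs": ["https://www.linkedin.com/in/jane-example"],
}

article = {
    "@type": "BlogPosting",
    "@id": f"{SITE}/estate-planning/#article",
    "headline": "How does estate planning actually work?",
    "datePublished": "2025-01-15",         # placeholder dates
    "dateModified": "2025-03-01",
    "author": {"@id": author["@id"]},      # Person referenced by @id
    "publisher": {"@id": org["@id"]},
}

faq = {
    "@type": "FAQPage",
    "@id": f"{SITE}/estate-planning/#faq",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Placeholder question, copied from the visible FAQ block?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Placeholder answer, matching the visible text exactly.",
            },
        }
    ],
}

graph = {"@context": "https://schema.org", "@graph": [org, author, article, faq]}
json_ld = json.dumps(graph, indent=2)
print(json_ld)
```

The printed JSON goes inside a script tag of type application/ld+json in the head. Because the Article points at the Person and the Person points at the Organization by @id, each entity is declared once and reused rather than repeated inline.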
What kinds of paragraphs get quoted most often?
Models quote paragraphs that are short, self-contained, and answer a specific question without needing context from earlier in the page. The ideal quoted paragraph is two to four sentences, opens with a direct statement, and contains at least one specific detail (a number, a name, a timeframe).
The opening sentence does most of the work. If a paragraph starts with a direct factual claim, it has already signaled that the rest of the paragraph is the elaboration. Models prefer that shape because it matches the way they generate answers themselves. A paragraph that starts with 'There are many factors to consider when' gets skipped. A paragraph that starts with 'GEO usually takes 30 to 90 days to show results' gets quoted.
Specificity matters. Compare these two sentences. First: 'Businesses should consider AI tools that align with their goals.' Second: 'Most small businesses get their fastest ROI from AI tools that automate lead triage inside HubSpot or Salesforce.' The second one gets quoted. The first one sounds like half the internet.
And a small note that matters a lot. The quoted paragraph almost always sits immediately under a question-style H2. The H2 is the retrieval handle. Without it, the model has no cue that this paragraph is an answer to anything.
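The paragraph shape described above (two to four sentences, a direct opener, at least one specific detail) can be turned into a rough lint check. The sketch below is a heuristic I am assuming for illustration, not how any model scores text. The function name check_paragraph and the list of vague openers are made up, a digit stands in for "specific detail", and the sentence splitter is naive about abbreviations.

```python
import re

# Hypothetical list of openers that delay the answer. Extend it for your own copy.
VAGUE_OPENERS = ("there are many", "it depends", "when it comes to")


def check_paragraph(text):
    """Flag the basic shape of a paragraph: length, a number, and a direct opener."""
    stripped = text.strip()
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]
    return {
        "sentence_count": len(sentences),
        "length_ok": 2 <= len(sentences) <= 4,
        "has_number": bool(re.search(r"\d", stripped)),
        "vague_opener": stripped.lower().startswith(VAGUE_OPENERS),
    }


# The first sentence of each sample comes from the article. The rest is test filler.
direct = check_paragraph(
    "GEO usually takes 30 to 90 days to show results. "
    "This second sentence is filler for the test."
)
vague = check_paragraph("There are many factors to consider when choosing a tool.")

print(direct)
print(vague)
```

A check like this only catches shape. Whether the detail is accurate and the answer is actually useful is still an editing job.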
What H2 structure works best for AI retrieval?
The H2 structure that works best is a page built from five to eight question-style H2s, each covering a single clear sub-question under the page's main topic. Each H2 gets a two- or three-sentence direct answer, then expansion. No filler H2s, no section headings that are just category labels.
A pattern I use on nearly every client page is this. H2 one is the definitional question (what is this thing). H2 two is the differentiator question (how is it different from what I already know). H2 three is the how-it-works question. H2 four is the signals or requirements question. H2 five is the timeline or cost question. H2 six is the common mistakes question. H2 seven is the checklist or action question.
That structure covers the retrieval questions a model is most likely to ask in the background when it builds an answer. The model doesn't just ask the user's explicit question. It asks a web of related sub-questions and pulls the best answer for each. A page that covers all of them is far more likely to get quoted multiple times across multiple answers.
Cited page vs uncited page: what is actually different?
The difference between a cited page and an uncited page is almost never content quality. It is structure, schema, and specificity. Two pages can have the same basic expertise, and one gets quoted a hundred times while the other sits dormant.
Let me describe a real comparison I ran last year. Two legal services firms, same city, same practice area, similar traffic. Firm A had a page titled 'Our Estate Planning Services' with paragraphs describing what they do. Firm B had a page titled 'How does estate planning actually work?' with question-style H2s, answer-first paragraphs, specific cost ranges, FAQ schema, and Person schema on the author attorney.
Over three months, Firm B was cited in roughly 40 ChatGPT and Perplexity answers we could track. Firm A was cited in none. Same credibility, same expertise, completely different outcome. The difference was entirely in how the pages were structured and marked up.
That gap is the one worth obsessing over. You don't need to be a bigger firm or a more well-known brand. You need to be the more retrievable answer for the questions your buyers actually ask.
How do AI systems decide which of several similar pages to quote?
When two or three pages cover the same topic at similar quality, AI systems tie-break using entity strength, freshness, and passage quality. The page with the clearer Person and Organization entity, the more recent dateModified, and the tighter answer paragraphs wins the citation.
Entity strength is the part most teams underinvest in. If your page and a competitor's page are roughly equal in content, but their author has a Wikidata entry, a LinkedIn profile with 10,000 followers, and 20 citations across the web, while your author has nothing, the model picks them. Not because the content is better. Because the attribution is safer.
Freshness works both for and against you. A page with a recent dateModified and recently added content beats an older page on the same topic even if the older page has more links. But freshness without substance backfires. A page that was 'updated' by changing two words and bumping the date can lose trust.
Passage quality is the micro-level tie-breaker. Once the model is down to a shortlist of pages, it scores individual paragraphs on clarity, specificity, and standalone readability. A tight three-sentence answer with a named number beats a five-sentence paragraph that wanders. Cut every paragraph on your key pages down to its most quote-worthy form.
What can a marketing lead ship this quarter?
A marketing lead can ship four things this quarter that will move AI citation rates: question-style H2s with answer-first paragraphs across the top ten pages, Organization and Person schema shipped site-wide, FAQPage schema on any page with a real FAQ block, and a named author entity (even if that is the founder) connected to every piece of content.
Those four items do not require a massive budget or a rebuild. Your developer can ship the schema in a week or two. Your content lead or copywriter can rework the top ten pages in another two weeks. The author entity is usually a two-hour task if you already have a real person to credit.
If you want a structured way to approach this, I run audits that map your current site against the anatomy described above and give you a prioritized fix list. You can see how that fits my broader services, or just request a free audit if you want to see where you stand before committing to anything larger.
The anatomy of an AI-cited page is not a secret. It is a pattern any team can match if they know what to build toward. The businesses that will dominate AI citations in 2027 are not the ones with the biggest brand today. They are the ones turning their top ten pages into clean, structured, specific answers right now. If you want me to look at your site against this anatomy and tell you which two or three changes would move the needle first, a short discovery call is the fastest way to get there.