The AI Search Manual

CHAPTER 9

How to Appear in AI Search Results (The GEO Core)

Whether it’s AI Overviews in Google Search, conversational responses in ChatGPT, or synthesized answers in Perplexity, the question content creators and businesses now face is how to show up in all of these places.

Let’s look at some ways you can potentially do that.

Specificity and Extractable Data Points

Generative engines validate, compare, and often cite content in their summaries. In that process, concrete facts, figures, dates, and measurable data points become critical signals. The more specific your content is, the more likely it is to be selected, synthesized, and surfaced.

What to Focus On:

  • Include specific statistics and quantifiable facts: AI prefers clear numbers over vague generalizations, such as “85 percent of users” instead of “most users.” 
  • Use full dates, not just years or phrases: The models use timestamps to assess content freshness and context. Writing “as of April 2024” or “between 2021 and 2023” gives the model a clearer picture than “in recent years.”
  • Present data in extractable formats: Use tables, bullet points, or clearly labeled metrics, like “Google’s AI Overviews appeared in 51.4 percent of U.S. search queries in May 2024.” 
  • Support claims with links to trusted sources: When referencing numbers or studies, cite original data where possible. This improves your perceived authority and gives the model a traceable source to validate.

Measurable data helps AI systems evaluate whether content can be trusted, so they can summarize more confidently, align facts across multiple sources, and identify your content as a reliable contribution to an answer.

Structured Data and Meta Signals

Generative AI models disambiguate topics, identify entities, and determine the usefulness of content without reading every word on the page. They have moved beyond simple keyword matching and rely more heavily on structured signals to interpret and reassemble information. Schema markup, meta descriptions, and other structural hints give them the clarity they need to understand the meaning, relationships, and utility of your content at the page level and within individual elements.

These signals don’t just improve discoverability. They also enhance your inclusion in generative outputs.

If you are going to create a robust, machine-readable knowledge base, you need to look beyond Schema.org and provide additional layers of direction.

Here are a few ways to evolve your structured data:

  • Custom Ontologies: An ontology is a formal, machine-readable map of a specific domain. It defines the key entities, their attributes, and the relationships between them. While Schema.org provides a general vocabulary, a custom ontology allows you to create a much more detailed and specific schema for your unique content. This is particularly useful for specialized sites whose precise information goes beyond what Schema.org covers: think pharmaceuticals, banking, and financial services.
  • Internal Knowledge Graphs: An internal knowledge graph connects all of your content’s entities and their relationships. It’s your own private version of Google’s Knowledge Graph, weaving your content into an interconnected, semantically complete web.
  • Structured Content CMS: Traditional CMS platforms are often page-centric. Structured CMS allows you to create entities (e.g., “Richmond, VA”) and map them across multiple pieces of content. This makes maintaining an internal knowledge graph easier and can significantly enhance AI’s understanding of your content. 

What to Focus On:

  • Use schema markup wherever applicable.
  • Implement structured data types that align with your content: Types such as Article, FAQPage, HowTo, Product, and Organization are among the most impactful for GEO.
  • Be comprehensive, not just compliant: It’s not enough to pass validation tools like Rich Results Test. The more fully you define entities, attributes, and relationships, the more context you provide for AI to extract and reuse your content.
  • Add and maintain accurate meta descriptions: While not a ranking factor, meta descriptions often appear in traditional search snippets and can influence how AI systems summarize or preview your content. Make sure they are concise, descriptive, and aligned with the content’s purpose.
  • Use clear heading hierarchy and internal structure: Proper use of <h1>, <h2>, and <p> tags helps both search engines and LLMs segment and interpret content. This kind of structural clarity supports chunking, summarization, and entity extraction.
  • Avoid overuse of generic or irrelevant markup: Don’t tag everything. Misusing structured data (like applying FAQPage markup to a list of internal links) may result in Google ignoring it. Focus on honest, well-aligned markup that reflects actual page content.
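Schema markup is usually delivered as JSON-LD. As a minimal sketch, the Python snippet below builds an Article object and serializes it; every value here is a hypothetical placeholder, and a real implementation would mirror your actual page content and only use properties your page genuinely supports.

```python
import json

# Hypothetical page data; every value is an illustrative placeholder.
article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How to Appear in AI Search Results",
    "datePublished": "2024-04-15",
    "author": {"@type": "Organization", "name": "Example Agency"},
    "about": {"@type": "Thing", "name": "Generative Engine Optimization"},
}

# Serialize to the JSON-LD string that would be embedded in the page.
json_ld = json.dumps(article_schema, indent=2)
print(json_ld)
```

The resulting string goes inside a <script type="application/ld+json"> tag in the page’s HTML, where search engines and AI crawlers can parse it without reading the surrounding prose.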

Forum and UGC Prioritization

For queries involving troubleshooting, product comparisons, lived experiences, or niche use cases, user-generated content (UGC) and forum discussions are often prioritized by AI systems. Generative models value this type of content because it reflects authentic, diverse, and situational insights that can’t always be found in more polished corporate content.

This trend has become more visible with Google’s Hidden Gems update and the increasing appearance of Reddit and Quora excerpts in AI Overviews and conversational results.

What to Focus On:

  • Understand when UGC is preferred. AI systems tend to surface forum or user discussion content for:
    • Technical troubleshooting and workarounds
    • First-hand product feedback
    • Real-world usage tips
    • “What’s the best…?” or “Has anyone tried…?” type queries
  • Encourage structured contributions on your platform. If you manage a site that contains user input (e.g., reviews, Q&As, forums), prompt contributors to:
    • Use full sentences
    • Include specific outcomes or setups (“when I used X on a Mac M1…” rather than just “didn’t work”)
    • Separate multipart answers with line breaks or bullet points. AI models favor structured language because it’s easier to extract, summarize, and rephrase.
  • Mark up UGC with clear schema where possible: Use schema.org for Review, QAPage, or DiscussionForumPosting to help search and AI systems identify user responses and rank them appropriately.
  • Optimize for content utility. For UGC-heavy queries, the rawness of the response can be a strength. AI is trained to detect utility signals like:
    • Whether the answer solves the user’s problem
    • If it includes steps or explanations
    • If others upvoted or replied to it (i.e., engagement as a signal of quality)

  • Monitor how AI surfaces public UGC: AI Overviews and Perplexity frequently quote Reddit threads, YouTube comments, and niche forums. Tracking when and where this happens provides insight into how informal content is influencing generative summaries.

 

AI engines are increasingly looking beyond corporate blogs and product pages to answer real human questions. For GEO, this means content strategy should account for where and how your audience is sharing insights. 

High-Quality, Entity-Rich, Embedding-Friendly Language

In traditional SEO, content relevance often meant placing the right keywords in the right spots. But in the context of GEO, keyword density matters less than clarity, relevance, and how well your content maps into vector space. 

Generative AI systems work by encoding language into vector representations called embeddings. These embeddings capture the relationships between concepts, not just words. The clearer and more semantically rich your content, the easier it is for AI models to parse, understand, and reuse it.


What to Focus On:

  • Write with clearly defined entities: Use precise language that identifies the main subject or concept being discussed. For example, instead of “this tool,” say “Google Search Console.” Named entities (like brands, people, products, and places) help LLMs resolve meaning more effectively.
  • Use consistent terminology: Pick one term for each concept and use it consistently across your content. LLMs can struggle with synonyms or ambiguous phrases. Repetition of precise terms strengthens the entity embedding.
  • Include modifiers and descriptors: Qualifiers like size, function, location, and purpose help differentiate similar entities. For instance, “enterprise SEO agency” conveys more meaning than just “agency.”

Clarity fuels visibility in generative systems. Your goal is to write in a way that helps the model make accurate, meaningful associations between topics. This makes your content more retrievable and more useful as part of the AI’s response.

It doesn’t stop there, though. Considerations when creating quality content can go much deeper. 

Tokenization

Tokenization is the process of splitting text into smaller units called tokens. These can be words, subwords, or even characters. It’s a foundational step in most natural language processing (NLP) tasks, crucial for analyzing text, calculating keyword density, and preparing input for models like BERT. Tokenization can also be used to protect sensitive data, or to process large amounts of data.

Example: 

For the sentence, “Google Search is evolving with AI Overviews”, tokenization might produce the tokens “Google”; “Search”; “is”; “evolving”; “with”; “AI”; “Overviews”; and “.”
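A minimal word-level tokenizer can be sketched in a few lines of Python. This is purely illustrative: production models typically use learned subword tokenizers (such as BPE or WordPiece) rather than a simple regex.

```python
import re

def tokenize(text: str) -> list[str]:
    # Match runs of word characters, and treat any other non-space
    # character (punctuation) as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Google Search is evolving with AI Overviews.")
print(tokens)
```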

POS Tagging

Part of Speech (POS) tagging assigns a grammatical category (e.g., noun, verb, adjective, adverb) to each word in a sentence. This helps the model understand the syntactic structure of the text, which is fundamental for more complex NLP tasks like dependency parsing, named entity recognition, and information extraction (which we’ll get into below). 

It also works well for clarifying ambiguity in terms with numerous meanings and showing a sentence’s grammatical structure, which contributes to better semantic understanding for AI Search.

Example

For the sentence “Optimizing content helps improve visibility in AI-driven search,” POS tagging might label “Optimizing” as a verb, “content” as a noun, “helps” as a verb, “improve” as a verb, “visibility” as a noun, and so on.
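To make the idea concrete, here is a toy tagger that combines a tiny hand-made lexicon with suffix heuristics. It is a sketch for illustration only; real taggers (e.g., those in spaCy or NLTK) are statistical or neural models trained on annotated corpora.

```python
def toy_pos_tag(tokens):
    # Tiny hand-made lexicon plus crude suffix rules; everything here
    # is an illustrative assumption, not a real tagging model.
    lexicon = {"helps": "VERB", "improve": "VERB", "in": "ADP", "search": "NOUN"}
    tags = []
    for tok in tokens:
        low = tok.lower()
        if low in lexicon:
            tags.append((tok, lexicon[low]))
        elif low.endswith("ing"):
            tags.append((tok, "VERB"))
        elif low.endswith("ity") or low.endswith("ment"):
            tags.append((tok, "NOUN"))
        else:
            tags.append((tok, "NOUN"))  # crude default guess
    return tags

tagged = toy_pos_tag(["Optimizing", "content", "helps", "improve", "visibility"])
print(tagged)
```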

Named Entity Recognition

Named entity recognition (NER) is the task of identifying and classifying named entities (persons, organizations, locations, dates, etc.) in text. Crucial for semantic search, knowledge graph construction, content categorization, and understanding key concepts mentioned in a document, NER is a big part of chatbots, sentiment analysis tools, and search engines. It’s often used in industries such as healthcare, finance, human resources, customer support, and higher education.

Example: 

In the sentence, “Google and OpenAI are leading companies in the AI search space,” NER would identify both “Google” and “OpenAI” as ORG (Organization) entities.
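The simplest possible NER is a dictionary (gazetteer) lookup, sketched below. Real NER systems are trained sequence models that generalize beyond a fixed entity list, so treat this only as an illustration of the input and output shapes.

```python
def gazetteer_ner(text, gazetteer):
    # Look up each known entity name in the text and report its label,
    # ordered by where it first appears.
    found = [(name, label) for name, label in gazetteer.items() if name in text]
    return sorted(found, key=lambda pair: text.index(pair[0]))

entities = gazetteer_ner(
    "Google and OpenAI are leading companies in the AI search space.",
    {"Google": "ORG", "OpenAI": "ORG", "Paris": "LOC"},
)
print(entities)
```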

Lemmatization vs. Stemming

Lemmatization and stemming are both ways of reducing words to their base or root form. They help information-retrieval systems and deep learning models identify related words in tasks such as text classification, clustering, and indexing.

  • Lemmatization reduces words to their dictionary form (lemma), ensuring the root word is a valid word itself and considering the word’s meaning. 
  • Stemming is a cruder process that chops off suffixes from words to get to a root form (stem). This stem might not be a valid word, though.

Lemmatization is generally preferred for semantic tasks in both SEO and AI Search because it retains meaning better, leading to more accurate keyword matching and understanding.

Example: 

For the sentence “Users were searching for optimized articles regularly,”

  • Stemming might yield: “user”; “were”; “search”; “for”; “optim”; “articl”; “regular”
  • Lemmatization might yield: “user”; “be”; “search”; “for”; “optimize”; “article”; “regularly”
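The contrast above can be sketched in code. The stemmer below blindly chops suffixes (producing non-words like “optim”), while the lemmatizer looks up a valid dictionary form; both are toy versions, and the lemma dictionary is a hand-made stand-in for a real morphological lexicon.

```python
def crude_stem(word):
    # Chop common suffixes without checking that the result is a real
    # word, mimicking how a stemmer yields stems like "optim".
    for suffix in ("ization", "ized", "ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, lemma_dict):
    # Dictionary lookup that returns a valid base form; real lemmatizers
    # also use POS tags and morphological rules.
    return lemma_dict.get(word, word)

lemmas = {"were": "be", "searching": "search", "optimized": "optimize",
          "articles": "article", "users": "user"}
print(crude_stem("optimized"), toy_lemmatize("optimized", lemmas))
print(crude_stem("were"), toy_lemmatize("were", lemmas))
```

Note how stemming fails on the irregular form “were”, while the lemma lookup correctly maps it to “be”.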

Semantic Chunking

Generative engines pull sections (a sentence, paragraph, or list) and use them to construct answers. So if your content is buried in a long-form narrative, it may be skipped. If it’s cleanly chunked and self-contained, on the other hand, it becomes far more usable.

To improve its chances of being included in generative responses, your content needs to be divided into clear, self-contained chunks, each expressing a complete idea on its own. This approach is referred to as semantic chunking.

What to Focus On:

  • One idea per paragraph: Each paragraph should clearly convey a single point. Avoid blending multiple concepts in a single block. Generative systems like Gemini and ChatGPT segment pages by paragraph, and often select one at a time for summarization.
  • Use bullets and lists for clarity: Bullet points, checklists, and step-by-step instructions signal hierarchy and help the model understand the relationship between ideas.
  • Use table rows and labeled data blocks: Tables break information into predictable, digestible formats. Use them to list comparisons, feature sets, definitions, or data summaries — but make sure each row is meaningful even when read on its own.
  • Avoid context-dependent phrasing: Sentences that rely on pronouns like “this,” “that,” or “it” without clearly defined subjects can lose meaning when lifted from their original source. Use specific nouns, and restate key terms to ensure that each chunk works independently.
  • Add concise headings before content blocks: Headings help AI models group related content and understand the scope of each section. They also act as markers when the model is choosing which chunk to surface.

Think of every paragraph, bullet, or table row as a potential answer on its own. Semantic chunking makes your content more extractable, more quotable, and more likely to appear in summaries, featured answers, or conversational results across AI-driven platforms.
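A first-pass chunker can simply treat blank lines as chunk boundaries, which rewards content that already follows the one-idea-per-paragraph rule. This is a minimal sketch; production pipelines also split on headings and token budgets.

```python
def chunk_by_paragraph(text, max_chars=500):
    # Treat blank lines as chunk boundaries so each chunk is one
    # paragraph, then flag whether it fits a rough size budget.
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(chunk, len(chunk) <= max_chars) for chunk in chunks]

page = (
    "Schema markup improves content discoverability.\n\n"
    "Semantic chunking divides content into self-contained ideas. "
    "Each chunk should make sense when read on its own."
)

chunks = chunk_by_paragraph(page)
print(len(chunks))
```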

Semantic Triples

As generative engines get more sophisticated, they rely more on structured relationships between concepts. One of the most effective ways to support this is by writing in semantic triples: simple subject-predicate-object phrases that state facts clearly.

Semantic triples help search engines understand context better by identifying entities, establishing connections, and building a web of interconnected concepts, which provide richer contextual information than just keywords. These triples are the building blocks of knowledge graphs, which allow AI systems to understand relationships between entities, enabling more intelligent search results, factual verification, and structured data for AI Overviews.

What to Focus On:

  • Write clear subject-predicate-object statements: They help models like Gemini and Claude identify and map entities into structured relationships.
    • “Paris is located in France.”
    • “ChatGPT was created by OpenAI.”
    • “Schema markup improves content discoverability.”
  • Use consistent nouns and verbs: Stick to regular specific terms for key subjects and actions. Repetition reinforces clarity in vector space and helps the AI model map recurring relationships.
  • Make each sentence a complete, self-contained idea: Avoid vague references like “this” or “that.” Instead of “this improves visibility,” say “schema markup improves visibility in search results.”
  • Use simple, readable language: AI performs better when the wording is direct and free of unnecessary complexity. Avoid jargon unless you also define it.
  • Keep sentences short and paragraphs tight: Brief, clear passages are easier for AI to chunk and summarize accurately. They also can help readers skim and retain key points.
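Writing in clean subject-predicate-object form also makes triples easy to pull back out. The sketch below uses a few hand-made regex patterns; real extraction systems derive triples from dependency parses or trained information-extraction models, so the patterns and predicate names here are illustrative assumptions.

```python
import re

# Hand-made patterns mapping simple sentence shapes to predicates.
PATTERNS = [
    (r"^(.*?) is located in (.*?)\.$", "locatedIn"),
    (r"^(.*?) was created by (.*?)\.$", "createdBy"),
    (r"^(.*?) improves (.*?)\.$", "improves"),
]

def extract_triple(sentence):
    # Return (subject, predicate, object) for the first matching pattern.
    for pattern, predicate in PATTERNS:
        match = re.match(pattern, sentence)
        if match:
            return (match.group(1), predicate, match.group(2))
    return None

print(extract_triple("Paris is located in France."))
print(extract_triple("ChatGPT was created by OpenAI."))
```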

Dependency Parsing

Dependency parsing analyzes the grammatical structure of a sentence by showing how words relate to each other as “heads” and “dependents.” It creates a tree-like structure, revealing the syntactic relationships between words (e.g., which word modifies which, or subject-verb relationships). For AI Search, this is crucial for understanding sentence meaning, coreference resolution, and accurate information extraction.

A dependency typically involves two words: one that acts as the head and another that acts as the child.

Example: 

For the sentence “The quick brown fox jumps over the lazy dog,” dependency parsing would show that “quick” and “brown” modify “fox,” “jumps” is the root verb, “fox” is the subject of “jumps”, and “dog” is the object of “over.”

Daniel Jurafsky and James H. Martin map out the different parts of dependency parsing in their textbook, Speech and Language Processing.

Co-reference Resolution

Co-reference resolution is the task of identifying all expressions that refer to the same real-world entity in a text. In “John Doe went to the store. He bought milk,” we refer to linguistic expressions like “he” or “John” or “Doe” as mentions or referring expressions, and “John Doe” as the referent. Two or more expressions that refer to the same discourse entity are said to co-refer.

Co-reference is vital for AI Search to understand the full context of a document, know who is being discussed in text, accurately summarize information, and answer complex questions in situations where pronouns or synonyms are used to refer to the same entity.

Example: 

Take the text: “Google announced a new AI model. The company expects it to revolutionize search. They plan to roll it out next year.” Co-reference resolution would link “Google”; “The company”; and “They” to the same entity (Google).

Keyword Extraction (TF-IDF, TextRank)

Keyword extraction is an automated information-processing task that identifies the most important words or phrases in a text to provide a summary of the text. Two keyword extraction techniques include:

  • TF-IDF (Term Frequency-Inverse Document Frequency): A statistical measure that evaluates how relevant a word is to a document within a collection of documents. It increases with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
  • TextRank: A graph-based ranking algorithm that identifies important sentences or keywords by analyzing the co-occurrence of words.

Both are important for understanding the main topics of a document, optimizing content for specific keywords, and informing content strategy for SEO and AI Search.

Example: 

For a blog post titled “The Future of AI in SEO,” keyword extraction might identify terms like “AI”; “SEO”; “future”; “search”; “optimization”; “ranking”; etc.
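The TF-IDF formula can be implemented directly from its definition. In the toy corpus below, “future” outscores “ai” for the first document because “ai” appears throughout the corpus and its inverse document frequency is correspondingly lower.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    # TF: relative frequency of the term within this document.
    tf = Counter(doc)[term] / len(doc)
    # IDF: log of corpus size over the number of documents containing
    # the term; zero if no document contains it.
    containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf

corpus = [
    ["ai", "seo", "future", "ai"],
    ["seo", "ranking", "links"],
    ["ai", "search", "overviews"],
]
doc = corpus[0]
print(tf_idf("ai", doc, corpus), tf_idf("future", doc, corpus))
```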

Topic Modeling

Topic modeling algorithms discover abstract “topics” that occur in a collection of documents. They automatically cluster words that often occur together in the documents, with the goal of identifying groups of words and the underlying themes and topics.


Some of the more popular models include:

  • Latent Dirichlet Allocation (LDA): A generative probabilistic model that assumes documents are a mixture of topics and topics are a mixture of words 
  • Non-negative Matrix Factorization (NMF): A linear algebra technique that decomposes a document-term matrix into two matrices, representing document-topic and topic-word distributions (this and LDA are both useful for topic modeling on lengthy textual data)
  • BERT-Based Topic Modeling (BERTopic): Leverages Transformer embeddings to create dense document representations, then clusters these embeddings to find topics

Topic modeling is useful for content-gap analysis, understanding user intent across queries, grouping similar content, and informing content-cluster strategies for SEO.

Example: 

Analyzing a set of SEO articles might reveal topics such as “Link Building Strategies,” “On-Page SEO Optimization,” “Technical SEO Audits,” and “Content Marketing for SEO.”

Sentiment Analysis

Sentiment analysis (or opinion mining) determines the emotional tone behind a piece of text, be it positive, negative, or neutral. 

In SEO, sentiment analysis can be used to analyze customer reviews, social media mentions, and competitor content to gauge brand perception and identify areas for improvement. For AI Search, understanding sentiment can influence result ranking and personalized recommendations.

Example: 

Here’s an example of sentiment analysis on customer reviews: 

  • “This tool is amazing, highly recommend!” = Positive 
  • “The customer support was terrible.” = Negative 
  • “The article provided information.” = Neutral
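A bare-bones lexicon approach illustrates the idea: count positive and negative word matches and compare. The word lists here are tiny hand-made assumptions; real sentiment systems use trained classifiers that handle negation, sarcasm, and context.

```python
# Tiny illustrative lexicons; real systems use large curated word lists
# or trained models.
POSITIVE = {"amazing", "great", "recommend", "excellent"}
NEGATIVE = {"terrible", "awful", "broken", "worst"}

def lexicon_sentiment(text):
    # Strip trailing punctuation, lowercase, and score by lexicon overlap.
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("This tool is amazing, highly recommend!"))
print(lexicon_sentiment("The customer support was terrible."))
```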

Text Summarization

Text summarization condenses longer texts into shorter, coherent versions. There are two main methods:

  • Extractive summarization identifies and extracts key sentences or phrases directly from the original text to form the summary.
  • Abstractive summarization generates new sentences and phrases to summarize the important information, so the wording may not appear in the original text. This method, which typically requires advanced natural language understanding (NLU) models, often gives better results when the source material is confusing or unstructured.

Summarization is critical to generating AI Overviews, creating meta descriptions, summarizing long articles for quick review, and producing concise content snippets for AI Search results.

Example: 

For a long article called “Machine Learning in Search Engines,” an extractive summary might pick out the main topic sentences, while an abstractive summary might synthesize a new, concise overview.
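Extractive summarization can be approximated with a classic frequency heuristic: score each sentence by how common its words are across the whole text, then keep the top sentences in their original order. This is a minimal sketch; abstractive summarization, by contrast, requires a generative language model.

```python
import re
from collections import Counter

def extractive_summary(text, n=1):
    # Split into sentences, score each by total corpus-wide word
    # frequency, and keep the top-n sentences in original order.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freqs = Counter(re.findall(r"\w+", text.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freqs[w] for w in re.findall(r"\w+", sentences[i].lower())),
    )
    keep = sorted(ranked[:n])
    return " ".join(sentences[i] for i in keep)

text = ("Machine learning powers modern search engines. "
        "Search engines use machine learning to rank results. "
        "Some people prefer tea.")
print(extractive_summary(text))
```

The off-topic sentence about tea scores lowest because its words appear nowhere else in the text.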

Entity Linking/Disambiguation

Entity linking (aka entity disambiguation) is the process of mapping named entities extracted from text to their unique, unambiguous entries in a knowledge base. 

Entity linking is crucial for semantic search, as it ensures that search engines understand the exact entity a query refers to, leading to more precise results and a richer understanding of content for AI systems.

Example: 

In the sentence, “Apple released a new iPhone,” “Apple” would be linked to Apple Inc. (the organization). In “I ate an apple,” “apple” would be linked to an apple (the fruit).
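A toy disambiguator shows the mechanics: compare the words around a mention against context words stored for each knowledge-base candidate. The mini knowledge base and its context sets are hand-made assumptions; production linkers compare embeddings against large knowledge bases like Wikidata.

```python
# Hypothetical two-entry knowledge base with hand-picked context words.
KB = {
    "apple_inc": {"label": "Apple Inc.",
                  "context": {"iphone", "released", "mac", "company"}},
    "apple_fruit": {"label": "apple (fruit)",
                    "context": {"ate", "eat", "juice", "tree"}},
}

def link_entity(sentence):
    # Pick the candidate whose context words overlap most with the sentence.
    words = set(sentence.lower().replace(".", "").split())
    best = max(KB.values(), key=lambda entry: len(words & entry["context"]))
    return best["label"]

print(link_entity("Apple released a new iPhone."))
print(link_entity("I ate an apple."))
```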

Text Classification

Text classification is the task of assigning predefined categories or labels to pieces of text, which allows computers to interpret and organize large amounts of data. It is highly versatile and can be used for:

  • Spam Detection: Classifying emails or comments as spam or not spam
  • Content Categorization: Assigning articles to topics (e.g., “technology,” “finance,” “health”)
  • User Intent Classification: Determining the purpose behind a user’s query

In SEO, text classification helps search engines categorize content for better organization, identify low-quality content, and understand the thematic relevance of pages. In AI Search, it aids in filtering irrelevant results and structuring information for better retrieval.

Example:

  • News article = “technology” category
  • Blog comment of “Great post!” = “not spam”

Word Embeddings

Word embeddings are dense vector representations of words that capture their semantic meanings. Words with similar meanings are located closer to each other in this multidimensional space, which helps with tasks such as text classification, sentiment analysis, and machine translation.

Gemini Embedding, an advanced embedding model developed by Google DeepMind and built on Gemini, offers a unified approach to generating rich, context-aware embeddings for various text granularities, from words to longer phrases. It does so for text in over 250 languages and can also handle code. 

Gemini embeddings can be used for tasks like classification, similarity search, clustering, ranking, and retrieval.

Example: 

The embedding for “king” would be semantically close to “queen” and “prince,” while the vector arithmetic “king – man + woman” would be close to “queen.”
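Similarity between embeddings is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors whose dimensions loosely mean “royalty,” “male,” and “female,” so the king/queen analogy holds by construction; real embeddings have hundreds of learned dimensions with no human-readable meaning.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hand-made toy vectors; dimensions are illustrative assumptions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "car":   [0.0, 0.2, 0.2],
}

# king - man + woman should land nearest to queen.
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
print(cosine(analogy, vectors["queen"]), cosine(analogy, vectors["car"]))
```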

Document Embeddings

Document embeddings (or sentence embeddings) are vector representations that capture the semantic meaning of entire documents or sentences. They allow for comparing the similarity between larger chunks of text.

Three methods for generating document embeddings are:

  • Doc2Vec: A technique that maps each document to a fixed-length vector, enabling the user to capture the semantic meaning of entire documents or paragraphs
  • Sentence-BERT: An improvement of the original BERT model that uses siamese and triplet network structures to generate semantically meaningful sentence embeddings
  • Universal Sentence Encoder (USE): A pre-trained text module providing sentence-embedding models that convert sentences into vector representations

Example: 

A document embedding for an article about “sustainable energy” would be close to embeddings for other articles on renewable resources, but far from articles about “ancient Roman history.”

Plagiarism Detection

Plagiarism detection identifies instances where text has been copied without proper attribution. Leveraging Gemini embeddings allows for a robust semantic plagiarism check, detecting not just exact copies but also highly similar rephrased content. This is vital for maintaining content originality and avoiding search engine penalties.

Example: 

Comparing a newly generated article against a corpus of existing articles to detect copied phrases or paragraphs based on semantic closeness.

Anomaly Detection

Anomaly detection identifies unusual patterns or outliers in data. In NLP for SEO, this can be applied to content quality by detecting:

  • Sudden drops in readability scores
  • Unusual keyword-stuffing patterns
  • Abnormally low or high word counts for a content type
  • Spikes in negative sentiment in reviews

This helps with proactive identification of potential content issues that could impact SEO performance or indicate a need for review, such as errors, unusual events, or potential fraud.

Example: 

A sudden spike in the use of a seemingly irrelevant keyword across multiple articles, or a review with an extreme sentiment score compared to others.
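A simple z-score check is a reasonable baseline for this kind of outlier detection: flag any value more than a couple of standard deviations from the mean. The word counts below are hypothetical; learned anomaly detectors handle multivariate and seasonal patterns that this sketch cannot.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    # Flag values more than `threshold` population standard deviations
    # from the mean.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical word counts for one content type; the last is abnormal.
word_counts = [1200, 1150, 1300, 1250, 1180, 4800]
print(zscore_outliers(word_counts))
```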

Readability Scoring

Readability scoring assesses how easy it is to read and understand a text. In SEO, optimizing for readability improves user experience, reduces bounce rates, and makes content more accessible, all of which indirectly signals quality to search engines and is a direct factor for AI Overviews.

Readability tests include the Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, and SMOG Index.

All of these metrics consider factors such as sentence length, word length, and syllable count to determine the approximate reading level of a text, or how many years of education a person would need to understand it. 

Example: 

A complex academic paper would have a low readability score, while a simple blog post would have a high one.
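The Flesch Reading Ease formula is simple enough to compute directly: 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words), where higher scores mean easier text. The syllable counter below is a rough vowel-group heuristic, an assumption this sketch makes in place of the pronunciation dictionaries real scorers use.

```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    # Flesch Reading Ease; higher scores indicate easier reading.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

simple = "The cat sat. The dog ran."
dense = "Multidimensional optimization necessitates comprehensive evaluation."
print(flesch_reading_ease(simple) > flesch_reading_ease(dense))
```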

Semantic Search (Vector Search)

Semantic search understands the meaning and intent behind a query, moving beyond keyword matching. It uses powerful embeddings like Gemini’s to find documents that are semantically similar to the query, even if exact keywords are absent. This is the cornerstone of modern AI-powered search engines, delivering more relevant and nuanced results.

Example: 

A search for “sustainable energy sources” might return results about “renewable power,” “solar panels,” or “wind farms,” even if the exact phrase “sustainable energy sources” isn’t present in the documents.

Where do you start with GEO?

There is no one-size-fits-all formula for visibility in AI Search, but the patterns are becoming clear. Structured data, semantic clarity, specific language, and technical accessibility all play a role in how content is evaluated and used by AI systems, which are trained to understand not just words but meaning, context, and usefulness.

GEO sits at the intersection of technical SEO, content strategy, and NLP. Getting it right means knowing how models interpret the web and giving them content they can trust, extract, and reuse. 

Creating this content requires a focus on relevance. Engineering the most relevant content for visibility involves looking at semantic scoring, optimizing passages, and testing vector embeddings. In the next chapter, we’ll look more deeply at the process of Relevance Engineering.

We don't offer SEO.

We offer
Relevance
Engineering.

If your brand isn’t being retrieved, synthesized, and cited in AI Overviews, AI Mode, ChatGPT, or Perplexity, you’re missing from the decisions that matter. Relevance Engineering structures content for clarity, optimizes for retrieval, and measures real impact. Content Resonance turns that visibility into lasting connection.

Schedule a call with iPullRank to own the conversations that drive your market.

MORE CHAPTERS

Part IV: Measurement and Reverse Engineering for GEO

» Chapter 12

» Chapter 13

» Chapter 14

» Chapter 15

Part V: Organizational Strategy for the GEO Era

» Chapter 16

» Chapter 17

Part VI: Risk, Ethics, and the Future of GEO

» Chapter 18

» Chapter 19

» Chapter 20

APPENDICES

The appendix includes everything you need to operationalize the ideas in this manual: downloadable tools, reporting templates, and prompt recipes for GEO testing. You’ll also find a glossary that breaks down technical terms and concepts to keep your team aligned. Use this section as your implementation hub.

//.eBook

The AI Search Manual

The AI Search Manual is your operating manual for being seen in the next iteration of Organic Search, where answers are generated, not linked.

Want digital delivery? Get the AI Search Manual in Your Inbox

Prefer to read in chunks? We’ll send the AI Search Manual as an email series—complete with extra commentary, fresh examples, and early access to new tools. Stay sharp and stay ahead, one email at a time.

Want the AI Search Manual

In Bite-Sized Emails?

We’ll break it up and send it straight to your inbox along with all of the great insights, real-world examples, and early access to new tools we’re testing. It’s the easiest way to keep up without blocking off your whole afternoon.