
How Do I Build a Knowledge Graph From My Company's Docs Using Claude?

Jake McCluskey · Intermediate · 60 min read

"Knowledge graph" sounds like one of those buzzwords that means whatever the speaker wants it to mean. It's actually one of the most concrete ideas in AI: store your data as connected things, not as rows. Once you do, an LLM can answer questions your SQL database never could — "which of our enterprise customers are headquartered in cities where we have AEs?" — without you having to write the join. Here's how to build a small, working knowledge graph from your company's docs in an afternoon.

Why this matters

The job-to-be-done with a knowledge graph isn't "store data better." It's "let an LLM answer multi-hop questions without you scripting every possible query." Every consulting client I've helped with this has the same shape: a wiki with hundreds of pages, a CRM with thousands of accounts, and a CEO who keeps asking questions that span both. SQL can't answer "show me deals from accounts mentioned in the Q3 strategy doc." A knowledge graph can.

The trick is the data model: every fact is a triple of (subject, relation, object). "Acme → headquartered_in → Austin." "Austin → in_state → Texas." "Sarah → manages → Acme." Once enough triples exist, traversing the graph answers questions that would have needed a 12-table join.
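To see why, here's a minimal sketch in plain Python — hypothetical sample data, no database — answering the multi-hop question from the intro by traversing triples instead of writing joins:

```python
# Facts as (subject, relation, object) triples -- hypothetical sample data.
TRIPLES = [
    ("Acme", "headquartered_in", "Austin"),
    ("Globex", "headquartered_in", "Denver"),
    ("Sarah", "ae_covers", "Austin"),
    ("Acme", "segment", "enterprise"),
    ("Globex", "segment", "enterprise"),
]

def objects(subject, relation):
    """All objects linked from `subject` by `relation`."""
    return {o for s, r, o in TRIPLES if s == subject and r == relation}

# "Which of our enterprise customers are headquartered in cities where we have AEs?"
ae_cities = {o for s, r, o in TRIPLES if r == "ae_covers"}
answer = {
    s for s, r, o in TRIPLES
    if r == "headquartered_in"
    and o in ae_cities
    and "enterprise" in objects(s, "segment")
}
print(answer)  # {'Acme'}
```

Each hop is just a lookup over triples. A graph database does the same thing, only indexed and at scale.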

Before you start

You need:

  • Python 3.10+ and a virtualenv.
  • An Anthropic API key for the extraction step.
  • Neo4j Desktop installed locally — free at neo4j.com/download. Or run it in Docker: docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/changeme neo4j:5.
  • A folder of source documents — Markdown, Notion exports, or Google Docs as DOCX. I'll use ~/docs/ in the examples.
  • About 60 minutes end-to-end.
bash
python -m venv .venv && source .venv/bin/activate
pip install neo4j anthropic python-frontmatter

Step 1: Decide your schema first

Before any code: write down the entity types and relations you care about. For a company knowledge graph, a starting set:

  • Entities: Person, Company, Product, Project, Document, Topic.
  • Relations: WORKS_AT, OWNS, BUILT, MENTIONS, DEPENDS_ON, HEADQUARTERED_IN.

Six entity types and six relations is plenty for a first cut. You'll add more once you see what's missing — but starting tight stops the graph from drifting into "everything connected to everything."

Write this schema in a schema.md next to your code. The LLM extractor reads it and stays on the rails.
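If you haven't written one before, a minimal schema.md might look like the sketch below. The domain/range pairings are illustrative — adjust them to how your docs actually use each relation:

```markdown
# Knowledge graph schema

## Entity types
Person, Company, Product, Project, Document, Topic

## Relations
- WORKS_AT: Person → Company
- OWNS: Company → Product
- BUILT: Person or Company → Product or Project
- MENTIONS: Document → any entity
- DEPENDS_ON: Project or Product → Project or Product
- HEADQUARTERED_IN: Company → Topic (a place)
```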

Step 2: Spin up Neo4j and create constraints

Open Neo4j Desktop or hit the Docker URL http://localhost:7474. Log in with the password you set, open a query tab, and create unique constraints so duplicate entities collapse instead of multiplying:

cypher
CREATE CONSTRAINT person_name IF NOT EXISTS FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT company_name IF NOT EXISTS FOR (c:Company) REQUIRE c.name IS UNIQUE;
CREATE CONSTRAINT product_name IF NOT EXISTS FOR (p:Product) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT project_name IF NOT EXISTS FOR (p:Project) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT document_title IF NOT EXISTS FOR (d:Document) REQUIRE d.title IS UNIQUE;
CREATE CONSTRAINT topic_name IF NOT EXISTS FOR (t:Topic) REQUIRE t.name IS UNIQUE;

Step 3: Extract triples from each document with Claude

Loop your source docs through Claude with a prompt that returns JSON triples conforming to your schema.

python
import os
import json
import pathlib
from anthropic import Anthropic

client = Anthropic()
SCHEMA = pathlib.Path("schema.md").read_text()

EXTRACT_PROMPT = """You are extracting a knowledge graph from one document.

Return ONLY valid JSON of shape:
{
  "entities": [{"type": "Person|Company|Product|Project|Document|Topic", "name": "..."}],
  "triples":  [{"subject": "...", "relation": "WORKS_AT|OWNS|BUILT|MENTIONS|DEPENDS_ON|HEADQUARTERED_IN", "object": "..."}]
}

Schema reference:
""" + SCHEMA + """

Document:
"""

def extract(text: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2000,
        messages=[{"role": "user", "content": EXTRACT_PROMPT + text}],
    )
    raw = msg.content[0].text.strip()
    # Unwrap a Markdown code fence if the model added one despite the prompt.
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()
    return json.loads(raw)

docs_dir = pathlib.Path.home() / "docs"
all_triples = []
for path in docs_dir.glob("**/*.md"):
    body = path.read_text()
    result = extract(body)
    all_triples.append({"source": path.name, **result})

pathlib.Path("triples.json").write_text(json.dumps(all_triples, indent=2))
print(f"Extracted from {len(all_triples)} docs")

This runs serially. For larger corpora, batch it through Anthropic's Batch API — same output at half the price.

Step 4: Load the triples into Neo4j

python
import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "changeme"))

with open("triples.json") as f:
    all_triples = json.load(f)

def load_triples(tx, triples):
    for t in triples:
        for ent in t["entities"]:
            # The label is interpolated into the query string; that's safe here
            # only because labels come from our fixed schema, not user input.
            tx.run(f"MERGE (n:{ent['type']} {{name: $name}})", name=ent["name"])
        for tr in t["triples"]:
            # If either endpoint wasn't extracted as an entity, the MATCH finds
            # nothing and the relationship is silently skipped.
            tx.run(
                f"MATCH (a {{name: $sub}}), (b {{name: $obj}}) "
                f"MERGE (a)-[:{tr['relation']}]->(b)",
                sub=tr["subject"], obj=tr["object"],
            )

with driver.session() as session:
    for batch in all_triples:
        session.execute_write(load_triples, [batch])

MERGE is the magic verb — it creates the entity or relation if missing, no-ops if present. That's how the graph stays clean as you re-run the extractor.
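You can see the create-or-match behavior in miniature with a plain dict standing in for the node store (illustrative only — Neo4j's MERGE also matches whole relationship patterns):

```python
# A toy node store keyed the same way the unique constraints key Neo4j nodes.
nodes = {}

def merge_node(label, name):
    """Create the node if missing, return the existing one otherwise."""
    key = (label, name)
    if key not in nodes:
        nodes[key] = {"label": label, "name": name}
    return nodes[key]

merge_node("Company", "Acme")
merge_node("Company", "Acme")   # second call matches, doesn't duplicate
merge_node("Company", "Globex")
print(len(nodes))  # 2
```

Re-running the whole loader has the same property: existing nodes and relations match, new ones get created.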

Step 5: Query the graph in plain English (with Claude)

The payoff. Wrap a query function that takes a question, asks Claude to translate it to Cypher, runs it, and summarizes the result.

python
def ask(question: str) -> str:
    schema_q = """Translate this question into a single Cypher query against the schema. Return ONLY the Cypher.

Schema:
""" + SCHEMA + "\n\nQuestion: " + question

    msg = client.messages.create(model="claude-sonnet-4-5-20250929", max_tokens=400,
                                 messages=[{"role": "user", "content": schema_q}])
    cypher = msg.content[0].text.strip()
    # Unwrap a Markdown code fence if the model added one.
    if cypher.startswith("```"):
        cypher = cypher.strip("`").removeprefix("cypher").strip()

    with driver.session() as session:
        # .data() turns each Record into a plain dict that serializes cleanly
        # into the follow-up prompt.
        rows = [r.data() for r in session.run(cypher)]

    answer_q = f"Question: {question}\nCypher: {cypher}\nResults: {rows}\n\nAnswer in one sentence."
    final = client.messages.create(model="claude-sonnet-4-5-20250929", max_tokens=300,
                                   messages=[{"role": "user", "content": answer_q}])
    return final.content[0].text

print(ask("Which projects mention Claude prompt caching?"))

Verify it worked

Three checks:

  1. triples.json has content — a quick wc -l triples.json confirms extraction ran.
  2. Neo4j has nodes and edges. Run MATCH (n) RETURN count(n) in Neo4j Browser. Should be > 0. Then MATCH ()-[r]->() RETURN count(r).
  3. ask() returns a useful answer. Try a multi-hop question your SQL DB can't answer ("Which products are owned by companies that mention 'cost optimization' in their docs?"). If the answer references the right entities, the graph is doing real work.

Where this breaks

  • Entity duplication when names are inconsistent. "Anthropic" vs "Anthropic, Inc." vs "anthropic" become three nodes. Pre-normalize names (lowercase, strip suffixes) before MERGE, or run a deduplication pass with Claude after load.
  • Schema drift over time. Documents from a new domain produce relations you didn't define. Either add them to schema.md and re-extract, or have the extractor reject anything off-schema (stricter, more reliable).
  • Cypher hallucination. Claude occasionally writes Cypher that references a label you don't have. Always validate the query with EXPLAIN before running on a big graph.
  • Cost on large corpora. A 5,000-document extract at full price gets expensive. Use the Batch API for 50% off.
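The pre-normalization fix from the first bullet is easy to bolt onto the loader. A sketch — the suffix list is illustrative, extend it for your corpus:

```python
import re

# Corporate suffixes to strip before using the name as a MERGE key.
# Illustrative list -- extend for your corpus.
SUFFIXES = re.compile(r",?\s+(inc|llc|ltd|corp|co)\.?$", re.IGNORECASE)

def normalize(name: str) -> str:
    """Canonical form used as the MERGE key: trimmed, de-suffixed, lowercased."""
    return SUFFIXES.sub("", name.strip()).lower()

assert normalize("Anthropic, Inc.") == "anthropic"
assert normalize("anthropic") == "anthropic"
assert normalize("Acme Corp") == "acme"
```

MERGE on the normalized form and keep the original string as a display property, so the graph collapses duplicates without losing the human-readable name.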

Questions from readers

Do I have to use Neo4j?

No. Memgraph, ArangoDB, or even a SQLite-backed RDF store works. Neo4j is the easiest to install and the Cypher query language is the most LLM-friendly because Claude has seen lots of it in training.

How is this different from a vector database with RAG?

RAG retrieves chunks of text by similarity. A knowledge graph retrieves entities by relationship. They're complementary — RAG for 'find docs about X', graph for 'which X are connected to Y'. Production systems often run both.

Won't Claude hallucinate entities that aren't in the docs?

Sometimes. Two defenses: keep the schema tight (only allow relations you defined) and run a verification pass — give Claude back the extracted triples plus the source doc and ask 'are these all supported?' The false positive rate drops fast.

How big can a knowledge graph get before it slows down?

Neo4j handles millions of nodes/edges fine on a single machine if you have indexes on the lookup properties. The bottleneck is usually the extraction step, not the graph itself.

Can I update the graph incrementally as docs change?

Yes — that's why we use MERGE in the Cypher loader. Re-running the extractor on a changed doc adds new triples and dedupes existing ones. For deletions, you'll need a separate cleanup pass that finds orphan nodes.