How Do I Build a Knowledge Graph From My Company's Docs Using Claude?

"Knowledge graph" sounds like one of those buzzwords that means whatever the speaker wants it to mean. It's actually one of the most concrete ideas in AI: store your data as connected things, not as rows. Once you do, an LLM can answer questions your SQL database never could — "which of our enterprise customers are headquartered in cities where we have AEs?" — without you having to write the join. Here's how to build a small, working knowledge graph from your company's docs in an afternoon.
Why this matters
The job-to-be-done with a knowledge graph isn't "store data better." It's "let an LLM answer multi-hop questions without you scripting every possible query." Every consulting client I've helped with this has the same shape: a wiki with hundreds of pages, a CRM with thousands of accounts, and a CEO who keeps asking questions that span both. SQL can't answer "show me deals from accounts mentioned in the Q3 strategy doc." A knowledge graph can.
The trick is the data model: every fact is a triple — (subject, relation, object). "Acme → headquartered_in → Austin." "Austin → in_state → Texas." "Sarah → manages → Acme." Once enough triples exist, traversing the graph answers questions that would have needed a 12-table join.
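To make the traversal idea concrete, here is a minimal in-memory sketch with made-up triples; no database involved, just tuples and list comprehensions:

```python
# A triple store is just a list of (subject, relation, object) facts.
TRIPLES = [
    ("Acme", "headquartered_in", "Austin"),
    ("Austin", "in_state", "Texas"),
    ("Sarah", "manages", "Acme"),
    ("Globex", "headquartered_in", "Dallas"),
    ("Dallas", "in_state", "Texas"),
]

def objects(subject, relation):
    """All objects linked from `subject` by `relation`."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]

def companies_in_state(state):
    """Two-hop traversal: company -> city -> state."""
    return [
        s
        for s, r, city in TRIPLES
        if r == "headquartered_in" and state in objects(city, "in_state")
    ]

print(companies_in_state("Texas"))  # ['Acme', 'Globex']
```

Answering "companies in Texas" never required a join you wrote in advance; the hops are just repeated lookups over the same triple list.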
Before you start
You need:
- Python 3.10+ and a virtualenv.
- An Anthropic API key for the extraction step.
- Neo4j Desktop installed locally — free at neo4j.com/download. Or run it in Docker: `docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/changeme neo4j:5`.
- A folder of source documents — Markdown, Notion exports, or Google Docs as DOCX. I'll use `~/docs/` in the examples.
- About 60 minutes end-to-end.
```shell
python -m venv .venv && source .venv/bin/activate
pip install neo4j anthropic python-frontmatter
```

Step 1: Decide your schema first
Before any code: write down the entity types and relations you care about. For a company knowledge graph, a starting set:
- Entities: `Person`, `Company`, `Product`, `Project`, `Document`, `Topic`.
- Relations: `WORKS_AT`, `OWNS`, `BUILT`, `MENTIONS`, `DEPENDS_ON`, `HEADQUARTERED_IN`.
Six entity types and six relations is plenty for a first cut. You'll add more once you see what's missing — but starting tight stops the graph from drifting into "everything connected to everything."
Write this schema in a `schema.md` file next to your code. The LLM extractor reads it and stays on the rails.
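A hypothetical `schema.md` might look like the following; the exact wording is yours to choose, and the point is simply to enumerate the allowed types and relations so the extractor can be held to them:

```markdown
# Knowledge graph schema

Entities: Person, Company, Product, Project, Document, Topic

Relations:
- WORKS_AT: Person -> Company
- OWNS: Company -> Product
- BUILT: Person or Project -> Product
- MENTIONS: Document -> any entity
- DEPENDS_ON: Project or Product -> Project or Product
- HEADQUARTERED_IN: Company -> place name
```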
Step 2: Spin up Neo4j and create constraints
Open Neo4j Desktop or hit the Docker URL http://localhost:7474. Log in with the password you set, open a query tab, and create unique constraints so duplicate entities collapse instead of multiplying:
```cypher
CREATE CONSTRAINT person_name IF NOT EXISTS FOR (p:Person) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT company_name IF NOT EXISTS FOR (c:Company) REQUIRE c.name IS UNIQUE;
CREATE CONSTRAINT product_name IF NOT EXISTS FOR (p:Product) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT project_name IF NOT EXISTS FOR (p:Project) REQUIRE p.name IS UNIQUE;
CREATE CONSTRAINT document_title IF NOT EXISTS FOR (d:Document) REQUIRE d.title IS UNIQUE;
CREATE CONSTRAINT topic_name IF NOT EXISTS FOR (t:Topic) REQUIRE t.name IS UNIQUE;
```

Step 3: Extract triples from each document with Claude
Loop your source docs through Claude with a prompt that returns JSON triples conforming to your schema.
```python
import json
import pathlib

from anthropic import Anthropic

client = Anthropic()
SCHEMA = pathlib.Path("schema.md").read_text()

EXTRACT_PROMPT = """You are extracting a knowledge graph from one document.
Return ONLY valid JSON of shape:
{
  "entities": [{"type": "Person|Company|Product|Project|Document|Topic", "name": "..."}],
  "triples": [{"subject": "...", "relation": "WORKS_AT|OWNS|BUILT|MENTIONS|DEPENDS_ON|HEADQUARTERED_IN", "object": "..."}]
}
Schema reference:
""" + SCHEMA + """
Document:
"""

def extract(text: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2000,
        messages=[{"role": "user", "content": EXTRACT_PROMPT + text}],
    )
    return json.loads(msg.content[0].text)

docs_dir = pathlib.Path.home() / "docs"
all_triples = []
for path in docs_dir.glob("**/*.md"):
    body = path.read_text()
    result = extract(body)
    all_triples.append({"source": path.name, **result})

pathlib.Path("triples.json").write_text(json.dumps(all_triples, indent=2))
print(f"Extracted from {len(all_triples)} docs")
```

This runs serially. For larger corpora, batch it through Anthropic's Batch API — same output at half the price.
Step 4: Load the triples into Neo4j
```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "changeme"))

def load_triples(tx, triples):
    for t in triples:
        for ent in t["entities"]:
            tx.run(f"MERGE (n:{ent['type']} {{name: $name}})", name=ent["name"])
        for tr in t["triples"]:
            tx.run(
                f"MATCH (a {{name: $sub}}), (b {{name: $obj}}) "
                f"MERGE (a)-[:{tr['relation']}]->(b)",
                sub=tr["subject"], obj=tr["object"],
            )

with driver.session() as session:
    for batch in all_triples:
        session.execute_write(load_triples, [batch])
```

`MERGE` is the magic verb — it creates the entity or relation if missing and no-ops if present. That's how the graph stays clean as you re-run the extractor.
Step 5: Query the graph in plain English (with Claude)
The payoff. Wrap a query function that takes a question, asks Claude to translate it to Cypher, runs it, and summarizes the result.
```python
def ask(question: str) -> str:
    schema_q = """Translate this question into a single Cypher query against the schema. Return ONLY the Cypher.
Schema:
""" + SCHEMA + "\n\nQuestion: " + question
    msg = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=400,
        messages=[{"role": "user", "content": schema_q}],
    )
    cypher = msg.content[0].text.strip().strip("`")
    cypher = cypher.removeprefix("cypher").strip()  # drop a ```cypher fence tag if present
    with driver.session() as session:
        rows = [r.data() for r in session.run(cypher)]
    answer_q = f"Question: {question}\nCypher: {cypher}\nResults: {rows}\n\nAnswer in one sentence."
    final = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=300,
        messages=[{"role": "user", "content": answer_q}],
    )
    return final.content[0].text

print(ask("Which projects mention Claude prompt caching?"))
```

Verify it worked
Three checks:
- `triples.json` has content — a quick `wc -l triples.json` confirms extraction ran.
- Neo4j has nodes and edges. Run `MATCH (n) RETURN count(n)` in Neo4j Browser; it should be > 0. Then `MATCH ()-[r]->() RETURN count(r)`.
- `ask()` returns a useful answer. Try a multi-hop question your SQL DB can't answer ("Which products are owned by companies that mention 'cost optimization' in their docs?"). If the answer references the right entities, the graph is doing real work.
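For reference, the multi-hop question above translates to a short Cypher traversal. A sketch, assuming documents carry `MENTIONS` edges to both the companies and the topics they discuss:

```cypher
MATCH (c:Company)-[:OWNS]->(p:Product),
      (d:Document)-[:MENTIONS]->(c),
      (d)-[:MENTIONS]->(:Topic {name: 'cost optimization'})
RETURN DISTINCT p.name;
```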
Where this breaks
- Entity duplication when names are inconsistent. "Anthropic" vs "Anthropic, Inc." vs "anthropic" become three nodes. Pre-normalize names (lowercase, strip suffixes) before `MERGE`, or run a deduplication pass with Claude after load.
- Schema drift over time. Documents from a new domain produce relations you didn't define. Either add them to `schema.md` and re-extract, or have the extractor reject anything off-schema (stricter, more reliable).
- Cypher hallucination. Claude occasionally writes Cypher that references a label you don't have. Always validate the query with `EXPLAIN` before running it on a big graph.
- Cost on large corpora. A 5,000-document extract at full price gets expensive. Use the Batch API for 50% off.
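The pre-normalization fix for duplicate entities is only a few lines. A sketch; the suffix list here is an assumption you should tune to your own data:

```python
import re

# Hypothetical list of corporate suffixes to strip; extend for your corpus.
CORP_SUFFIXES = re.compile(r",?\s+(inc|llc|ltd|corp|co)\.?$", re.IGNORECASE)

def normalize_name(name: str) -> str:
    """Canonical entity key: trim whitespace, drop corporate suffixes, lowercase."""
    name = name.strip()
    name = CORP_SUFFIXES.sub("", name)
    return name.lower()

print(normalize_name("Anthropic, Inc."))  # anthropic
print(normalize_name("anthropic"))        # anthropic
```

Run every entity name through this before `MERGE` and the three "Anthropic" variants collapse into one node.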
What to try next
Let's talk about your AI + SEO stack
If you'd rather skip the how-to and have it shipped for you, that's what I do. Start a conversation and we'll figure out the fastest path to results.
Let's Talk