How Does AI Document Review Work in Litigation?

Jake McCluskey

AI document review in litigation uses machine learning to classify, prioritize, and surface relevant documents from massive collections. You feed the system a sample of coded documents (your "seed set"), the algorithm learns what makes a document responsive or privileged, then scores the rest of your corpus so you review high-priority material first and defensibly cull the rest. In 2025, large language models reached parity with traditional technology-assisted review (TAR) on accuracy while cutting per-document review cost by roughly 30 to 40 percent. That's why the buy decision shifted from "should we use AI?" to "which flavor of AI survives a Rule 26 challenge?"

What Is Technology-Assisted Review (TAR) and How Does It Work

TAR is the umbrella term for any process where software helps you decide which documents are responsive, privileged, or irrelevant. The classic implementation is predictive coding: you manually review a seed set of 500 to 2,000 documents, code them for responsiveness, and train a classifier to predict coding decisions for the rest of your corpus.

The algorithm looks for patterns in word frequency, metadata, custodian behavior, and document structure. When it sees a new email thread with similar language to your seed set, it assigns a relevance score. You review the highest-scoring documents first, feed those decisions back into the model (that's continuous active learning, or CAL), and stop when the marginal gain drops below your agreed threshold.
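To make the seed-set-then-score idea concrete, here's a toy sketch in pure Python. It swaps in a tiny Naive Bayes word-frequency model for the production classifier, so treat it as an illustration of the pattern matching described above, not how any particular platform works. The seed documents and function names are made up for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(seed_set):
    """Count word frequencies per label from (text, is_responsive) pairs."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, label in seed_set:
        counts[label].update(tokenize(text))
        docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab

def relevance_score(model, text):
    """Return an estimated P(responsive | text) between 0 and 1 (Laplace smoothing)."""
    counts, docs, vocab = model
    total_resp = sum(counts[True].values())
    total_not = sum(counts[False].values())
    log_odds = math.log(docs[True] / docs[False])
    for word in tokenize(text):
        if word not in vocab:
            continue  # ignore words the seed set never saw
        p_resp = (counts[True][word] + 1) / (total_resp + len(vocab))
        p_not = (counts[False][word] + 1) / (total_not + len(vocab))
        log_odds += math.log(p_resp / p_not)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical seed set for an employment dispute.
seed_set = [
    ("we need to discuss the termination of her employment", True),
    ("HR approved the severance and termination letter", True),
    ("lunch on friday at the usual place", False),
    ("the printer on floor three is out of toner", False),
]
model = train(seed_set)

unreviewed = [
    "anyone want lunch on friday",
    "draft termination letter attached for review",
]
# Review the highest-scoring documents first.
ranked = sorted(unreviewed, key=lambda d: relevance_score(model, d), reverse=True)
```

In a CAL workflow you would append each newly coded document to the seed set and retrain before re-ranking what's left.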

A mid-market employment dispute with 200,000 documents might require 2,000 seed-set reviews, 8,000 first-pass reviews, and 1,500 quality-control spot checks. You've now reviewed 11,500 documents instead of 200,000, under 6 percent of the corpus. And you can defend the process because you measured recall (the percentage of truly responsive documents you captured) at 75 to 85 percent, a range courts have accepted as proportional under Rule 26(b)(1).
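The arithmetic in that example is easy to check. The review counts below come from the paragraph above; the responsive-document counts passed to `recall` are hypothetical and only there to show the formula.

```python
corpus_size = 200_000
seed_set_reviews = 2_000
first_pass_reviews = 8_000
qc_spot_checks = 1_500

reviewed = seed_set_reviews + first_pass_reviews + qc_spot_checks
share_reviewed = reviewed / corpus_size

def recall(responsive_found, responsive_total):
    """Fraction of truly responsive documents that the review captured."""
    return responsive_found / responsive_total

print(reviewed)                  # 11500
print(f"{share_reviewed:.2%}")   # 5.75%

# Hypothetical counts: 4,000 responsive documents found out of 5,000 that exist.
example_recall = recall(4_000, 5_000)
print(example_recall)            # 0.8
```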

Why the TAR-to-LLM Shift Matters for Your Next Production

Traditional TAR classifiers are pattern matchers. They're fast and defensible, but they don't understand context the way a junior associate does. If your seed set codes all emails from the CFO as responsive, the classifier will flag every CFO email, even the lunch plans.

LLMs changed that. By 2025, general-purpose models such as GPT-4 and Claude 3 Opus were hitting 80+ percent accuracy on legal relevance tasks without fine-tuning. They read documents more the way humans do: they parse conditional clauses, recognize sarcasm, and distinguish between a draft and a final version. They flag privilege based on attorney involvement rather than keyword proximity.

The cost curve flipped because LLM-assisted review reduces second-pass review hours. You still need a seed set, but the model's contextual reasoning cuts false positives by 35 to 50 percent compared to keyword-plus-classifier workflows. That means fewer associates spending $300/hour to triage irrelevant hits, and faster time to production when you're racing a court-ordered deadline.

The tradeoff is explainability. A TAR classifier shows you term weights and document similarity scores. An LLM gives you a relevance judgment and a two-sentence explanation that may or may not satisfy opposing counsel. You'll want counsel sign-off on the prompts, plus a sampling protocol that proves the LLM isn't hallucinating privilege calls.

How AI Litigation Document Review Actually Works: The Four-Stage Workflow

Here's the process you're buying when you contract for AI-assisted review, whether you're using a legacy TAR platform or a 2025-vintage LLM tool.

Stage 1: Seed Set Creation and Manual Coding

You or your vendor pulls a statistically representative sample from your corpus. For a 100,000-document breach-of-contract case, that's typically 1,000 to 1,500 documents. Your review team codes each one for responsiveness, privilege, and any custom issue tags (hot docs, confidentiality tier, custodian).
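Here's a minimal sketch of pulling that sample, assuming a simple random draw is good enough for your corpus. Vendors may stratify by custodian or date range instead; the document ID format and the `draw_seed_set` name are placeholders.

```python
import random

def draw_seed_set(doc_ids, sample_size, seed=42):
    """Draw a simple random sample of document IDs without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced later
    return rng.sample(doc_ids, sample_size)

# Hypothetical 100,000-document corpus.
doc_ids = [f"DOC-{i:06d}" for i in range(100_000)]
seed_ids = draw_seed_set(doc_ids, 1_200)
```

Recording the random seed alongside the sample is a cheap way to show later that nobody hand-picked the training documents.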

This is where you teach the machine what "relevant" means for this case. If you skimp here and code only 300 documents, your model will miss edge cases and you'll burn hours on quality control later. Budget 40 to 60 hours of senior associate time for seed set coding. Add another 10 hours for a partner to spot-check and approve the training labels.

Stage 2: Model Training and Scoring

The platform ingests your seed set and trains a model. In TAR 1.0, that's a support vector machine or logistic regression classifier. In TAR 2.0 (CAL), the model updates continuously as you review. In LLM-assisted review, the vendor fine-tunes a foundation model or runs zero-shot classification with carefully engineered prompts.

The model scores every document in your corpus on a 0-to-100 relevance scale. Documents above 70 go into your high-priority queue. Documents below 30 are presumed irrelevant and go into a discard pile that you'll sample for quality control. Everything in the middle gets human review or a second-pass LLM check.
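The routing logic is simple enough to sketch. The 70 and 30 cutoffs come from the paragraph above; the queue names are made up, and a real platform will let you tune both thresholds.

```python
def route(score, high=70, low=30):
    """Assign a 0-to-100 relevance score to a review queue."""
    if score > high:
        return "high_priority"
    if score < low:
        return "discard_pile"    # presumed irrelevant, sampled later for QC
    return "middle_review"       # human review or a second-pass LLM check

scores = {"DOC-1": 92, "DOC-2": 70, "DOC-3": 45, "DOC-4": 12}
queues = {doc_id: route(score) for doc_id, score in scores.items()}
```

Note that a document scoring exactly 70 or exactly 30 lands in the middle queue here, since the text says "above 70" and "below 30."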

Stage 3: Prioritized Review and Continuous Learning

Your team reviews the high-scoring documents first. As you code, the platform updates the model (if you're using CAL) or logs your decisions for the next training cycle. You're looking for two things: recall (are we catching the responsive stuff?) and precision (are we wasting time on junk?).

Most platforms let you set a "review until" threshold: stop when the model estimates that you've already found 95 percent of the responsive documents, or when the last 500 documents reviewed yielded fewer than 10 responsive hits. That's your proportionality argument when opposing counsel complains you didn't review every document.
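The second stopping rule (fewer than 10 responsive hits in the last 500 documents) can be sketched in a few lines. The function name and the list-of-booleans input are assumptions for illustration; your platform tracks this for you.

```python
def should_stop(decisions, window=500, min_hits=10):
    """True when the last `window` coded documents yielded fewer than `min_hits` responsive.

    `decisions` is the review history in order: True for responsive, False otherwise.
    """
    if len(decisions) < window:
        return False  # not enough history to apply the rule yet
    return sum(decisions[-window:]) < min_hits

# Hypothetical history: an early run of responsive hits, then a long dry stretch.
history = [True] * 100 + [False] * 495 + [True] * 5
print(should_stop(history))  # True
```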

Stage 4: Quality Control and Privilege Pass

You sample the discard pile to measure recall. Common practice is to draw 1,500 to 2,000 documents at random from the low-scoring set and code them by hand (an elusion test). If more than 5 percent of that sample turns out to be responsive, you either lower your cutoff score or explain to the court why the marginal cost of reviewing another 20,000 documents outweighs the likely yield.
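Here's roughly how the discard-pile sample turns into a recall number. This is a sketch using a plain normal-approximation margin of error; your vendor or consulting expert may use a different interval method. Every count in the example call is hypothetical.

```python
import math

def estimate_recall(sample_size, sample_responsive, discard_size, responsive_found):
    """Estimate recall from a random sample of the discard pile.

    Returns (rate, margin, recall), where rate is the share of the sample that
    was responsive and margin is a 95% normal-approximation margin on that rate.
    """
    rate = sample_responsive / sample_size
    margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
    estimated_missed = rate * discard_size
    recall = responsive_found / (responsive_found + estimated_missed)
    return rate, margin, recall

# Hypothetical: 30 responsive documents in a 1,500-document sample of a
# 150,000-document discard pile, with 12,000 responsive documents already found.
rate, margin, recall = estimate_recall(1_500, 30, 150_000, 12_000)
print(f"{rate:.1%}")    # 2.0%
print(f"{recall:.0%}")  # 80%
```

With these numbers the sample sits well under the 5 percent line, and the point estimate for recall falls inside the 75 to 85 percent range.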

Privilege review is where LLMs shine and where you need the tightest controls. An LLM can spot attorney-client communication that a keyword search would miss (Subject: "Tuesday meeting" between in-house counsel and CEO). But errors cut both ways: if it hallucinates and marks a business email as privileged, you've withheld a document you owe the other side and padded your privilege log, and if it misses a privileged email that then gets produced, you're arguing waiver. You'll want a second-pass human review on every LLM privilege call. That limits your cost savings but keeps you out of sanctions hearings.

LLM Document Review vs TAR: What Changed and What You're Actually Buying

The practical difference comes down to how the system decides what's relevant. TAR classifiers look for statistical patterns. LLMs read for meaning. That matters most in three scenarios.

First, privilege review. A classifier flags emails with "attorney," "privileged," or "legal advice" in the text. An LLM reads the substance and catches the CEO asking the GC for litigation strategy in an email with Subject: "Q3 planning." In a 50,000-document regulatory response, that difference saves 15 to 25 hours of associate time hunting down privilege false negatives.

Second, conditional relevance. If your discovery request asks for "communications regarding the decision to terminate," a classifier will flag every email mentioning "terminate" or "termination," including IT tickets about software licenses. An LLM distinguishes between termination of employment and termination of a contract, cutting your false positive rate by 40 to 60 percent.

Third, cross-document reasoning. TAR treats every document independently. An LLM can connect a vague email ("let's discuss the issue we talked about") to a prior thread and code it correctly. That's powerful in email-heavy cases, but it also introduces risk: if the LLM infers context that isn't there, you've coded a document based on a hallucination.

You're paying for speed and judgment, but you're also paying for explainability risk. A TAR vendor will show you feature weights and similarity scores that satisfy most judges. An LLM vendor will show you a confidence score and a summary that may not survive a Daubert-style challenge if opposing counsel pushes hard. The defensibility gap is closing as vendors build audit trails and sampling protocols. But it's not gone yet.

How Does Predictive Coding Work in eDiscovery and When Does Counsel Sign Off

Predictive coding is the original TAR method: you code a seed set, train a binary classifier (responsive/not responsive), and use the model's predictions to prioritize review. It's defensible because it's been blessed by courts since Da Silva Moore v. Publicis Groupe in 2012, and because you can measure recall and precision with statistical confidence intervals.

Counsel signs off at three checkpoints. First, after seed set coding: you review the sample, confirm it's representative, and approve the training labels. Second, after the model's first scoring pass: you review the top 500 documents to make sure the model isn't wildly off. Third, after the discard-pile sampling: you confirm that recall is acceptable (75 to 85 percent is standard) and document your proportionality rationale.

The proportionality argument you're making is this: manual review of 200,000 documents would cost $600,000 in associate time and take 12 weeks. AI-assisted review costs $180,000, takes 4 weeks, and captures 80 percent of responsive material. The 20 percent you're missing is low-relevance, high-cost material that a court won't force you to produce under Rule 26(b)(1). You'll want a declaration from your vendor or a consulting expert spelling this out. Opposing counsel will absolutely challenge your methodology if the stakes are high enough.

LLM-assisted review adds a fourth checkpoint: prompt approval. You'll want outside counsel to review the instructions you're giving the model, especially for privilege calls. If your prompt says "flag any email where an attorney provides legal advice," you need to define "legal advice" precisely enough that the model doesn't hallucinate privilege on every email from someone with "Esq." in their signature. I've seen vendors charge $15,000 to $25,000 just for prompt engineering and validation testing on a 100,000-document case. And honestly, it's worth every dollar if it keeps you out of a waiver fight.
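For a sense of what counsel is actually approving, here's a hypothetical prompt template with the definition pulled out as its own reviewable block. The wording is illustrative only and is not a legal standard, and no model API is called here. The fingerprint helper is an assumption about how you might log which version of the prompt was signed off.

```python
import hashlib

# Illustrative wording only; the real definition is for counsel to draft and approve.
PRIVILEGE_DEFINITION = (
    "Treat a communication as legal advice only if an attorney acting as counsel "
    "gives, or is asked for, an opinion on legal rights, obligations, or "
    "litigation strategy. A name followed by 'Esq.' in a signature block is not "
    "enough on its own."
)

PROMPT_TEMPLATE = (
    "You are assisting with privilege review.\n"
    "Definition: {definition}\n"
    "Answer PRIVILEGED or NOT_PRIVILEGED, then give a one-sentence reason.\n"
    "Document:\n{document}"
)

def build_prompt(document_text):
    """Fill the approved template with one document's text."""
    return PROMPT_TEMPLATE.format(definition=PRIVILEGE_DEFINITION, document=document_text)

def prompt_fingerprint():
    """Short hash of template plus definition, to log which wording was approved."""
    approved_text = PROMPT_TEMPLATE + PRIVILEGE_DEFINITION
    return hashlib.sha256(approved_text.encode("utf-8")).hexdigest()[:12]

prompt = build_prompt("Subject: Q3 planning. Can we talk through our exposure on the claim?")
```

If anyone edits the definition mid-review, the fingerprint changes, which gives you a clean record of which documents were coded under which instructions.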

What This Costs and What Fails in Production

Pricing for AI-assisted review breaks into platform fees, processing fees, and review labor. A TAR platform charges $8 to $15 per gigabyte for processing and hosting, plus $0.02 to $0.08 per document for ML scoring. An LLM-assisted platform charges $0.10 to $0.25 per document because inference costs are higher. For a 100,000-document case at 2GB, the per-gigabyte fee is negligible ($16 to $30), so you're looking at roughly $2,000 to $8,000 in platform and ML costs on TAR, or $10,000 to $25,000 on an LLM platform, before you add review hours.
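You can run the platform math yourself with the rates quoted above. One assumption to flag: the sketch applies the same $8 to $15 per-gigabyte fee to the LLM platform, which the pricing above doesn't state either way.

```python
def platform_cost(docs, gigabytes, per_gb, per_doc):
    """Processing and hosting fee plus per-document scoring fee, in dollars."""
    return gigabytes * per_gb + docs * per_doc

DOCS, GB = 100_000, 2

tar_low = platform_cost(DOCS, GB, per_gb=8, per_doc=0.02)
tar_high = platform_cost(DOCS, GB, per_gb=15, per_doc=0.08)
# Assumption: the LLM platform charges the same per-gigabyte fee as the TAR platform.
llm_low = platform_cost(DOCS, GB, per_gb=8, per_doc=0.10)
llm_high = platform_cost(DOCS, GB, per_gb=15, per_doc=0.25)

print(f"TAR: ${tar_low:,.0f} to ${tar_high:,.0f}")  # TAR: $2,016 to $8,030
print(f"LLM: ${llm_low:,.0f} to ${llm_high:,.0f}")  # LLM: $10,016 to $25,030
```

The $10,000 to $25,000 range lines up with the LLM per-document rates. At this corpus size, the per-document fee drives the bill and the gigabyte fee barely registers.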

Review labor is where you save. Traditional manual review at $200 to $300 per hour runs $400,000 to $600,000 for 100,000 documents (about 2,000 reviewer hours). TAR cuts that to $150,000 to $250,000. LLM-assisted review cuts it further to $100,000 to $180,000 because you're reviewing fewer false positives. Against manual review, the LLM labor savings land somewhere between $220,000 and $500,000, which is why GCs are finally approving AI pilots after years of saying "maybe next case."
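Since every figure here is a range, the savings are a range too. The sketch below takes the worst case (cheapest baseline against the most expensive alternative) and the best case (the reverse) from the labor figures above. The helper name is made up.

```python
manual = (400_000, 600_000)
tar = (150_000, 250_000)
llm = (100_000, 180_000)

def savings_range(baseline, alternative):
    """Smallest and largest possible savings between two (low, high) cost ranges."""
    worst_case = baseline[0] - alternative[1]
    best_case = baseline[1] - alternative[0]
    return worst_case, best_case

print(savings_range(manual, tar))  # (150000, 450000)
print(savings_range(manual, llm))  # (220000, 500000)

# Reviewer hours implied by the manual figures: $400,000 at $200/hour.
implied_hours = manual[0] / 200
print(implied_hours)               # 2000.0
```

These are labor numbers only. Platform and ML fees come out of the savings, so net them against your vendor quote before you take the figure to your GC.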

The failure modes are predictable. First, insufficient seed set diversity: you code 500 documents from one custodian, the model learns that custodian's writing style, and it misses responsive material from everyone else. Second, scope creep: you train a model for responsiveness, then halfway through review you add three new issue tags and the model has no training data for them. Third, privilege over-coding: an LLM flags 8,000 documents as privileged, you don't have budget to review them all, and you produce a privilege log with 3,000 false positives that opposing counsel will absolutely challenge.

Look, if you're evaluating AI document review for the first time, the decision isn't whether to use it (you should). It's whether to go with a mature TAR platform or a newer LLM tool. TAR is more defensible and cheaper per document if you have a clean seed set and a well-defined request. LLM-assisted review is faster and more accurate if you're dealing with ambiguous requests, privilege-heavy material, or cross-document reasoning. For a detailed comparison of legal AI tools and what actually works in production, see our breakdown of AI contract review limitations.

The other decision is whether to build internal capability or hire it out. Most mid-market firms don't have the caseload to justify a full-time eDiscovery team, which means you're either training your litigation paralegals on Relativity or contracting with a vendor who'll run the whole process for a flat fee. Vendor pricing ranges from $80,000 to $250,000 for a typical 100,000-document case, depending on complexity and turnaround time. If you want to understand what you're actually paying for when you hire outside AI help, our law firm AI consulting cost guide walks through the line items and the markup you should expect.

You now have the mental model you need to evaluate your next AI-assisted review proposal. Ask your vendor how they're measuring recall, what their privilege review protocol looks like, and whether they'll provide a defensibility declaration if opposing counsel challenges your methodology. If they can't answer those questions with specifics, walk.
