How to Build Visual Search with CLIP Embeddings Step by Step

Jake McCluskey

You can build a production-ready visual search system using OpenAI's CLIP model that runs entirely on your own hardware without any API costs or cloud dependencies. This tutorial walks you through installing CLIP locally, generating embeddings from your product images, storing them in a vector database, and exposing search functionality through a FastAPI endpoint that can handle thousands of queries per day. The entire stack runs on a single server with a GPU, giving you Pinterest-style visual search capabilities for roughly $0.50 per day in compute costs instead of $0.02-per-call API fees that quickly add up.

CLIP (Contrastive Language-Image Pre-training) is a neural network trained by OpenAI that converts both images and text into vectors in the same mathematical space. You can search for images using natural language queries like "red sneakers with white laces" or find visually similar products by comparing their vector representations.

The model was trained on 400 million image-text pairs scraped from the internet, learning to associate visual concepts with their textual descriptions. When you feed CLIP an image, it outputs a 512-dimensional vector. Text input? Another 512-dimensional vector. If the image and text describe similar concepts, those vectors point in similar directions.

This shared embedding space is what makes visual search possible. You can encode your entire product catalog once, store those vectors, and then search by encoding a text query and finding the nearest image vectors. The distance between vectors tells you how semantically similar they are.

Why Local CLIP Deployment Matters for Developers and Businesses

Cloud vision APIs charge per request, which seems cheap until you scale. Google Cloud Vision charges $1.50 per 1,000 images for label detection. If you're running a product catalog with 50,000 items and re-indexing weekly, that's $75 per week or $3,900 annually just for embeddings, before any search queries.

Running CLIP locally on a single NVIDIA RTX 4090 can process approximately 2,400 images per hour at full resolution. For a 50,000-item catalog, that's about 21 hours of compute at roughly $0.50 per hour, totaling about $10.50 for a complete re-index. The hardware pays for itself in three months.

Beyond cost, local deployment gives you complete control over model versions, zero latency from network calls, and no rate limits or service outages to worry about. You're also not sending potentially sensitive product images or user queries to third-party servers. For businesses handling proprietary designs or operating in regulated industries, this matters more than the cost savings.

How to Build Your Local CLIP Visual Search System

This implementation uses Python 3.10+, PyTorch, and the official OpenAI CLIP repository. You'll build a complete pipeline from image processing through to a queryable API endpoint.

Install Dependencies and Load the CLIP Model

Start by installing the required packages. CLIP requires PyTorch with CUDA support if you're using a GPU, which you should be for any serious workload.

pip install torch torchvision ftfy regex pillow
pip install git+https://github.com/openai/CLIP.git
pip install fastapi uvicorn faiss-cpu numpy

Load the CLIP model in your Python script. The ViT-B/32 variant balances speed and accuracy well for most use cases, processing images in roughly 15ms each on modern GPUs.

import torch
import clip
from PIL import Image
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

print(f"Model loaded on {device}")

Process Images and Generate Embeddings

Create a function that takes an image path, preprocesses it according to CLIP's requirements, and returns the embedding vector. CLIP expects images resized to 224x224 pixels with specific normalization.

def encode_image(image_path):
    image = Image.open(image_path).convert("RGB")
    image_input = preprocess(image).unsqueeze(0).to(device)
    
    with torch.no_grad():
        image_features = model.encode_image(image_input)
        image_features /= image_features.norm(dim=-1, keepdim=True)
    
    return image_features.cpu().numpy().flatten()

def encode_text(text_query):
    text_input = clip.tokenize([text_query]).to(device)
    
    with torch.no_grad():
        text_features = model.encode_text(text_input)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    
    return text_features.cpu().numpy().flatten()

The normalization step (dividing by the norm) ensures all vectors have length 1. This makes cosine similarity equivalent to dot product and speeds up search significantly.
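You can verify this equivalence with a few lines of numpy (a standalone check, separate from the pipeline):

import numpy as np

a = np.random.rand(512).astype("float32")
b = np.random.rand(512).astype("float32")

# Normalize to unit length, exactly as encode_image and encode_text do
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(cosine, np.dot(a, b)))  # True: dot product equals cosine similarity for unit vectors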

Build the Vector Database with FAISS

FAISS (Facebook AI Similarity Search) is a library optimized for fast vector similarity search. It's significantly faster than computing distances in pure Python and supports indexes that don't fit in RAM.

import faiss
import os
import json

def build_index(image_folder):
    image_files = [f for f in os.listdir(image_folder) 
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    
    embeddings = []
    metadata = []
    
    for idx, filename in enumerate(image_files):
        image_path = os.path.join(image_folder, filename)
        embedding = encode_image(image_path)
        embeddings.append(embedding)
        metadata.append({
            'id': idx,
            'filename': filename,
            'path': image_path
        })
        
        if (idx + 1) % 100 == 0:
            print(f"Processed {idx + 1}/{len(image_files)} images")
    
    embeddings_array = np.array(embeddings).astype('float32')
    
    dimension = embeddings_array.shape[1]
    index = faiss.IndexFlatIP(dimension)  # Inner product for cosine similarity
    index.add(embeddings_array)
    
    faiss.write_index(index, "clip_index.faiss")
    with open("metadata.json", "w") as f:
        json.dump(metadata, f)
    
    print(f"Index built with {len(embeddings)} images")
    return index, metadata

This creates a flat index that computes exact distances. For catalogs over 10,000 items, consider using faiss.IndexIVFFlat for approximate nearest neighbor search, which trades a small amount of accuracy for 10-20x speed improvements.

Create the FastAPI Search Endpoint

Now wrap your search functionality in a FastAPI application. This creates a production-ready REST API that other services can query. If you're building microservices or need to understand folder structure for production AI apps, check out how to structure a production AI application folder.

from fastapi import FastAPI, Query
from typing import List, Dict
import uvicorn

app = FastAPI()

# Load index and metadata at startup
index = faiss.read_index("clip_index.faiss")
with open("metadata.json", "r") as f:
    metadata = json.load(f)

@app.get("/search")
async def search_images(
    query: str = Query(..., description="Text search query"),
    top_k: int = Query(10, ge=1, le=100)
) -> List[Dict]:
    
    query_embedding = encode_text(query)
    query_embedding = np.array([query_embedding]).astype('float32')
    
    distances, indices = index.search(query_embedding, top_k)
    
    results = []
    for idx, distance in zip(indices[0], distances[0]):
        if 0 <= idx < len(metadata):  # FAISS pads with -1 when fewer than top_k results exist
            result = metadata[idx].copy()
            result['similarity_score'] = float(distance)
            results.append(result)
    
    return results

@app.get("/health")
async def health_check():
    return {"status": "healthy", "index_size": index.ntotal}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Start the server with `python app.py` and test it with a curl command:

curl "http://localhost:8000/search?query=red%20running%20shoes&top_k=5"

You'll get back JSON with the top 5 most similar images and their similarity scores, typically ranging from 0.15 to 0.35 for good matches with the ViT-B/32 model.
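The response body looks roughly like this (filenames and scores here are illustrative, not actual output):

[
  {"id": 42, "filename": "red_runner_01.jpg", "path": "products/red_runner_01.jpg", "similarity_score": 0.31},
  {"id": 17, "filename": "crimson_trainer.jpg", "path": "products/crimson_trainer.jpg", "similarity_score": 0.27}
]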

Scaling from Demo to Production with 10,000+ Item Catalogs

Once you're beyond proof-of-concept, you'll hit practical constraints around index size, search speed, and update frequency. A catalog with 25,000 products generates roughly 50MB of embedding data (25,000 items × 512 dimensions × 4 bytes per float), which fits comfortably in RAM on any modern server.

For indexes exceeding 100,000 items, switch to FAISS's IVF (Inverted File) indexes. These partition your vector space into clusters and only search relevant clusters for each query. Training an IVF index with 256 clusters on 100,000 images takes about 10 minutes but reduces search time from 200ms to 15ms per query.

def build_ivf_index(embeddings_array, n_clusters=256):
    dimension = embeddings_array.shape[1]
    quantizer = faiss.IndexFlatIP(dimension)
    index = faiss.IndexIVFFlat(quantizer, dimension, n_clusters,
                               faiss.METRIC_INNER_PRODUCT)  # match the inner-product/cosine metric used elsewhere
    
    index.train(embeddings_array)
    index.add(embeddings_array)
    index.nprobe = 32  # Search 32 clusters per query
    
    return index

For incremental updates without rebuilding the entire index, maintain a separate small index for new items and merge it periodically. Search both indexes and combine results, accepting a small performance penalty for the flexibility of real-time updates.
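Here's a minimal sketch of that two-index pattern, assuming the main index and metadata list from earlier are already loaded (the delta_index and delta_metadata names are just placeholders):

# Small secondary index for items added since the last full rebuild
delta_index = faiss.IndexFlatIP(512)
delta_metadata = []

def add_new_item(image_path):
    embedding = encode_image(image_path).astype("float32")
    delta_index.add(np.array([embedding]))
    delta_metadata.append({"filename": os.path.basename(image_path), "path": image_path})

def search_both(query_embedding, top_k=10):
    # Query both indexes, attach each hit's metadata, and merge by score
    scores_main, ids_main = index.search(query_embedding, top_k)
    scores_new, ids_new = delta_index.search(query_embedding, top_k)
    hits = [(s, metadata[i]) for s, i in zip(scores_main[0], ids_main[0]) if i >= 0]
    hits += [(s, delta_metadata[i]) for s, i in zip(scores_new[0], ids_new[0]) if i >= 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:top_k]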

Monitor your similarity score distributions. If most results cluster between 0.20 and 0.25, your threshold for "good match" needs tuning. In practice, scores above 0.28 typically indicate strong semantic matches, while anything below 0.18 is usually irrelevant.
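If you want to enforce a cutoff in the API itself, a simple post-filter on the results is enough; the 0.25 default below is an assumption you would tune against your own score distribution:

def filter_results(results, min_score=0.25):
    # Drop matches below the tuned similarity threshold before returning them
    return [r for r in results if r["similarity_score"] >= min_score]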

Common Pitfalls and Optimization Tips for Better Search Results

The quality of your search results depends heavily on your input images. CLIP was trained on web images with lots of context, so clean product shots on white backgrounds actually perform worse than lifestyle images showing products in use. If you're getting poor results, try adding contextual images per product rather than just the standard catalog shot; two to four work better than one.
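One simple way to use multiple shots per product is to index each as its own entry tagged with a shared product_id, then de-duplicate by product at query time. A sketch under that assumption (product_id is not part of the earlier metadata schema):

def index_product_images(product_id, image_paths, embeddings, metadata):
    # Add each shot of the product as its own vector; de-duplicate results
    # by product_id at query time so one product doesn't dominate the top-k
    for path in image_paths:
        embeddings.append(encode_image(path))
        metadata.append({
            "product_id": product_id,
            "filename": os.path.basename(path),
            "path": path
        })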

Text queries need to match how people actually search, not how you categorize products internally. "Shoes for hiking" works better than "footwear category: trail" because CLIP learned from natural language on the internet. Run your query logs through the system and manually review the top 5 results for your most common searches to calibrate expectations. And honestly, most teams skip this part.
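A small script that replays your most common queries against the running endpoint makes the review painless (the query list and URL below are placeholders; requires pip install requests):

import requests

common_queries = ["red running shoes", "waterproof hiking boots", "leather office chair"]

for q in common_queries:
    resp = requests.get("http://localhost:8000/search", params={"query": q, "top_k": 5})
    print(q)
    for hit in resp.json():
        print(f"  {hit['filename']}  ({hit['similarity_score']:.2f})")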

Batch processing is critical for initial indexing. Processing images one at a time leaves your GPU mostly idle. Instead, batch 32-64 images together and process them simultaneously. This increases throughput from 2,400 to roughly 8,500 images per hour on an RTX 4090.

def encode_images_batch(image_paths, batch_size=32):
    embeddings = []
    
    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i+batch_size]
        images = [preprocess(Image.open(p).convert("RGB")) 
                  for p in batch_paths]
        image_input = torch.stack(images).to(device)
        
        with torch.no_grad():
            features = model.encode_image(image_input)
            features /= features.norm(dim=-1, keepdim=True)
            embeddings.extend(features.cpu().numpy())
    
    return np.array(embeddings)

Consider the CLIP model variant carefully. ViT-B/32 is the standard choice, but ViT-B/16 offers 15-20% better accuracy at 4x the compute cost. For fashion and products where visual details matter, the upgrade is usually worth it. For coarse-grained search like room types or general objects, stick with ViT-B/32.
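Switching variants is a one-line change at load time, and everything downstream stays the same because ViT-B/16 also produces 512-dimensional embeddings; the smaller batch size here is just a guess to account for the heavier model:

model, preprocess = clip.load("ViT-B/16", device=device)
embeddings = encode_images_batch(image_paths, batch_size=16)  # heavier model, smaller batches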

If you need to understand how these embeddings relate to other multimodal AI approaches, the concepts overlap significantly with building visual RAG systems for PDFs with tables, where you're also working with combined visual and textual representations.

Cost Comparison: Local CLIP vs Cloud Vision APIs

Let's compare real numbers for a mid-size e-commerce site with 15,000 products, re-indexing monthly and handling 50,000 search queries per month. Cloud APIs charge separately for embedding generation and search queries, while local deployment has fixed hardware and electricity costs.

Using Google Cloud Vision API at $1.50 per 1,000 images for embedding and $0.002 per query, you'd pay $22.50 monthly for re-indexing plus $100 for queries, totaling $122.50 per month or $1,470 annually. AWS Rekognition has similar pricing that lands around $1,600 annually for this workload.

A local setup with an NVIDIA RTX 4070 (drawing roughly 200W under load) running 8 hours for monthly re-indexing plus 24/7 for queries at $0.12 per kWh costs approximately $19 monthly in electricity. The GPU itself costs around $600, meaning break-even happens in about six months. After that, you're saving over $1,200 annually.

The calculation shifts for higher query volumes. At 500,000 queries monthly, cloud costs jump to $1,022.50 monthly while local costs increase only marginally to about $22 monthly. The gap widens dramatically as you scale, which is why Pinterest, Etsy, and similar platforms all run their own embedding infrastructure.
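For reference, the arithmetic behind those cloud figures fits in a few lines of Python (the prices are the ones quoted above; swap in your own):

def cloud_monthly_cost(queries_per_month, catalog_size=15_000):
    # $1.50 per 1,000 images for a monthly re-index, plus $0.002 per search query
    return (catalog_size / 1000) * 1.50 + queries_per_month * 0.002

for q in (50_000, 500_000):
    print(f"{q:,} queries/month -> ${cloud_monthly_cost(q):,.2f}")
# 50,000 -> $122.50 and 500,000 -> $1,022.50, matching the figures above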

For developers just starting out or validating product-market fit, cloud APIs make sense because they eliminate upfront costs. But once you're processing more than 100,000 queries monthly, local deployment becomes dramatically cheaper. The crossover point sits around 80,000-90,000 queries per month for most pricing structures.

You now have a complete visual search system running locally without API dependencies. The FastAPI endpoint can integrate into any application, the FAISS index scales to millions of items with the right configuration, and your per-query cost is effectively zero after the initial hardware investment. Start with your existing product images, encode them in batches, and you'll have working visual search in a few hours rather than the weeks it would take to build from scratch without CLIP.

