Embedding APIs for RAG: Model Comparison & Implementation Guide (2026)

TL;DR

Embeddings are the backbone of every Retrieval-Augmented Generation system. This guide compares the top embedding models of 2026 — OpenAI text-embedding-3-large, Cohere embed-v4, Voyage 3.5, and BGE-M3 — with benchmarks, pricing, and dimensions. You’ll learn how to generate embeddings, build a complete RAG pipeline in Python, choose the right chunking strategy, pick a vector database, optimize retrieval with hybrid search and reranking, and keep costs under control at scale. Whether you’re building your first RAG prototype or optimizing a production system handling millions of queries, this guide covers what you need.

What Are Embeddings and Why They Matter for RAG

An embedding is a numerical representation of text as a dense vector — a list of floating-point numbers that captures the semantic meaning of a passage. Two pieces of text with similar meanings produce vectors that are close together in the embedding space, even if they use entirely different words.

This property is what makes embeddings essential for RAG. When a user asks a question, you embed the query, search your document embeddings for the most similar vectors, and pass the matching text chunks to a language model as context. The LLM then generates an answer grounded in your actual data rather than relying solely on its training knowledge.

Why RAG Needs Good Embeddings

The quality of your embedding model directly determines the quality of your RAG system. If the embeddings fail to capture the semantic relationship between a user’s query and the relevant document passage, no amount of prompt engineering or LLM quality will save you. Garbage retrieval in, garbage answers out.

Here’s what good embeddings give you:

  • Semantic matching: A query like “how to fix memory leaks in Python” should match documents about “debugging Python memory issues” and “garbage collection problems,” not just documents containing the exact phrase “memory leaks.”
  • Cross-lingual understanding: Multilingual models can match a question in English with an answer in Japanese, enabling retrieval across language barriers.
  • Efficient search: Dense vectors enable approximate nearest neighbor (ANN) search algorithms that can query billions of documents in milliseconds.
  • Compact representation: A 500-word paragraph gets compressed into a single vector of 1024-3072 numbers, making storage and comparison computationally tractable.
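
"Close together" is usually measured with cosine similarity. Here's a minimal sketch using toy 3-dimensional vectors — real models produce hundreds to thousands of dimensions, and the values below are made up purely for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only)
memory_leaks = [0.91, 0.12, 0.33]
gc_problems  = [0.88, 0.15, 0.35]   # similar meaning, different words
cooking      = [0.05, 0.94, 0.20]   # unrelated topic

print(cosine_similarity(memory_leaks, gc_problems))  # high, near 1.0
print(cosine_similarity(memory_leaks, cooking))      # much lower
```

The same function works unchanged on 1024-dimension API vectors; only the list length grows.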

The Embedding-to-Answer Pipeline

At a high level, a RAG system follows this flow:

  1. Indexing phase (offline): Split your documents into chunks, embed each chunk, store the vectors in a database.
  2. Query phase (online): Embed the user’s question, search for the nearest vectors, retrieve the original text, pass it to an LLM to generate an answer.

Each step has multiple design decisions that affect accuracy, latency, and cost. We’ll cover all of them.

Top Embedding Models Comparison (2026)

The embedding model landscape has matured considerably. Here’s how the leading options stack up:

| Model | Provider | Dimensions | Max Tokens | MTEB Retrieval Avg | Price (per 1M tokens) | Multilingual | Key Strength |
|---|---|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 (configurable) | 8,191 | 67.3 | $0.13 | Limited | Flexible dims, wide adoption |
| embed-v4 | Cohere | 1024 | 512 | 66.8 | $0.10 | 100+ languages | Multilingual, built-in search/doc types |
| voyage-3.5 | Voyage AI | 1024 | 32,000 | 67.1 | $0.06 | Moderate | Long context, code retrieval |
| BGE-M3 | BAAI | 1024 | 8,192 | 65.2 | Free (self-hosted) | 100+ languages | Open source, no API costs |
| Gecko | Google | 768 | 2,048 | 64.5 | $0.025 | Moderate | Low cost, Google ecosystem |

Embedding model comparison chart showing MTEB scores, max tokens, and pricing for OpenAI text-embedding-3-large, Cohere embed-v4, Voyage 3.5, BGE-M3, and Google Gecko

Key Takeaways from the Comparison

OpenAI text-embedding-3-large remains the default choice for many teams. Its configurable dimensionality is unique — you can request 256-dimension vectors for low-cost applications or 3072 for maximum accuracy, all from the same model. The OpenAI SDK is ubiquitous and well-documented.

Cohere embed-v4 is the multilingual champion. If your documents span multiple languages, Cohere’s model handles cross-lingual retrieval natively. It also supports explicit input types (search_document vs search_query), which improves retrieval quality by optimizing the embedding based on its intended use.

Voyage 3.5 stands out for two reasons: an extremely long 32,000-token context window (useful for embedding entire documents without chunking) and excellent code retrieval performance. If you’re building a code search system, Voyage is worth serious consideration. The pricing is also very competitive.

BGE-M3 (from the Beijing Academy of AI) is the leading open-source model. It supports dense, sparse, and multi-vector retrieval in a single model, making it uniquely versatile. Running it yourself requires GPU infrastructure but eliminates per-token costs entirely — a major advantage at scale.

Generating Embeddings with Python

Let’s get practical. Here’s how to generate embeddings using the OpenAI SDK, which is the most common approach and is also compatible with aggregation platforms that expose an OpenAI-compatible endpoint.

Basic Embedding Generation

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def get_embedding(text: str, model: str = "text-embedding-3-large", dimensions: int = 1024) -> list[float]:
    """Generate an embedding for a single text string."""
    response = client.embeddings.create(
        input=text,
        model=model,
        dimensions=dimensions  # Only supported by OpenAI's -3 models
    )
    return response.data[0].embedding

# Example usage
query_embedding = get_embedding("How does photosynthesis work?")
print(f"Embedding dimension: {len(query_embedding)}")  # 1024
print(f"First 5 values: {query_embedding[:5]}")

Batch Embedding for Efficiency

The API accepts up to 2,048 texts in a single request. Always batch your embeddings to minimize HTTP overhead:

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def get_embeddings_batch(
    texts: list[str],
    model: str = "text-embedding-3-large",
    dimensions: int = 1024,
    batch_size: int = 2048
) -> list[list[float]]:
    """Generate embeddings for a list of texts in batches."""
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        response = client.embeddings.create(
            input=batch,
            model=model,
            dimensions=dimensions
        )
        # Sort by index to maintain original order
        batch_embeddings = [item.embedding for item in sorted(response.data, key=lambda x: x.index)]
        all_embeddings.extend(batch_embeddings)

    return all_embeddings

# Embed 5000 document chunks
chunks = ["Document chunk text here..."] * 5000
embeddings = get_embeddings_batch(chunks)
print(f"Generated {len(embeddings)} embeddings")

Using Cohere’s Input Types

Cohere’s API distinguishes between documents and queries, which improves retrieval quality:

import cohere

co = cohere.ClientV2(api_key="your-cohere-key")

# Embed documents (during indexing)
doc_response = co.embed(
    texts=["Photosynthesis converts sunlight into chemical energy..."],
    model="embed-v4",
    input_type="search_document",
    embedding_types=["float"]
)
doc_embedding = doc_response.embeddings.float_[0]

# Embed queries (during search)
query_response = co.embed(
    texts=["How do plants make food from sunlight?"],
    model="embed-v4",
    input_type="search_query",
    embedding_types=["float"]
)
query_embedding = query_response.embeddings.float_[0]

Using an Aggregation Endpoint

If you use an API aggregation platform like Ofox.ai, you can access multiple embedding models through a single OpenAI-compatible endpoint. This simplifies switching between models:

from openai import OpenAI

# Point to the aggregation endpoint
client = OpenAI(
    api_key="your-aggregation-key",
    base_url="https://api.ofox.ai/v1"
)

# Switch models by changing a single parameter
openai_embedding = client.embeddings.create(
    input="test text",
    model="text-embedding-3-large",
    dimensions=1024
)

# Same client, different model
cohere_embedding = client.embeddings.create(
    input="test text",
    model="embed-v4"
)

Building a RAG Pipeline Step by Step

Let’s build a complete RAG pipeline from scratch. We’ll use OpenAI embeddings, Qdrant as the vector database, and Claude for answer generation.

Step 1: Document Loading

from pathlib import Path

def load_documents(directory: str) -> list[dict]:
    """Load text documents from a directory."""
    docs = []
    for filepath in Path(directory).glob("**/*.txt"):
        text = filepath.read_text(encoding="utf-8")
        docs.append({
            "id": str(filepath),
            "text": text,
            "metadata": {
                "filename": filepath.name,
                "directory": str(filepath.parent)
            }
        })
    return docs

documents = load_documents("./knowledge_base")
print(f"Loaded {len(documents)} documents")

Step 2: Chunking

def recursive_chunk(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    """
    Split text into chunks using recursive character splitting.
    Uses approximate token count (1 token ≈ 4 characters).
    """
    max_chars = max_tokens * 4
    overlap_chars = overlap * 4
    separators = ["\n\n", "\n", ". ", " "]

    def split_text(text: str, separators: list[str]) -> list[str]:
        if len(text) <= max_chars:
            return [text.strip()] if text.strip() else []

        sep = separators[0] if separators else " "
        remaining_seps = separators[1:]

        parts = text.split(sep)
        chunks = []
        current_chunk = ""

        for part in parts:
            candidate = f"{current_chunk}{sep}{part}" if current_chunk else part

            if len(candidate) <= max_chars:
                current_chunk = candidate
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                    current_chunk = ""  # Reset to avoid duplicating text
                # If a single part is too long, split it with finer separators,
                # falling back to a hard character split when none remain
                if len(part) > max_chars:
                    if remaining_seps:
                        chunks.extend(split_text(part, remaining_seps))
                    else:
                        chunks.extend(
                            part[j : j + max_chars].strip()
                            for j in range(0, len(part), max_chars)
                        )
                else:
                    current_chunk = part

        if current_chunk:
            chunks.append(current_chunk.strip())

        return chunks

    raw_chunks = split_text(text, separators)

    # Add overlap between consecutive chunks
    if overlap_chars > 0 and len(raw_chunks) > 1:
        overlapped = [raw_chunks[0]]
        for i in range(1, len(raw_chunks)):
            prev_tail = raw_chunks[i - 1][-overlap_chars:]
            overlapped.append(prev_tail + " " + raw_chunks[i])
        return overlapped

    return raw_chunks

Step 3: Embed and Store

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# Initialize clients
openai_client = OpenAI(api_key="your-api-key")
qdrant = QdrantClient(url="http://localhost:6333")

COLLECTION_NAME = "knowledge_base"
EMBEDDING_DIM = 1024

# Create collection
qdrant.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(
        size=EMBEDDING_DIM,
        distance=Distance.COSINE
    )
)

def index_documents(documents: list[dict]) -> int:
    """Chunk, embed, and store documents in Qdrant."""
    points = []

    for doc in documents:
        chunks = recursive_chunk(doc["text"])

        if not chunks:
            continue

        # Batch embed all chunks for this document
        response = openai_client.embeddings.create(
            input=chunks,
            model="text-embedding-3-large",
            dimensions=EMBEDDING_DIM
        )

        for j, (chunk, item) in enumerate(zip(chunks, sorted(response.data, key=lambda x: x.index))):
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=item.embedding,
                payload={
                    "text": chunk,
                    "doc_id": doc["id"],
                    "chunk_index": j,
                    **doc["metadata"]
                }
            ))

    # Upsert in batches of 100
    for i in range(0, len(points), 100):
        qdrant.upsert(
            collection_name=COLLECTION_NAME,
            points=points[i : i + 100]
        )

    return len(points)

total_chunks = index_documents(documents)
print(f"Indexed {total_chunks} chunks")

Step 4: Retrieval and Answer Generation

from openai import OpenAI
import anthropic

openai_client = OpenAI(api_key="your-openai-key")
claude_client = anthropic.Anthropic(api_key="your-anthropic-key")

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Retrieve the most relevant chunks for a query."""
    query_response = openai_client.embeddings.create(
        input=query,
        model="text-embedding-3-large",
        dimensions=EMBEDDING_DIM
    )
    query_vector = query_response.data[0].embedding

    results = qdrant.query_points(
        collection_name=COLLECTION_NAME,
        query=query_vector,
        limit=top_k,
        with_payload=True
    )

    return [
        {"text": point.payload["text"], "score": point.score, "doc_id": point.payload["doc_id"]}
        for point in results.points
    ]

def rag_answer(question: str) -> str:
    """Full RAG pipeline: retrieve context, then generate an answer."""
    # Retrieve relevant chunks
    chunks = retrieve(question, top_k=5)

    # Format context
    context = "\n\n---\n\n".join(
        f"[Source: {c['doc_id']} | Relevance: {c['score']:.3f}]\n{c['text']}"
        for c in chunks
    )

    # Generate answer with Claude
    response = claude_client.messages.create(
        model="claude-sonnet-4-6-20260301",
        max_tokens=2000,
        system="You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain enough information, say so clearly. Cite sources when possible.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer based on the context above:"
        }]
    )

    return response.content[0].text

# Use it
answer = rag_answer("What are the main causes of climate change?")
print(answer)

Chunking Strategies: Choosing the Right Approach

The way you split documents into chunks has a surprisingly large impact on RAG quality. Poor chunking leads to either incomplete context (chunks too small) or diluted relevance (chunks too large). Here are the three main approaches.

Fixed-Size Chunking

The simplest strategy: split text into chunks of N tokens with M tokens of overlap.

Pros: Fast, deterministic, easy to implement. Cons: Splits mid-sentence and mid-paragraph, breaking semantic coherence. Best for: Homogeneous content like log files, CSV data, or simple text where structure doesn’t matter.

import tiktoken

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token chunks with overlap."""
    enc = tiktoken.get_encoding("cl100k_base")  # Encoding used by OpenAI embedding models
    tokens = enc.encode(text)
    chunks = []

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        start = end - overlap

    return chunks

Recursive Character Splitting

The standard approach used by LangChain and most RAG frameworks. It tries to split on paragraph boundaries first, then sentences, then words — preserving as much semantic structure as possible.

Pros: Respects document structure, produces coherent chunks. Cons: Chunk sizes vary, may produce very small chunks from short sections. Best for: General-purpose text, articles, documentation.

This is the strategy we implemented in Step 2 above. It’s the recommended default for most use cases.

Semantic Chunking

The most sophisticated approach: use embeddings to detect topic shifts and place chunk boundaries where the topic changes.

Pros: Chunks are topically coherent, leading to better retrieval precision. Cons: Requires an embedding model pass during chunking (added cost and complexity), harder to debug. Best for: Long documents with multiple topics, legal documents, research papers.

import numpy as np
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def semantic_chunk(text: str, threshold: float = 0.3) -> list[str]:
    """Split text into chunks based on semantic similarity between sentences."""
    # Split into sentences
    sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]

    if len(sentences) <= 1:
        return [text]

    # Embed all sentences
    response = client.embeddings.create(
        input=sentences,
        model="text-embedding-3-large",
        dimensions=256  # Low dims is fine for similarity comparison
    )
    embeddings = [item.embedding for item in sorted(response.data, key=lambda x: x.index)]

    # Find breakpoints where consecutive sentences are dissimilar
    breakpoints = [0]
    for i in range(1, len(embeddings)):
        similarity = np.dot(embeddings[i - 1], embeddings[i]) / (
            np.linalg.norm(embeddings[i - 1]) * np.linalg.norm(embeddings[i])
        )
        if similarity < (1 - threshold):  # Low similarity = topic shift
            breakpoints.append(i)
    breakpoints.append(len(sentences))

    # Group sentences into chunks
    chunks = []
    for i in range(len(breakpoints) - 1):
        chunk_sentences = sentences[breakpoints[i]:breakpoints[i + 1]]
        chunks.append(". ".join(s.rstrip(".") for s in chunk_sentences) + ".")  # rstrip avoids a doubled final period

    return chunks

| Content Type | Chunk Size (tokens) | Overlap (tokens) | Strategy |
|---|---|---|---|
| General articles | 512 | 50 | Recursive |
| Technical documentation | 384 | 64 | Recursive |
| Legal documents | 768 | 100 | Semantic |
| Code files | Function-level | 0 | AST-aware |
| FAQ/Q&A pairs | Per question | 0 | Fixed (by item) |
| Chat transcripts | Per turn or group | 1 turn | Fixed (by turn) |

Vector Database Options

Your vector database stores the embeddings and performs similarity search. Here’s how the major options compare for production RAG systems.

Comparison Table

| Database | Type | Max Vectors | Hybrid Search | Filtering | Pricing Model | Best For |
|---|---|---|---|---|---|---|
| Pinecone | Managed | Billions | Yes (2024+) | Metadata | Per pod/serverless | Teams wanting zero ops |
| Qdrant | Self-hosted / Cloud | Billions | Yes | Payload filters | Free (self-hosted) / Usage | Performance-critical apps |
| Weaviate | Self-hosted / Cloud | Billions | Yes (BM25 + vector) | GraphQL filters | Free (self-hosted) / Usage | Complex filtering needs |
| pgvector | PostgreSQL extension | Millions | With pg_search | Full SQL | Free (self-hosted) | Teams already on PostgreSQL |
| Chroma | Embedded | Millions | No | Metadata | Free (open source) | Prototyping, small datasets |
| Milvus | Self-hosted / Cloud | Billions | Yes | Attribute filtering | Free (self-hosted) / Zilliz Cloud | Large-scale deployment |

pgvector: Minimal Infrastructure

If you already use PostgreSQL, pgvector is the lowest-friction option:

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://localhost/mydb")
register_vector(conn)
cur = conn.cursor()

# Create table with vector column
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        text TEXT NOT NULL,
        metadata JSONB,
        embedding vector(1024)
    )
""")

# Create HNSW index for fast search
cur.execute("""
    CREATE INDEX IF NOT EXISTS docs_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200)
""")
conn.commit()

# Insert a document (embedding_vector is a numpy array from your embedding
# model; register_vector adapts it to the vector column type)
cur.execute(
    "INSERT INTO documents (text, metadata, embedding) VALUES (%s, %s, %s)",
    ("Example document text", '{"source": "wiki"}', embedding_vector)
)
conn.commit()

# Search
cur.execute(
    "SELECT text, metadata, 1 - (embedding <=> %s::vector) AS similarity FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
    (query_vector, query_vector)
)
results = cur.fetchall()

Qdrant: High Performance

Qdrant offers excellent single-node performance and a clean Python client:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Create collection
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

# Search with metadata filtering
results = client.query_points(
    collection_name="docs",
    query=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="engineering"))]
    ),
    limit=10
)

Choosing the Right Database

For prototyping: Start with Chroma (embedded, zero setup) or pgvector (if you already have PostgreSQL).

For production under 10M vectors: pgvector with HNSW indexes handles this comfortably. One less service to manage.

For production over 10M vectors: Qdrant or Pinecone. You’ll need dedicated vector search infrastructure at this scale, and these are battle-tested.

For hybrid search as a priority: Weaviate has the most mature hybrid search (BM25 + vector), with fine-grained control over the weighting between keyword and semantic results.

Retrieval Optimization: Hybrid Search and Reranking

Vector similarity search alone has a well-known weakness: it can miss results that are lexically relevant but semantically distant. A user searching for “error code E-4012” might not find the document about that specific error if the embedding model doesn’t capture the exact code. This is where hybrid search and reranking come in.

Hybrid search combines two retrieval methods:

  1. Dense retrieval (vector similarity): Finds semantically similar documents.
  2. Sparse retrieval (BM25/keyword matching): Finds documents with matching keywords.

The results are merged using Reciprocal Rank Fusion (RRF) or a weighted combination:

def hybrid_search(query: str, alpha: float = 0.7, top_k: int = 10) -> list[dict]:
    """
    Combine dense and sparse search results.
    alpha: weight for dense search (1.0 = pure vector, 0.0 = pure keyword)
    Sketch only: assumes `get_embedding`, `vector_db`, and `bm25_index`
    are defined elsewhere in your application.
    """
    # Dense search
    query_embedding = get_embedding(query)
    dense_results = vector_db.search(query_embedding, limit=top_k * 2)

    # Sparse search (BM25)
    sparse_results = bm25_index.search(query, limit=top_k * 2)

    # Reciprocal Rank Fusion
    scores = {}
    k = 60  # RRF constant

    for rank, result in enumerate(dense_results):
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0) + alpha * (1 / (k + rank + 1))

    for rank, result in enumerate(sparse_results):
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0) + (1 - alpha) * (1 / (k + rank + 1))

    # Sort by combined score and return top_k
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [{"id": doc_id, "score": score} for doc_id, score in ranked]

Reranking

Reranking adds a second stage that uses a cross-encoder model to score each retrieved chunk against the query. Cross-encoders are more accurate than bi-encoders (embedding models) because they process the query and document together, capturing fine-grained interactions.

import cohere

co = cohere.ClientV2(api_key="your-cohere-key")

def rerank_results(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    """Rerank retrieved documents using Cohere Rerank."""
    response = co.rerank(
        query=query,
        documents=documents,
        model="rerank-v3.5",
        top_n=top_n
    )

    return [
        {"index": result.index, "text": documents[result.index], "relevance_score": result.relevance_score}
        for result in response.results
    ]

# Usage in the RAG pipeline
initial_chunks = retrieve(query, top_k=25)  # Broad initial retrieval
chunk_texts = [c["text"] for c in initial_chunks]
reranked = rerank_results(query, chunk_texts, top_n=5)  # Narrow to best 5

The Retrieve-then-Rerank Pattern

Retrieve-then-rerank pipeline diagram showing the funnel from hybrid search (50 candidates) through cross-encoder reranking (top 5) to LLM answer generation

The optimal retrieval pipeline for production RAG follows this pattern:

  1. Broad retrieval: Fetch top 20-50 candidates using hybrid search (fast, recall-optimized).
  2. Rerank: Score all candidates with a cross-encoder, select top 3-5 (slow but precise).
  3. Generate: Pass the top chunks to the LLM for answer generation.

This two-stage approach gives you the best of both worlds: high recall from the broad first stage and high precision from the reranking second stage.

Cost Optimization for Embeddings at Scale

Embedding costs can add up quickly when you’re processing millions of documents. Here are concrete strategies to keep costs manageable.

1. Reduce Dimensions

OpenAI’s text-embedding-3-large supports native dimension reduction. Dropping from 3072 to 1024 dimensions:

  • Reduces storage by 66%
  • Reduces vector search latency by approximately 50%
  • Costs the same per API call (pricing is per token, not per dimension)
  • Only loses 1-2% retrieval accuracy on most benchmarks

Always use the dimensions parameter — there’s rarely a reason to store full 3072-dimension vectors.
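
The storage arithmetic is easy to verify. A quick sketch assuming float32 vectors (4 bytes per dimension), ignoring index overhead:

```python
def storage_bytes(num_vectors: int, dimensions: int, bytes_per_float: int = 4) -> int:
    """Raw storage for float32 vectors, excluding index overhead."""
    return num_vectors * dimensions * bytes_per_float

ten_million = 10_000_000
full = storage_bytes(ten_million, 3072)
reduced = storage_bytes(ten_million, 1024)

print(f"3072 dims: {full / 1e9:.1f} GB")     # ~122.9 GB
print(f"1024 dims: {reduced / 1e9:.1f} GB")  # ~41.0 GB
print(f"Savings: {1 - reduced / full:.0%}")  # roughly two-thirds
```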

2. Deduplicate Before Embedding

Don’t embed duplicate or near-duplicate content. Hash your chunks and skip duplicates:

import hashlib

def deduplicate_chunks(chunks: list[str]) -> list[str]:
    """Remove exact duplicate chunks."""
    seen = set()
    unique = []
    for chunk in chunks:
        h = hashlib.sha256(chunk.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(chunk)
    return unique

For large-scale deduplication, use MinHash or SimHash to also catch near-duplicates (chunks that differ by only a few words).
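
MinHash approximates Jaccard similarity over word shingles. A minimal exact-Jaccard sketch shows the underlying idea — a production system would hash the shingle sets with MinHash rather than compare them directly:

```python
def shingles(text: str, n: int = 3) -> set[str]:
    """Overlapping n-word shingles of a text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i : i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of two texts' shingle sets (1.0 = identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

chunk_a = "Embeddings are the backbone of every retrieval augmented generation system"
chunk_b = "Embeddings are the backbone of every retrieval augmented generation pipeline"
chunk_c = "Vector databases store embeddings and perform similarity search"

print(jaccard(chunk_a, chunk_b))  # high: near-duplicates
print(jaccard(chunk_a, chunk_c))  # near zero: different content
```

Chunks scoring above a threshold (0.8 is a common starting point) can be treated as duplicates and embedded only once.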

3. Use Open-Source Models at Scale

Beyond approximately 50 million tokens of embedding volume per month, self-hosting BGE-M3 on a GPU becomes cheaper than API calls:

| Monthly Volume | OpenAI text-embedding-3-large | Self-hosted BGE-M3 (A100) |
|---|---|---|
| 10M tokens | $1.30 | ~$15 (over-provisioned) |
| 100M tokens | $13.00 | ~$15 |
| 1B tokens | $130.00 | ~$25 (scales well) |
| 10B tokens | $1,300.00 | ~$100 (multiple GPUs) |

The crossover point is roughly 100-200M tokens per month, depending on your GPU costs.
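
The break-even volume is just the monthly GPU cost divided by the per-token API price. A quick sanity check — the $15/month GPU figure is an assumption about your infrastructure costs, so substitute your own:

```python
def breakeven_tokens(monthly_gpu_cost: float, api_price_per_million: float) -> float:
    """Monthly token volume at which self-hosting matches API cost."""
    return monthly_gpu_cost / api_price_per_million * 1_000_000

# $15/month GPU vs OpenAI at $0.13 per 1M tokens
tokens = breakeven_tokens(15.0, 0.13)
print(f"Break-even at ~{tokens / 1e6:.0f}M tokens/month")  # ~115M
```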

4. Cache Embeddings Aggressively

Never embed the same text twice. Implement an embedding cache:

import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    """Simple file-based embedding cache."""

    def __init__(self, cache_dir: str = ".embedding_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _key(self, text: str, model: str, dims: int) -> str:
        raw = f"{model}:{dims}:{text}"
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, text: str, model: str, dims: int) -> list[float] | None:
        path = self.cache_dir / f"{self._key(text, model, dims)}.json"
        if path.exists():
            return json.loads(path.read_text())
        return None

    def set(self, text: str, model: str, dims: int, embedding: list[float]) -> None:
        path = self.cache_dir / f"{self._key(text, model, dims)}.json"
        path.write_text(json.dumps(embedding))

For production, replace the file-based cache with Redis or another key-value store.

5. Batch and Parallelize

Always embed in batches. The difference between 1,000 individual API calls and a single batch call with 1,000 texts is enormous — not in token cost, but in HTTP overhead, rate limiting, and wall-clock time.

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(api_key="your-api-key")

async def embed_all_chunks(chunks: list[str], batch_size: int = 2048) -> list[list[float]]:
    """Embed all chunks with concurrent batch requests."""
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]

    async def embed_batch(batch: list[str]) -> list[list[float]]:
        response = await async_client.embeddings.create(
            input=batch,
            model="text-embedding-3-large",
            dimensions=1024
        )
        return [item.embedding for item in sorted(response.data, key=lambda x: x.index)]

    # Run up to 5 batches concurrently
    semaphore = asyncio.Semaphore(5)

    async def limited_embed(batch: list[str]) -> list[list[float]]:
        async with semaphore:
            return await embed_batch(batch)

    results = await asyncio.gather(*[limited_embed(b) for b in batches])
    return [emb for batch_result in results for emb in batch_result]

Common Pitfalls and How to Avoid Them

Building RAG systems involves many subtle decisions. Here are the mistakes that trip up most teams.

1. Chunks Too Large or Too Small

Problem: Chunks of 2000+ tokens dilute the relevant information with noise, reducing retrieval precision. Chunks under 100 tokens lack enough context for the LLM to generate good answers.

Fix: Start with 512 tokens and 50 tokens of overlap. Measure retrieval recall on a test set and adjust. If recall is low, try smaller chunks (256-384). If the LLM struggles with incomplete context, try larger chunks (768-1024) or retrieve more chunks.

2. Not Normalizing Text Before Embedding

Problem: Extra whitespace, inconsistent encoding, HTML artifacts, and other noise reduce embedding quality.

Fix: Clean text before embedding:

import re
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize text before embedding."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)  # Collapse whitespace
    text = re.sub(r"<[^>]+>", "", text)  # Strip HTML
    text = text.strip()
    return text

3. Using the Wrong Distance Metric

Problem: Using Euclidean distance when the model was trained for cosine similarity (or vice versa) degrades results.

Fix: Use cosine similarity for OpenAI and Cohere models. Check the model documentation for the recommended metric. Most modern embedding models are trained with cosine similarity in mind.
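
For unit-length vectors (which some APIs return already normalized), the metrics are directly related: squared Euclidean distance equals 2 − 2·cosine, so both produce the same ranking. A numeric check of that identity, assuming the vectors are L2-normalized first:

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

def sq_euclidean(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([0.3, 0.8, 0.5])
b = normalize([0.7, 0.2, 0.6])

# ||a - b||^2 == 2 - 2 * cos(a, b) holds for unit vectors
print(sq_euclidean(a, b), 2 - 2 * cosine(a, b))
```

The identity breaks if vectors are not normalized, which is exactly when picking the wrong metric silently degrades results.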

4. Ignoring the Query-Document Asymmetry

Problem: User queries are short (“how to fix memory leak”), while documents are long paragraphs. Embedding both the same way loses information.

Fix: Use models that support input types (Cohere’s search_query vs search_document). For models without explicit types, consider prepending “search query: ” to queries and “search document: ” to documents — this is a technique from the E5 model family that often helps even with other models.

5. Not Evaluating Retrieval Quality Separately from Generation Quality

Problem: When the RAG system gives bad answers, teams blame the LLM. Often, the retrieval step is the real problem — the right chunks were never found.

Fix: Evaluate retrieval and generation independently. Build a test set of queries with known relevant chunks. Measure retrieval recall@k (what percentage of relevant chunks appear in the top-k results). Only after retrieval quality is solid should you optimize the generation step.
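
Recall@k is simple to compute once you have labels. A minimal sketch, assuming each test query maps to a set of known-relevant chunk IDs (the query and chunk IDs below are hypothetical):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of known-relevant chunks appearing in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def mean_recall_at_k(results: dict[str, list[str]],
                     labels: dict[str, set[str]], k: int = 5) -> float:
    """Average recall@k over a test set of query -> retrieved chunk IDs."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)

# Hypothetical test set: two queries with labeled relevant chunks
labels = {"q1": {"c1", "c7"}, "q2": {"c3"}}
results = {"q1": ["c1", "c2", "c9", "c7", "c4"],
           "q2": ["c5", "c8", "c2", "c6", "c3"]}
print(mean_recall_at_k(results, labels, k=5))  # 1.0: all relevant chunks in top 5
```

Run this over a few dozen labeled queries after every pipeline change; if recall@5 drops, the problem is retrieval, not the LLM.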

6. Embedding Stale or Redundant Content

Problem: Re-embedding your entire corpus every time a few documents change wastes money and time.

Fix: Track document hashes. Only re-embed documents whose content has changed. Use your vector database’s upsert functionality to update individual vectors without rebuilding the entire index.

7. Skipping Metadata Filtering

Problem: Searching your entire vector space when the query clearly relates to a specific category, date range, or document type wastes computation and returns irrelevant results.

Fix: Attach metadata (category, date, source, language) to every chunk during indexing. Apply metadata filters before vector search to narrow the search space. This dramatically improves both speed and relevance.

8. Not Planning for Embedding Model Upgrades

Problem: When a better embedding model comes out, you have to re-embed everything — which is expensive and time-consuming.

Fix: Design your system to handle re-embedding. Keep the original text alongside the vectors. Use collection versioning (e.g., docs_v1, docs_v2) so you can run old and new embeddings in parallel during migration. API aggregation platforms like Ofox.ai make it easier to switch between embedding providers since you only need to change the model parameter, not the entire integration.

9. Overcomplicating the Pipeline

Problem: Teams add every optimization at once — hybrid search, reranking, query expansion, hypothetical document embeddings (HyDE) — creating a system that’s complex, slow, and hard to debug.

Fix: Start simple. Basic chunking + vector search + a good LLM will get you surprisingly far. Add complexity only when you’ve measured a specific problem. Profile your pipeline to identify the actual bottleneck before adding more stages.

10. Ignoring Latency Budgets

Problem: A RAG pipeline with embedding generation (100ms) + vector search (50ms) + reranking (200ms) + LLM generation (2000ms) adds up to 2.3+ seconds. For real-time applications, this may be too slow.

Fix: Set a latency budget upfront. If you need sub-second responses, skip reranking and use a faster LLM. Pre-compute query embeddings where possible. Use approximate nearest neighbor search with higher speed settings (lower accuracy is usually acceptable). Consider caching frequent queries and their answers.

Conclusion

Building an effective RAG system is about making smart trade-offs at every layer — from the embedding model you choose, to how you chunk documents, to whether you invest in hybrid search and reranking. The good news is that the tooling has matured enormously. In 2026, you can go from zero to a production-quality RAG pipeline in a weekend, thanks to excellent embedding APIs, robust vector databases, and clear best practices.

Start with the defaults: OpenAI text-embedding-3-large at 1024 dimensions, recursive chunking at 512 tokens, and your vector database of choice. Measure retrieval quality on a representative test set. Then add complexity — hybrid search, reranking, semantic chunking — only where measurements show it’s needed. The simplest pipeline that meets your accuracy and latency requirements is the best pipeline.