
Optimizing Vector Database Performance at Scale

Practical strategies for improving vector search latency, indexing throughput, and memory efficiency when your embedding collection grows beyond millions of vectors.

Tags: Vector Databases, Performance, Pinecone, HNSW

Vector databases are critical infrastructure for modern AI applications. But as your embedding collection scales from thousands to millions (or billions), performance degrades fast without the right optimizations.

This post shares practical strategies for keeping vector search fast and cost-effective at scale.

The Performance Challenge

When building our RAG system, we hit performance walls at predictable milestones:

  • 10K vectors: Sub-50ms queries ✅
  • 100K vectors: 100-200ms queries ⚠️
  • 1M+ vectors: 500ms+ queries ❌

Users expect instant results. Anything >200ms feels slow.

Understanding Vector Search Algorithms

Most vector databases use HNSW (Hierarchical Navigable Small World) indexing. Think of it as a multi-layer graph where:

  • Top layer: Sparse, long-distance connections
  • Bottom layer: Dense, local connections

Query flow (a simplified code sketch follows the list):

  1. Start at top layer entry point
  2. Greedily follow edges toward query vector
  3. Descend to next layer
  4. Repeat until bottom layer
  5. Return nearest neighbors
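
To make the descent concrete, here is a simplified sketch of the greedy, layer-by-layer walk. It is illustrative only: the graph structure (a list of per-layer adjacency dicts) and the vectors mapping are assumptions for the example, and a real HNSW search keeps a beam of ef candidates per layer rather than a single current node.

import numpy as np

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def greedy_step(graph, vectors, query, start, layer):
    """Follow edges on one layer while they move us closer to the query."""
    current, improved = start, True
    while improved:
        improved = False
        for neighbor in graph[layer].get(current, []):
            if cosine_distance(query, vectors[neighbor]) < cosine_distance(query, vectors[current]):
                current, improved = neighbor, True
    return current

def hnsw_descend(graph, vectors, entry_point, query):
    """Start at the top (sparse) layer and descend to layer 0 (dense)."""
    current = entry_point
    for layer in range(len(graph) - 1, -1, -1):
        current = greedy_step(graph, vectors, query, current, layer)
    return current  # best candidate found; real HNSW returns an ef-sized candidate set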

Key Parameters

index_config = {
    "metric": "cosine",       # Distance metric for similarity scoring
    "pods": 1,                # Number of pods backing the index
    "replicas": 1,            # Read replicas for throughput/availability
    "pod_type": "p1.x1",      # Pod size/performance class
    "index_config": {
        "m": 16,                 # Max connections per node
        "ef_construction": 100,  # Build-time search depth
    }
}

Tradeoff: Higher m = better recall but more memory. Higher ef_construction = better index quality but slower builds.

Optimization Strategy 1: Dimensionality Reduction

We reduced embedding dimensions from 1536 (OpenAI default) to 768:

from sklearn.decomposition import PCA

# Fit on sample data
pca = PCA(n_components=768)
pca.fit(sample_embeddings)

# Transform all embeddings
reduced = pca.transform(embeddings)

Results:

  • 50% memory savings
  • 35% faster queries
  • 2% recall drop (acceptable tradeoff)
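
One detail the snippet above leaves implicit: the fitted PCA has to be persisted and applied to every query embedding at search time, or queries and indexed vectors end up in different spaces. A minimal sketch, assuming joblib for persistence (the file name and query_embedding are placeholders):

import joblib

# Persist the fitted projection alongside the index
joblib.dump(pca, "pca_1536_to_768.joblib")

# At query time, project the query into the same 768-dim space
pca = joblib.load("pca_1536_to_768.joblib")
reduced_query = pca.transform([query_embedding])[0]
results = index.query(vector=reduced_query.tolist(), top_k=10)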

Optimization Strategy 2: Metadata Filtering

Filtering after retrieval is inefficient. Push filters to the database:

# Bad: Retrieve 100, then filter down to 10 client-side
results = index.query(vector=vector, top_k=100)
filtered = [m for m in results.matches if m.metadata["category"] == "docs"]

# Good: Filter during query
results = index.query(
    vector=vector,
    top_k=10,
    filter={"category": "docs"}
)

Impact: 10x reduction in data transfer and post-processing.

Optimization Strategy 3: Namespace Isolation

Pinecone supports namespaces for multi-tenant isolation:

# Separate namespaces per tenant
index.upsert(
    vectors=embeddings,
    namespace=f"tenant_{tenant_id}"
)

# Query only tenant's data
results = index.query(
    vector=query_embedding,
    namespace=f"tenant_{tenant_id}",
    top_k=5
)

Benefit: Queries search only the relevant subset, dramatically improving speed for multi-tenant apps.

Optimization Strategy 4: Hybrid Indexes

Store vectors in tiers based on access patterns:

  • Hot tier: Recent/frequent queries → High-performance pods
  • Warm tier: Occasional access → Standard pods
  • Cold tier: Archive → Cheaper storage, lazy load

def route_query(query_vector, recency: str):
    if recency == "recent":
        return hot_index.query(vector=query_vector, top_k=10)
    elif recency == "last_month":
        return warm_index.query(vector=query_vector, top_k=10)
    else:
        return cold_index.query(vector=query_vector, top_k=10)

Savings: 60% cost reduction by matching infrastructure to usage patterns.
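
For completeness, here is a rough sketch of the demotion job that keeps the hot tier small. fetch_stale_records is a placeholder for however you track vector age (for example, a timestamp column in your own metadata store); only the upsert and delete calls are standard client operations.

from datetime import datetime, timedelta, timezone

def demote_stale_vectors(hot_index, warm_index, fetch_stale_records, max_age_days=30):
    """Move vectors that haven't been touched recently from the hot index to the warm one."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = fetch_stale_records(cutoff)  # -> [{"id": ..., "values": [...], "metadata": {...}}, ...]

    if not stale:
        return 0

    # Copy to the warm tier first, then remove from the hot tier
    warm_index.upsert(vectors=stale, batch_size=100)
    hot_index.delete(ids=[record["id"] for record in stale])
    return len(stale)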

Optimization Strategy 5: Connection Pooling

Don't create a new client connection on every request:

import os

from pinecone import Pinecone
from functools import lru_cache

@lru_cache(maxsize=1)
def get_pinecone_client():
    return Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

@lru_cache(maxsize=10)
def get_index(index_name: str):
    pc = get_pinecone_client()
    return pc.Index(index_name)

# Reuse connections
index = get_index("my-index")

Impact: Eliminates 50-100ms connection overhead per request.

Optimization Strategy 6: Batch Operations

Batch upserts are an order of magnitude faster than one-at-a-time writes:

# Bad: Individual upserts
for embedding in embeddings:
    index.upsert(vectors=[embedding])

# Good: Batch upsert
index.upsert(vectors=embeddings, batch_size=100)

Throughput: 500 vectors/sec → 5000 vectors/sec
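
If your client version doesn't expose a batch_size argument, the same effect comes from chunking manually (a small sketch; 100 is just a reasonable starting chunk size):

def upsert_in_batches(index, vectors, batch_size=100):
    """Send vectors in fixed-size chunks instead of one request per vector."""
    for start in range(0, len(vectors), batch_size):
        index.upsert(vectors=vectors[start:start + batch_size])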

Optimization Strategy 7: Approximate Search Tuning

Trade perfect recall for speed:

# High recall (slower)
results = index.query(vector=vector, top_k=10, ef=200)

# Balanced (recommended)
results = index.query(vector=vector, top_k=10, ef=100)

# Fast (lower recall)
results = index.query(vector=vector, top_k=10, ef=50)

Benchmark recall vs latency for your use case.
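
One way to run that benchmark is to compare the ANN results against exact brute-force neighbors on a sample of queries, recording both recall and latency. A rough sketch, assuming the corpus and query embeddings fit in memory as NumPy arrays (the exact response shape may vary slightly by client version):

import time
import numpy as np

def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact top-k that the ANN search also returned."""
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

def benchmark(index, corpus_ids, corpus_vectors, query_vectors, k=10, ef=100):
    recalls, latencies = [], []
    # Normalize once so a dot product equals cosine similarity
    normed = corpus_vectors / np.linalg.norm(corpus_vectors, axis=1, keepdims=True)

    for q in query_vectors:
        qn = q / np.linalg.norm(q)
        exact_ids = [corpus_ids[i] for i in np.argsort(normed @ qn)[-k:]]

        start = time.perf_counter()
        response = index.query(vector=q.tolist(), top_k=k, ef=ef)
        latencies.append(time.perf_counter() - start)

        ann_ids = [match.id for match in response.matches]
        recalls.append(recall_at_k(ann_ids, exact_ids))

    return np.mean(recalls), np.percentile(latencies, 95)  # mean recall@k, P95 latency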

Monitoring & Observability

Track these metrics:

import prometheus_client as prom

query_latency = prom.Histogram(
    "vector_search_latency_seconds",
    "Vector search latency",
    buckets=[0.05, 0.1, 0.2, 0.5, 1.0]
)

query_recalls = prom.Histogram(
    "vector_search_recall",
    "Vector search recall@10",
    buckets=[0.5, 0.8, 0.9, 0.95, 0.99, 1.0]  # recall is a 0-1 value; default latency buckets fit it poorly
)
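
Recording the data is then a matter of wrapping each search call, for example (query_embedding and sampled_recall_at_10 are placeholders; recall is usually sampled from periodic offline checks rather than computed per request):

# Time the search and feed the latency histogram
with query_latency.time():
    response = index.query(vector=query_embedding, top_k=10)

# Recall comes from offline evaluation, not the live request path
query_recalls.observe(sampled_recall_at_10)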

Alert when:

  • P95 latency >200ms
  • Recall drops >5%

Cost Optimization

Vector databases are expensive. Reduce costs:

  1. Right-size pods - Start small, scale based on metrics
  2. Use serverless - Pinecone Serverless for variable workloads
  3. Compress embeddings - Quantization (int8) saves ~4x memory (sketched below)
  4. Archive old data - Move cold vectors to cheaper storage
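
For point 3, a minimal sketch of symmetric int8 quantization shows where the ~4x figure comes from (float32 is 4 bytes per dimension, int8 is 1); in practice you would usually rely on your database's built-in quantization rather than rolling your own:

import numpy as np

def quantize_int8(embeddings: np.ndarray):
    """Symmetric quantization: float32 -> int8, 4 bytes -> 1 byte per dimension."""
    scale = np.abs(embeddings).max() / 127.0
    quantized = np.round(embeddings / scale).astype(np.int8)
    return quantized, scale

def dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction; a small rounding error is the price of the savings."""
    return quantized.astype(np.float32) * scale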

Production Checklist

  • [ ] Dimension reduction evaluated
  • [ ] Metadata filters implemented
  • [ ] Namespaces for multi-tenancy
  • [ ] Connection pooling enabled
  • [ ] Batch operations for writes
  • [ ] Monitoring dashboards set up
  • [ ] Cost alerts configured
  • [ ] Backup/restore tested

Conclusion

Vector database optimization is iterative:

  1. Measure current performance
  2. Identify bottleneck (indexing? queries? memory?)
  3. Apply targeted optimization
  4. Benchmark improvement
  5. Repeat

Don't over-optimize prematurely. Build, measure, improve.



Discuss: What optimization had the biggest impact for you? Let's chat on LinkedIn