Optimizing Vector Database Performance at Scale
Practical strategies for improving vector search latency, indexing throughput, and memory efficiency when your embedding collection grows beyond millions of vectors.
Vector databases are critical infrastructure for modern AI applications. But as your embedding collection scales from thousands to millions (or billions), performance degrades fast without the right optimizations.
This post shares practical strategies for keeping vector search fast and cost-effective at scale.
The Performance Challenge
When building our RAG system, we hit performance walls at predictable milestones:
- 10K vectors: Sub-50ms queries ✅
- 100K vectors: 100-200ms queries ⚠️
- 1M+ vectors: 500ms+ queries ❌
Users expect instant results. Anything >200ms feels slow.
Understanding Vector Search Algorithms
Most vector databases use HNSW (Hierarchical Navigable Small World) indexing. Think of it as a multi-layer graph where:
- Top layer: Sparse, long-distance connections
- Bottom layer: Dense, local connections
Query flow (sketched in code below):
- Start at top layer entry point
- Greedily follow edges toward query vector
- Descend to next layer
- Repeat until bottom layer
- Return nearest neighbors
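To make the descent concrete, here is a minimal greedy-search sketch in plain Python. The layers, vectors, and hnsw_descend names are illustrative, not any particular library's API; a real HNSW implementation also keeps a candidate heap whose size is controlled by the ef parameter rather than following a single path.

import numpy as np

def hnsw_descend(layers, vectors, entry_id, query):
    # layers: adjacency dicts {node_id: [neighbor_ids]}, top (sparse) layer first
    # vectors: {node_id: np.ndarray}; query: np.ndarray
    dist = lambda node: np.linalg.norm(vectors[node] - query)
    current = entry_id
    for graph in layers:              # descend one layer at a time
        moved = True
        while moved:                  # greedy walk toward the query vector
            moved = False
            for neighbor in graph.get(current, []):
                if dist(neighbor) < dist(current):
                    current, moved = neighbor, True
    return current                    # closest node reached at the bottom layer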
Key Parameters
index_config = {
    "metric": "cosine",
    "pods": 1,
    "replicas": 1,
    "pod_type": "p1.x1",
    "index_config": {
        "m": 16,                 # Max connections per node
        "ef_construction": 100,  # Build-time search depth
    },
}
Tradeoff: Higher m = better recall but more memory. Higher ef_construction = better index quality but slower builds.
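To reason about the memory side of that tradeoff, here is a back-of-envelope estimate. It is my own approximation, not a vendor formula: float32 vector storage plus roughly m * 2 graph links per node at the base layer.

def hnsw_memory_estimate_gb(num_vectors: int, dim: int, m: int = 16) -> float:
    # Rough approximation: float32 vectors plus ~m*2 links of ~8 bytes each per node;
    # real indexes add per-layer and metadata overhead on top of this.
    vector_bytes = num_vectors * dim * 4
    graph_bytes = num_vectors * m * 2 * 8
    return (vector_bytes + graph_bytes) / 1e9

# 1M vectors at 768 dims with m=16 is roughly 3.3 GB before replication
print(hnsw_memory_estimate_gb(1_000_000, 768, m=16))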
Optimization Strategy 1: Dimensionality Reduction
We reduced embedding dimensions from 1536 (OpenAI default) to 768:
from sklearn.decomposition import PCA
# Fit on sample data
pca = PCA(n_components=768)
pca.fit(sample_embeddings)
# Transform all embeddings
reduced = pca.transform(embeddings)
Results:
- 50% memory savings
- 35% faster queries
- 2% recall drop (acceptable tradeoff)
Optimization Strategy 2: Metadata Filtering
Filtering after retrieval is inefficient. Push filters to the database:
# Bad: retrieve 100, then filter down to 10 in application code
results = index.query(vector=vector, top_k=100, include_metadata=True)
filtered = [m for m in results.matches if m.metadata["category"] == "docs"]

# Good: filter during the query
results = index.query(
    vector=vector,
    top_k=10,
    filter={"category": "docs"}
)
Impact: 10x reduction in data transfer and post-processing.
Optimization Strategy 3: Namespace Isolation
Pinecone supports namespaces for multi-tenant isolation:
# Separate namespace per tenant
index.upsert(
    vectors=embeddings,
    namespace=f"tenant_{tenant_id}"
)

# Query only the tenant's data
results = index.query(
    vector=query_embedding,
    namespace=f"tenant_{tenant_id}",
    top_k=5
)
Benefit: Queries only search relevant subset, dramatically improving speed for multi-tenant apps.
Optimization Strategy 4: Hybrid Indexes
Store vectors in tiers based on access patterns:
- Hot tier: Recent/frequent queries → High-performance pods
- Warm tier: Occasional access → Standard pods
- Cold tier: Archive → Cheaper storage, lazy load
def route_query(query_vector: list[float], recency: str):
    if recency == "recent":
        return hot_index.query(vector=query_vector, top_k=10)
    elif recency == "last_month":
        return warm_index.query(vector=query_vector, top_k=10)
    else:
        return cold_index.query(vector=query_vector, top_k=10)
Savings: 60% cost reduction by matching infrastructure to usage patterns.
Optimization Strategy 5: Connection Pooling
Don't create new connections per request:
import os
from functools import lru_cache

from pinecone import Pinecone

@lru_cache(maxsize=1)
def get_pinecone_client():
    return Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

@lru_cache(maxsize=10)
def get_index(index_name: str):
    pc = get_pinecone_client()
    return pc.Index(index_name)

# Reuse the cached client and index handle across requests
index = get_index("my-index")
Impact: Eliminates 50-100ms connection overhead per request.
Optimization Strategy 6: Batch Operations
Batched upserts are an order of magnitude faster than individual ones:
# Bad: individual upserts (one network round trip per vector)
for embedding in embeddings:
    index.upsert(vectors=[embedding])

# Good: batch upsert (the Python client chunks the list for you)
index.upsert(vectors=embeddings, batch_size=100)
Throughput: 500 vectors/sec → 5000 vectors/sec
Optimization Strategy 7: Approximate Search Tuning
Trade perfect recall for speed by tuning the search-time depth parameter, usually called ef (exposed as hnsw_ef in Qdrant, hnsw.ef_search in pgvector, and efSearch in FAISS; Pinecone manages it internally):

# Illustrative calls, assuming a client that exposes a search-time ef parameter
# High recall (slower)
results = index.query(vector=vector, top_k=10, ef=200)

# Balanced (a good starting point)
results = index.query(vector=vector, top_k=10, ef=100)

# Fast (lower recall)
results = index.query(vector=vector, top_k=10, ef=50)
Benchmark recall vs latency for your use case.
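One way to run that benchmark is to compare the index's results against exact brute-force neighbors over a sample of queries. A minimal sketch follows; the helper names and the NumPy ground truth are my own illustration, and the IDs here are assumed to be positional indices into the sample array.

import numpy as np

def exact_top_k(vectors: np.ndarray, query: np.ndarray, k: int = 10) -> set:
    # Brute-force ground truth: indices of the k nearest vectors by L2 distance
    dists = np.linalg.norm(vectors - query, axis=1)
    return set(np.argsort(dists)[:k].tolist())

def recall_at_k(approx_ids: list, exact_ids: set, k: int = 10) -> float:
    # Fraction of the true nearest neighbors the ANN index actually returned
    return len(set(approx_ids[:k]) & exact_ids) / k

Run this over a few hundred sample queries at each ef setting and plot recall@10 against p95 latency to pick your operating point.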
Monitoring & Observability
Track these metrics:
import prometheus_client as prom

query_latency = prom.Histogram(
    "vector_search_latency_seconds",
    "Vector search latency",
    buckets=[0.05, 0.1, 0.2, 0.5, 1.0]
)

query_recall = prom.Histogram(
    "vector_search_recall",
    "Vector search recall@10",
    buckets=[0.5, 0.8, 0.9, 0.95, 0.99, 1.0]
)
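A thin wrapper can record latency on every search using the histogram above; this is a sketch, with timed_query as a hypothetical helper name:

import time

def timed_query(index, vector, top_k=10):
    # Record end-to-end search latency for every query
    start = time.perf_counter()
    results = index.query(vector=vector, top_k=top_k)
    query_latency.observe(time.perf_counter() - start)
    return results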
Alert when:
- P95 latency >200ms
- Recall drops >5%
Cost Optimization
Vector databases are expensive. Reduce costs:
- Right-size pods - Start small, scale based on metrics
- Use serverless - Pinecone Serverless for variable workloads
- Compress embeddings - Quantization (int8) saves 4x memory (see the sketch after this list)
- Archive old data - Move cold vectors to cheaper storage
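Here is a minimal sketch of int8 scalar quantization. It is my own illustration; managed databases and libraries such as FAISS implement this for you, usually keeping the original float32 vectors around for re-ranking.

import numpy as np

def quantize_int8(embeddings: np.ndarray):
    # Symmetric scalar quantization: float32 -> int8, a 4x memory reduction.
    # One scale per dimension; clamp to avoid division by zero on constant dims.
    scale = np.maximum(np.abs(embeddings).max(axis=0) / 127.0, 1e-8)
    quantized = np.clip(np.round(embeddings / scale), -127, 127).astype(np.int8)
    return quantized, scale.astype(np.float32)

def dequantize(quantized: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Approximate reconstruction for distance calculations or re-ranking
    return quantized.astype(np.float32) * scale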
Production Checklist
- [ ] Dimension reduction evaluated
- [ ] Metadata filters implemented
- [ ] Namespaces for multi-tenancy
- [ ] Connection pooling enabled
- [ ] Batch operations for writes
- [ ] Monitoring dashboards set up
- [ ] Cost alerts configured
- [ ] Backup/restore tested
Conclusion
Vector database optimization is iterative:
- Measure current performance
- Identify bottleneck (indexing? queries? memory?)
- Apply targeted optimization
- Benchmark improvement
- Repeat
Don't over-optimize prematurely. Build, measure, improve.
Discuss: What optimization had the biggest impact for you? Let's chat on LinkedIn