RAG (Retrieval-Augmented Generation) has become the default pattern for enterprise LLM applications, and for good reason: it dramatically reduces hallucinations, lets the model cite sources, and allows knowledge to be updated without retraining.
The prototype is easy. The production system is not.
The Production RAG Problem Space
When you move from notebook to production, you encounter problems that didn't exist at demo scale:
- Retrieval latency compounds with LLM latency. At p99, this matters.
- Index freshness — how quickly do document updates propagate to retrievable state?
- Multi-tenancy — can users only retrieve documents they are authorised to see?
- Scale — what happens to retrieval quality when your corpus grows 100×?
- Cost — embedding every document is cheap once. Re-embedding on updates is not.
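The re-embedding cost in that last bullet is easy to underestimate. A back-of-envelope sketch — the corpus size, churn rate, and per-token price below are hypothetical assumptions for illustration, not quoted vendor rates:

```python
def monthly_reembedding_cost(
    corpus_tokens: int,
    monthly_churn: float,
    price_per_million_tokens: float,
) -> float:
    """Estimate the monthly cost of re-embedding updated documents.

    corpus_tokens: total tokens across the corpus
    monthly_churn: fraction of the corpus updated per month (0..1)
    price_per_million_tokens: embedding API price (hypothetical)
    """
    updated_tokens = corpus_tokens * monthly_churn
    return updated_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical numbers: a 2B-token corpus with 10% monthly churn
# at $0.10 per million tokens.
cost = monthly_reembedding_cost(2_000_000_000, 0.10, 0.10)
print(f"${cost:,.2f}/month")  # → $20.00/month
```

The point is the multiplication, not the dollar figure: churn scales the cost linearly forever, while the initial embedding is a one-time spend.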
The Reference Architecture
Ingestion Pipeline
S3 (raw documents) →
Lambda (parse + chunk) →
SQS (ingestion queue) →
ECS (embedding workers) →
OpenSearch (vector + keyword index) +
RDS (metadata + access control)
Key decisions:
- Chunk size: 512 tokens with 20% overlap for most document types
- Embedding model: Cohere embed-english-v3 for cost/quality balance at scale; OpenAI text-embedding-3-large for quality-critical use cases
- Index: OpenSearch Serverless for operational simplicity; dedicated cluster for >50M vectors
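The 512-token / 20%-overlap decision comes down to a sliding-window chunker. A minimal sketch — this version treats whitespace-split words as "tokens" for illustration; a production chunker would count with the embedding model's own tokenizer:

```python
def chunk_tokens(
    tokens: list[str], size: int = 512, overlap_frac: float = 0.20
) -> list[list[str]]:
    """Sliding-window chunking: fixed-size windows with fractional overlap."""
    if size <= 0 or not 0 <= overlap_frac < 1:
        raise ValueError("size must be positive and overlap_frac in [0, 1)")
    step = max(1, int(size * (1 - overlap_frac)))  # advance ~80% of a window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window reached the end of the document
    return chunks

words = "the quick brown fox".split() * 300  # 1,200 pseudo-tokens
chunks = chunk_tokens(words)
# windows start at 0, 409, 818 → 3 chunks; adjacent chunks share ~103 tokens
```

The overlap means a sentence that straddles a chunk boundary is fully contained in at least one chunk, at the cost of indexing roughly 25% more tokens.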
Retrieval Pipeline
User query →
Query expansion (rewrite + HyDE) →
Hybrid search (dense + sparse) →
Re-ranking (Cohere Rerank or cross-encoder) →
Access control filter →
Context assembly
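The hybrid step has to merge the dense and sparse result lists into one ranking. A common, tuning-free way to do this is Reciprocal Rank Fusion (RRF); a minimal sketch over two ranked lists of document IDs, using the conventional k=60 constant:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]    # from the vector index
sparse = ["d1", "d9", "d3"]   # from BM25 / keyword search
fused = rrf_fuse([dense, sparse])  # d1 ranks high in both lists, so it wins
```

RRF only looks at ranks, never raw scores, which is exactly what you want when fusing a cosine-similarity list with a BM25 list whose scores live on incompatible scales.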
The critical step most teams skip: re-ranking. Dense retrieval recall is high; precision is not. A re-ranker reduces your top-100 retrieved chunks to a top-5 with significantly higher precision, and it costs almost nothing at the scale of a user query.
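The funnel itself is simple: score every retrieved chunk against the query and keep the best few. The lexical-overlap scorer below is a deliberately crude stand-in for a real cross-encoder or hosted re-rank API — it only shows the shape of the top-100 → top-5 reduction:

```python
def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Score each chunk against the query and keep the top_k.

    score() is a toy Jaccard-overlap proxy; in production this would be
    a cross-encoder model or a hosted re-rank endpoint.
    """
    q_terms = set(query.lower().split())

    def score(chunk: str) -> float:
        c_terms = set(chunk.lower().split())
        return len(q_terms & c_terms) / (len(q_terms | c_terms) or 1)

    return sorted(chunks, key=score, reverse=True)[:top_k]

candidates = [
    "billing policy for enterprise accounts",
    "how to reset your password",
    "enterprise billing invoices and payment terms",
] + ["unrelated boilerplate"] * 97  # pretend this is a top-100 retrieval
top = rerank("enterprise billing invoices", candidates, top_k=2)
```

Whatever the scorer, the structure is the same: the retriever optimises recall over the whole corpus, and the re-ranker spends its much more expensive per-pair computation only on the 100 survivors.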
Access Control
Multi-tenant RAG requires document-level access control enforced at query time, not at ingestion time. We implement this as a metadata filter applied post-retrieval — the vector search returns candidates, and a hard filter removes documents the requesting user cannot access.
Do not rely on the LLM to respect access boundaries. That is not a security control.
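A sketch of that post-retrieval filter, under an assumed metadata shape: each candidate carries the ACL principals recorded at ingestion (the field names here are illustrative, not from any particular store), and the filter is a hard set-intersection check before context assembly:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    text: str
    allowed_principals: frozenset[str]  # recorded at ingestion time
    score: float = 0.0

def enforce_acl(
    candidates: list[Candidate], user_principals: set[str]
) -> list[Candidate]:
    """Hard post-retrieval filter: drop anything the user cannot access.

    Runs after vector search and before context assembly; the LLM never
    sees a document that fails this check.
    """
    return [c for c in candidates if c.allowed_principals & user_principals]

docs = [
    Candidate("d1", "public handbook", frozenset({"group:everyone"})),
    Candidate("d2", "finance forecast", frozenset({"group:finance"})),
]
visible = enforce_acl(docs, user_principals={"group:everyone", "group:eng"})
# only d1 survives; d2 is removed even if it scored highest
```

Because filtering happens after retrieval, over-retrieve (e.g. top-100 rather than top-10) so a heavily restricted user still ends up with enough candidates once the filter has run.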
Performance Benchmarks
On AWS with the architecture above, at 1,000 concurrent users:
| Metric | Value |
|---|---|
| P50 retrieval latency | 45ms |
| P99 retrieval latency | 180ms |
| Index update lag | under 60 seconds |
| Cost per 1,000 queries | $0.12 |
When to Deviate
Use Bedrock Knowledge Bases when:
- You are already all-in on AWS managed services
- Your team lacks ML engineering capacity
- You can accept slightly higher latency and cost for operational simplicity
Use Pinecone or Qdrant instead of OpenSearch when:
- Your team has strong vector database expertise
- You need sub-10ms retrieval at extreme scale (>500M vectors)
Need help designing your RAG architecture? Our engineering team delivers production-ready RAG systems.