RAG (Retrieval-Augmented Generation) has become the default pattern for enterprise LLM applications, and for good reason: it dramatically reduces hallucinations, lets the model cite sources, and allows knowledge to be updated without retraining.
The prototype is easy. The production system is not.
The Production RAG Problem Space
When you move from notebook to production, you encounter problems that didn't exist at demo scale:
- Retrieval latency compounds with LLM latency. At p99, this matters.
- Index freshness — how quickly do document updates propagate to retrievable state?
- Multi-tenancy — can users only retrieve documents they are authorised to see?
- Scale — what happens to retrieval quality when your corpus grows 100×?
- Cost — embedding every document is cheap once. Re-embedding on updates is not.
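The re-embedding cost in that last bullet is easy to underestimate. A back-of-envelope sketch — the corpus size, churn rate, and per-token price below are hypothetical assumptions for illustration, not quoted vendor rates:

```python
def monthly_reembedding_cost(
    corpus_tokens: int,
    monthly_churn: float,
    price_per_million_tokens: float,
) -> float:
    """Estimate the monthly cost of re-embedding updated documents.

    corpus_tokens: total tokens across the corpus
    monthly_churn: fraction of the corpus updated per month (0..1)
    price_per_million_tokens: embedding API price (hypothetical)
    """
    updated_tokens = corpus_tokens * monthly_churn
    return updated_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical numbers: a 2B-token corpus with 10% monthly churn
# at $0.10 per million tokens.
cost = monthly_reembedding_cost(2_000_000_000, 0.10, 0.10)
print(f"${cost:,.2f}/month")  # → $20.00/month
```

The point is the multiplication, not the dollar figure: churn scales the cost linearly forever, while the initial embedding is a one-time spend.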
The Reference Architecture
Ingestion Pipeline
S3 (raw documents) →
Lambda (parse + chunk) →
SQS (ingestion queue) →
ECS (embedding workers) →
OpenSearch (vector + keyword index) +
RDS (metadata + access control)
Key decisions:
- Chunk size: 512 tokens with 20% overlap for most document types
- Embedding model: Cohere embed-english-v3 for cost/quality balance at scale; OpenAI text-embedding-3-large for quality-critical use cases
- Index: OpenSearch Serverless for operational simplicity; dedicated cluster for >50M vectors
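The 512-token / 20%-overlap decision comes down to a sliding-window chunker. A minimal sketch — this version treats whitespace-split words as "tokens" for illustration; a production chunker would count with the embedding model's own tokenizer:

```python
def chunk_tokens(
    tokens: list[str], size: int = 512, overlap_frac: float = 0.20
) -> list[list[str]]:
    """Sliding-window chunking: fixed-size windows with fractional overlap."""
    if size <= 0 or not 0 <= overlap_frac < 1:
        raise ValueError("size must be positive and overlap_frac in [0, 1)")
    step = max(1, int(size * (1 - overlap_frac)))  # advance ~80% of a window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window reached the end of the document
    return chunks

words = "the quick brown fox".split() * 300  # 1,200 pseudo-tokens
chunks = chunk_tokens(words)
# windows start at 0, 409, 818 → 3 chunks; adjacent chunks share ~103 tokens
```

The overlap means a sentence that straddles a chunk boundary is fully contained in at least one chunk, at the cost of indexing roughly 25% more tokens.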
Retrieval Pipeline
User query →
Query expansion (rewrite + HyDE) →
Hybrid search (dense + sparse) →
Re-ranking (Cohere Rerank or cross-encoder) →
Access control filter →
Context assembly
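The hybrid step has to merge the dense and sparse result lists into one ranking. A common, tuning-free way to do this is Reciprocal Rank Fusion (RRF); a minimal sketch over two ranked lists of document IDs, using the conventional k=60 constant:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]    # from the vector index
sparse = ["d1", "d9", "d3"]   # from BM25 / keyword search
fused = rrf_fuse([dense, sparse])  # d1 ranks high in both lists, so it wins
```

RRF only looks at ranks, never raw scores, which is exactly what you want when fusing a cosine-similarity list with a BM25 list whose scores live on incompatible scales.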
The critical step most teams skip: re-ranking. Dense retrieval recall is high; precision is not. A re-ranker reduces your top-100 retrieved chunks to a top-5 with significantly higher precision, and it costs almost nothing at the scale of a user query.
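The funnel itself is simple: score every retrieved chunk against the query and keep the best few. The lexical-overlap scorer below is a deliberately crude stand-in for a real cross-encoder or hosted re-rank API — it only shows the shape of the top-100 → top-5 reduction:

```python
def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Score each chunk against the query and keep the top_k.

    score() is a toy Jaccard-overlap proxy; in production this would be
    a cross-encoder model or a hosted re-rank endpoint.
    """
    q_terms = set(query.lower().split())

    def score(chunk: str) -> float:
        c_terms = set(chunk.lower().split())
        return len(q_terms & c_terms) / (len(q_terms | c_terms) or 1)

    return sorted(chunks, key=score, reverse=True)[:top_k]

candidates = [
    "billing policy for enterprise accounts",
    "how to reset your password",
    "enterprise billing invoices and payment terms",
] + ["unrelated boilerplate"] * 97  # pretend this is a top-100 retrieval
top = rerank("enterprise billing invoices", candidates, top_k=2)
```

Whatever the scorer, the structure is the same: the retriever optimises recall over the whole corpus, and the re-ranker spends its much more expensive per-pair computation only on the 100 survivors.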
Access Control
Multi-tenant RAG requires document-level access control enforced at query time, not at ingestion time. We implement this as a metadata filter applied post-retrieval — the vector search returns candidates, and a hard filter removes documents the requesting user cannot access.
Do not rely on the LLM to respect access boundaries. That is not a security control.
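A sketch of that post-retrieval filter, under an assumed metadata shape: each candidate carries the ACL principals recorded at ingestion (the field names here are illustrative, not from any particular store), and the filter is a hard set-intersection check before context assembly:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    text: str
    allowed_principals: frozenset[str]  # recorded at ingestion time
    score: float = 0.0

def enforce_acl(
    candidates: list[Candidate], user_principals: set[str]
) -> list[Candidate]:
    """Hard post-retrieval filter: drop anything the user cannot access.

    Runs after vector search and before context assembly; the LLM never
    sees a document that fails this check.
    """
    return [c for c in candidates if c.allowed_principals & user_principals]

docs = [
    Candidate("d1", "public handbook", frozenset({"group:everyone"})),
    Candidate("d2", "finance forecast", frozenset({"group:finance"})),
]
visible = enforce_acl(docs, user_principals={"group:everyone", "group:eng"})
# only d1 survives; d2 is removed even if it scored highest
```

Because filtering happens after retrieval, over-retrieve (e.g. top-100 rather than top-10) so a heavily restricted user still ends up with enough candidates once the filter has run.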
Performance Benchmarks
On AWS with the architecture above, at 1,000 concurrent users:
| Metric | Value |
|---|---|
| P50 retrieval latency | 45ms |
| P99 retrieval latency | 180ms |
| Index update lag | under 60 seconds |
| Cost per 1,000 queries | $0.12 |
When to Deviate
Use Bedrock Knowledge Bases when:
- You are already all-in on AWS managed services
- Your team lacks ML engineering capacity
- You can accept slightly higher latency and cost for operational simplicity
Use Pinecone or Qdrant instead of OpenSearch when:
- Your team has strong vector database expertise
- You need sub-10ms retrieval at extreme scale (>500M vectors)
Need help designing your RAG architecture? Our engineering team delivers production-ready RAG systems.