Production RAG Architecture with Lambda Scaling
📋 Table of Contents
- RAG Architecture Overview
- Streaming Responses
- Lambda Scaling Fundamentals
- SQS, Retry Logic & DLQ
- Lambda Configuration for Production
- Reserved vs Provisioned Concurrency
- EventBridge Scheduling
- CloudFormation Snippets
- Complete Architecture Diagram
- Quick Reference
1. RAG Architecture Overview
The Problem
Design a scalable RAG pipeline with:
- Snowflake as vector database
- Bedrock (Anthropic) for LLM
- High throughput + Low latency
The Solution Architecture
```
User Query
         ↓
┌─────────────────┐
│   API Gateway   │  (Entry point)
└────────┬────────┘
         ↓
┌─────────────────┐
│    SQS Queue    │  (Buffer & resilience)
└────────┬────────┘
         ↓
┌─────────────────┐
│     Lambda      │ → Embed query (Bedrock)
│  (RAG Worker)   │ → Vector search (Snowflake)
│                 │ → Generate answer (Bedrock LLM)
└────────┬────────┘
         ↓
Response (Streaming)
```
Latency Breakdown (Typical RAG)
| Step | Typical Latency |
|---|---|
| Query Embedding | 50-100ms |
| Snowflake Vector Search | 50-150ms |
| LLM Generation | 500ms-2000ms ← Bottleneck |
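The three steps above can be sketched as one worker function with injectable clients, which also makes the pipeline unit-testable without AWS. The `embed_fn` / `search_fn` / `generate_fn` names are placeholders, not real Bedrock or Snowflake client calls:

```python
# Sketch of the worker's three steps with injectable clients so the pipeline
# can be tested without AWS. embed_fn / search_fn / generate_fn are
# placeholders, not real Bedrock or Snowflake APIs.

def rag_answer(query, embed_fn, search_fn, generate_fn, top_k=5):
    """Embed the query, retrieve context, then generate an answer."""
    vector = embed_fn(query)           # ~50-100 ms (Bedrock embedding)
    chunks = search_fn(vector, top_k)  # ~50-150 ms (Snowflake vector search)
    context = "\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate_fn(prompt)         # ~500-2000 ms (Bedrock LLM -- the bottleneck)

# Stub clients stand in for Bedrock/Snowflake:
answer = rag_answer(
    "What is RAG?",
    embed_fn=lambda q: [0.1, 0.2],
    search_fn=lambda v, k: ["RAG combines retrieval with generation."],
    generate_fn=lambda prompt: "RAG retrieves context, then generates.",
)
print(answer)
```

Because the bottleneck step sits behind `generate_fn`, swapping it for a streaming variant (next section) changes only the last line of the pipeline.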
2. Streaming Responses
The Problem
Waiting 2 seconds for the full LLM response feels slow.
The Solution
Send tokens as they're generated (like ChatGPT typing effect).
Before (blocking):
LLM Lambda → Wait 2s → Return full response
After (streaming):
LLM Lambda → Token 1 (50ms) → Token 2 (50ms) → ...
User sees: "The answer is..." (immediate feedback!)
Lambda Streaming Support
| Runtime | Streaming Support |
|---|---|
| Node.js | ✅ Native support |
| Python | ⚠️ Requires Lambda Web Adapter |
💡 Note: Lambda streaming is natively supported in Node.js. Python requires the Lambda Web Adapter, which adds complexity. TypeScript is often preferred for simpler implementation and better ecosystem support for streaming.
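The typing effect is easy to picture as a Python generator: the caller renders tokens as they arrive instead of waiting ~2 s for the whole answer. In this sketch, `simulate_llm` is a stand-in for Bedrock's streaming API; in a real Python Lambda the writes would go through the Lambda Web Adapter's response stream:

```python
# Token streaming as a generator. simulate_llm stands in for Bedrock's
# streaming API; in a Python Lambda the per-token writes would flow
# through the Lambda Web Adapter's response stream.

def simulate_llm(answer, chunk_size=4):
    """Yield the answer a few characters at a time, like token streaming."""
    for i in range(0, len(answer), chunk_size):
        yield answer[i:i + chunk_size]

def stream_response(token_iter):
    """Forward each token to the client immediately; return the full text."""
    parts = []
    for token in token_iter:
        parts.append(token)  # in a real handler: write token to the response stream
    return "".join(parts)

full = stream_response(simulate_llm("The answer is 42."))
print(full)
```

The total latency is unchanged; what improves is time-to-first-token, which is what users perceive.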
3. Lambda Scaling Fundamentals
Default Behavior
- Lambda auto-scales to match incoming requests
- Default limit: 1,000 concurrent executions per account, per region
- This is a soft limit; you can request an increase from AWS
Why Limit Concurrency?
💡 Sticky Analogy: Highway & Toll Booth 🚗
- Highway (Lambda): Can handle 10,000 cars
- Toll booth (Snowflake/Bedrock): Only 100 cars can pass at once
If you send 10,000 cars, 9,900 crash at the toll booth! Setting concurrency = 100 protects downstream services.
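A minimal in-process sketch of the same idea, using an `asyncio.Semaphore` as the toll booth. In production you would cap Lambda concurrency itself; this only illustrates the principle:

```python
import asyncio

# In-process analogue of the toll booth: a semaphore caps how many calls
# can hit the downstream service at once. Illustration only -- in
# production, set Lambda reserved concurrency instead.

MAX_DOWNSTREAM = 100  # "toll booth" capacity

async def call_downstream(sem, in_flight, peak):
    async with sem:                # wait for a free slot
        in_flight[0] += 1
        peak[0] = max(peak[0], in_flight[0])
        await asyncio.sleep(0)     # stand-in for the real Snowflake/Bedrock call
        in_flight[0] -= 1

async def send_traffic(n_requests=1000):
    sem = asyncio.Semaphore(MAX_DOWNSTREAM)
    in_flight, peak = [0], [0]
    await asyncio.gather(*(call_downstream(sem, in_flight, peak)
                           for _ in range(n_requests)))
    return peak[0]

peak = asyncio.run(send_traffic())
print(f"peak in-flight: {peak}")  # never exceeds MAX_DOWNSTREAM
```

No matter how many "cars" arrive, at most 100 are past the booth at any instant; the rest wait instead of crashing.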
When to Add SQS Queue?
| Scenario | Use SQS? |
|---|---|
| Sync responses needed (chat) | ❌ Direct Lambda |
| Traffic spikes > Lambda limit | ✅ Buffer with SQS |
| Retry logic needed | ✅ DLQ support |
| Order matters | ✅ FIFO queue |
💡 Sticky Analogy: Restaurant 🍽️
- Lambda Alone: Waiter takes order, goes to kitchen, waits, brings food (sync)
- Lambda + SQS: Waiter takes order, puts slip on queue, kitchen picks up when ready (async)
4. SQS, Retry Logic & DLQ
Retry Logic 🔄
The Problem: Things fail temporarily (Snowflake busy, Bedrock timeout, network hiccup)
💡 Sticky Analogy: Calling a Friend 📱 You call your friend, they don't pick up. Do you:
- A) Give up immediately? (no retry)
- B) Try 2-3 more times? (retry logic) ✅
SQS automatically retries failed messages (configurable: 1-10 times).
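The same retry-with-backoff behavior SQS gives you per message can be sketched in plain Python. Here `max_attempts` mirrors the queue's `maxReceiveCount`; the function names and delays are illustrative:

```python
import time

# Plain-Python sketch of retry-with-backoff. SQS does this per message
# for you; max_attempts mirrors the queue's maxReceiveCount setting.

def with_retries(fn, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries -- this is where a message heads to the DLQ
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off: 10 ms, 20 ms, ...

# A call that fails twice (e.g. Snowflake busy), then succeeds:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("downstream busy")
    return "ok"

print(with_retries(flaky))  # "ok" on the 3rd attempt
```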
Dead Letter Queue (DLQ) ☠️
The Problem: What if ALL retries fail?
💡 Sticky Analogy: Lost Mail 📬
- Without DLQ: Undeliverable mail → thrown away
- With DLQ: Undeliverable mail → "Return to Sender" pile
DLQ lets you:
- Debug why things failed
- Reprocess failed messages after fixing bugs
- Alert your team ("50 messages failed today!")
Architecture Flow
```
1000 requests hit API Gateway
   ↓
All 1000 go into SQS queue
   ↓
Lambda processes them (auto-scales!)
   ↓
990 succeed ✅
   ↓
10 fail → SQS retries automatically
   ↓
7 succeed on retry ✅
   ↓
3 still fail → Go to DLQ
   ↓
Alert: "3 messages in DLQ!"
```
Visibility Timeout ⏱️
What it is: How long SQS waits before assuming Lambda failed.
💡 Sticky Analogy: Library Book Checkout 📚
- SQS Message = A book
- Lambda taking message = You check out the book
- Visibility Timeout = 14-day loan period
- If you return it = Message deleted ✅
- If you don't = Book goes back on shelf for someone else
Critical Rule:
Lambda Timeout < Visibility Timeout
| Lambda Duration | Visibility Timeout | Result |
|---|---|---|
| 30 seconds | 60 seconds | ✅ Works fine |
| 60 seconds | 30 seconds | ❌ Duplicate processing! |
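The critical rule is simple enough to encode as a deploy-time config check (a minimal sketch):

```python
# The critical rule as a deploy-time config check (minimal sketch).

def visibility_ok(lambda_timeout_s, visibility_timeout_s):
    """True if a message cannot reappear while Lambda is still working on it."""
    return lambda_timeout_s < visibility_timeout_s

assert visibility_ok(30, 60)      # works fine
assert not visibility_ok(60, 30)  # duplicate processing!
print("config check passed")
```

AWS's SQS event-source guidance goes further, recommending a visibility timeout several times the function timeout to leave headroom for batching and retries.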
5. Lambda Configuration for Production
Key Settings for RAG Chat
| Setting | Recommended Value | Why |
|---|---|---|
| Batch Size | 1 | Each user gets own response |
| Concurrency | 100+ | 100 different USERS simultaneously |
| Visibility Timeout | > Lambda time | Prevent duplicates |
💡 Sticky Analogy: Customer Service Call Center ☎️
- Batch Size = 1: Each agent handles ONE customer at a time
- Concurrency = 100: You have 100 agents working simultaneously
Dynamic Scaling
Lambda auto-scales based on queue depth:
| Queue Depth | What Happens |
|---|---|
| 10 messages | ~10 Lambdas spin up |
| 1000 messages | ~100+ Lambdas spin up |
| 0 messages | Scale to 0 (no cost!) |
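A quick way to size the concurrency setting is Little's law: needed concurrency ≈ arrival rate × average duration. The numbers below are illustrative:

```python
# Back-of-envelope sizing via Little's law: concurrency needed is roughly
# arrival rate x average duration. Values are illustrative.

def required_concurrency(requests_per_second, avg_duration_s):
    return requests_per_second * avg_duration_s

# 50 req/s through a 2 s RAG pipeline keeps ~100 Lambdas busy --
# one reason the examples in this doc reserve 100.
print(required_concurrency(50, 2.0))  # 100.0
```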
6. Reserved vs Provisioned Concurrency
Unreserved (Default)
All Lambdas share a pool (default 1000 per account).
💡 Sticky Analogy: Shared Office Parking 🅿️
- Normal day: Marketing uses 20 spots, Sales uses 30, Engineering uses 50
- Marketing event: Marketing takes 900 spots, others can't park!
Reserved Concurrency
Guarantee capacity for your specific Lambda.
RAG Lambda: Reserved = 100
→ Always has 100 "parking spots" guaranteed
→ Other Lambdas can't steal them
| Setting | Pros | Cons |
|---|---|---|
| Unreserved | Flexible, can burst | Others can steal capacity |
| Reserved | Guaranteed capacity | Can't exceed limit |
Provisioned Concurrency (No Cold Starts)
The Problem: First request after idle → Lambda takes 2-5 seconds to "wake up"
Provisioned Concurrency = Keep X Lambdas always warm
| Setting | Cost | Cold Starts |
|---|---|---|
| Reserved only | Pay per use | Yes (first requests) |
| Reserved + Provisioned | Pay 24/7 for warm ones | No |
Scheduled Provisioned Concurrency (Best of Both!)
Warm Lambdas only during peak hours:
5 PM EST: Scale provisioned concurrency to 50
11 PM EST: Scale down to 0
💡 Sticky Analogy: Restaurant Ovens 🍽️ Pre-heat ovens before dinner rush, turn them off after closing.
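The cost argument is simple arithmetic: warm capacity for a 6-hour peak window costs a quarter of warm capacity 24/7. The per-GB-second price below is a placeholder, not a real quote:

```python
# Rough arithmetic behind scheduled provisioned concurrency.
# price_per_gb_s is a placeholder, not a real AWS price.

def provisioned_cost(concurrency, memory_gb, hours_per_day, price_per_gb_s):
    return concurrency * memory_gb * hours_per_day * 3600 * price_per_gb_s

always_on = provisioned_cost(50, 1.0, 24, price_per_gb_s=4.2e-6)
peak_only = provisioned_cost(50, 1.0, 6, price_per_gb_s=4.2e-6)
print(round(peak_only / always_on, 2))  # 0.25 -> ~75% saved on warm capacity
```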
7. EventBridge Scheduling
What is EventBridge?
A scheduler/event router that triggers actions based on time or events.
💡 Sticky Analogy: Office Building Manager 🏢
- "At 5 PM, turn on lobby lights" → "At 5 PM, warm up 50 Lambdas"
- "At 11 PM, turn off AC" → "At 11 PM, scale down to 0"
Two Main Uses
| Type | Example |
|---|---|
| Scheduled | "Every day at 5 PM EST, do X" |
| Event-driven | "When S3 file uploaded, do Y" |
EventBridge + Auto Scaling Relationship
💡 Sticky Analogy: Thermostat vs Heater 🌡️
- EventBridge = The thermostat schedule ("Heat at 5 PM")
- Auto Scaling = The actual heater that warms the room
EventBridge says WHEN, Auto Scaling does the WORK.
Cron Syntax
```
cron(0 22 ? * MON-FRI *)
     │ │  │ │    │    │
     │ │  │ │    │    └─ Year: any
     │ │  │ │    └────── Day of week: Mon-Fri only
     │ │  │ └─────────── Month: any
     │ │  └───────────── Day of month: any
     │ └──────────────── Hour: 22:00 UTC (5 PM EST)
     └────────────────── Minute: 0
```
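As a sanity check on the field order, a tiny helper can split an AWS cron expression into its six named fields (an illustrative sketch, not an AWS API):

```python
# Sanity check on field order: split an AWS cron expression into its six
# named fields (illustrative helper, not an AWS API).

FIELDS = ["minute", "hour", "day-of-month", "month", "day-of-week", "year"]

def parse_aws_cron(expr):
    inner = expr.removeprefix("cron(").removesuffix(")")
    parts = inner.split()
    assert len(parts) == 6, "AWS cron expressions have exactly six fields"
    return dict(zip(FIELDS, parts))

parsed = parse_aws_cron("cron(0 22 ? * MON-FRI *)")
print(parsed["hour"], parsed["day-of-week"])  # 22 MON-FRI
```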
8. CloudFormation Snippets
1. SQS Queue + DLQ
```yaml
# Dead Letter Queue
RagDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: rag-dlq

# Main Queue
RagQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: rag-requests
    VisibilityTimeout: 60
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt RagDLQ.Arn
      maxReceiveCount: 3  # After 3 failures → DLQ
```
2. Lambda Function
```yaml
RagLambda:
  Type: AWS::Lambda::Function
  Properties:
    FunctionName: rag-worker
    Runtime: python3.11
    Handler: index.handler
    Timeout: 50  # Less than the queue's visibility timeout!
    ReservedConcurrentExecutions: 100  # Reserved capacity
```
3. Lambda SQS Trigger
```yaml
LambdaSQSTrigger:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    EventSourceArn: !GetAtt RagQueue.Arn
    FunctionName: !Ref RagLambda
    BatchSize: 1  # One message at a time
    MaximumBatchingWindowInSeconds: 0  # Process immediately
```
4. EventBridge Scheduled Rules
```yaml
# Warm up at 5 PM EST (22:00 UTC)
WarmUpRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 22 ? * MON-FRI *)"

# Cool down at 11 PM EST (04:00 UTC next day)
CoolDownRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 4 ? * TUE-SAT *)"
```
5. Auto Scaling for Provisioned Concurrency
Note: there is no standalone `ScheduledAction` resource type in CloudFormation; scheduled actions are declared on the `AWS::ApplicationAutoScaling::ScalableTarget` itself, and `ResourceId` must point at a function alias or version (the `live` alias below is illustrative):

```yaml
ScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: lambda
    ScalableDimension: lambda:function:ProvisionedConcurrency
    ResourceId: function:rag-worker:live  # must target an alias or version
    MinCapacity: 0
    MaxCapacity: 50
    ScheduledActions:
      # Scale UP at 5 PM EST
      - ScheduledActionName: scale-up-peak
        Schedule: "cron(0 22 ? * MON-FRI *)"
        ScalableTargetAction:
          MinCapacity: 50  # Warm 50 Lambdas!
      # Scale DOWN at 11 PM EST
      - ScheduledActionName: scale-down-off-peak
        Schedule: "cron(0 4 ? * TUE-SAT *)"
        ScalableTargetAction:
          MinCapacity: 0  # Back to cold
          MaxCapacity: 0
```
9. Complete Architecture Diagram
```
┌───────────────────────────────────────────────────────────┐
│                  RAG ARCHITECTURE STACK                   │
├───────────────────────────────────────────────────────────┤
│                                                           │
│   ┌──────────────┐         ┌──────────────┐               │
│   │ EventBridge  │────────►│ Auto Scaling │               │
│   │  (Schedule)  │         │ (Warm/Cool)  │               │
│   │    "WHEN"    │         │   "DO IT"    │               │
│   └──────────────┘         └──────┬───────┘               │
│                                   │                       │
│                                   ▼                       │
│   ┌──────────────┐         ┌──────────────┐               │
│   │ API Gateway  │────────►│  SQS Queue   │──► DLQ        │
│   │   (Entry)    │         │   (Buffer)   │   (Failures)  │
│   └──────────────┘         └──────┬───────┘               │
│                                   │                       │
│                                   ▼                       │
│                  ┌────────────────────────────┐           │
│                  │ Lambda                     │           │
│                  │  • Reserved: 100           │           │
│                  │  • Provisioned: 50 (peak)  │           │
│                  │  • BatchSize: 1            │           │
│                  └─────────────┬──────────────┘           │
│                                │                          │
│                ┌───────────────┼───────────────┐          │
│                ▼               ▼               ▼          │
│             Bedrock        Snowflake        Bedrock       │
│             (Embed)       (Vector DB)        (LLM)        │
│                                                           │
└───────────────────────────────────────────────────────────┘
```
10. Quick Reference
All Sticky Analogies
| Concept | Analogy |
|---|---|
| Lambda alone vs + SQS | Waiter waits vs puts order slip on queue |
| SQS Retry | Calling friend, trying 2-3 times |
| DLQ | Return to sender pile for undeliverable mail |
| Visibility Timeout | Library book loan period |
| Concurrency limit | Toll booth limiting highway traffic |
| Unreserved capacity | Shared office parking |
| Reserved capacity | Your own reserved parking spots |
| Provisioned concurrency | Pre-heated restaurant ovens |
| EventBridge + Auto Scaling | Thermostat (when) + Heater (do it) |
Key Configuration Rules
| Rule | Why |
|---|---|
| Timeout < VisibilityTimeout | Prevents duplicate processing |
| Reserved concurrency = downstream limit | Protects Snowflake/Bedrock |
| Provisioned during peak only | Cost optimization |
| maxReceiveCount = 3 | Retry before giving up |
Production Checklist
✅ SQS Queue with DLQ
✅ Lambda with reserved concurrency
✅ Scheduled provisioned concurrency
✅ BatchSize = 1 for chat
✅ Timeout < VisibilityTimeout
✅ Alerting on DLQ
One-Liners
| Concept | Remember It As... |
|---|---|
| SQS | Buffer for resilience |
| DLQ | Graveyard for debugging |
| Reserved | Guaranteed parking spots |
| Provisioned | Pre-warmed, no cold starts |
| EventBridge | Cron scheduler for AWS |
🎓 Key Takeaways
- SQS + DLQ = Resilience & debugging
- Reserved concurrency = Protection from noisy neighbors
- Scheduled provisioned = Cost-optimized warm Lambdas
- Streaming = Better UX (Node.js native, Python needs adapter)
- Always set Timeout < VisibilityTimeout