Production RAG Architecture with Lambda Scaling

2025-12-18
AWS · RAG · Lambda · Serverless · Architecture

🏗️ Production RAG Architecture & Lambda Scaling - Complete Summary


📋 Table of Contents

  1. RAG Architecture Overview
  2. Streaming Responses
  3. Lambda Scaling Fundamentals
  4. SQS, Retry Logic & DLQ
  5. Lambda Configuration for Production
  6. Reserved vs Provisioned Concurrency
  7. EventBridge Scheduling
  8. CloudFormation Snippets
  9. Complete Architecture Diagram
  10. Quick Reference

1. RAG Architecture Overview

The Problem

Design a scalable RAG pipeline with:

  • Snowflake as vector database
  • Bedrock (Anthropic) for LLM
  • High throughput + Low latency

The Solution Architecture

User Query
    ↓
┌─────────────────┐
│ API Gateway     │ (Entry point)
└────────┬────────┘
         ↓
┌─────────────────┐
│ SQS Queue       │ (Buffer & resilience)
└────────┬────────┘
         ↓
┌─────────────────┐
│ Lambda          │ → Embed query (Bedrock)
│ (RAG Worker)    │ → Vector search (Snowflake)
│                 │ → Generate answer (Bedrock LLM)
└────────┬────────┘
         ↓
    Response (Streaming)

Latency Breakdown (Typical RAG)

Step                    | Typical Latency
Query Embedding         | 50-100ms
Snowflake Vector Search | 50-150ms
LLM Generation          | 500ms-2000ms ← Bottleneck
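
Concretely, the three worker steps are three calls. A minimal sketch, assuming boto3's bedrock-runtime client and the snowflake-connector-python package; the model IDs and the docs table (chunk_text, embedding columns) are assumptions, while VECTOR_COSINE_SIMILARITY is Snowflake's actual vector function:

import json
import boto3
import snowflake.connector  # pip install snowflake-connector-python

bedrock = boto3.client("bedrock-runtime")
# conn = snowflake.connector.connect(...)  # credentials from Secrets Manager/env

def embed_query(text):
    # Titan v2 returns a 1024-dim embedding; model ID is an assumption
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def vector_search(conn, embedding, k=5):
    # Sketch only: real code should bind parameters instead of inlining the vector
    vec = "[" + ",".join(str(x) for x in embedding) + "]"
    cur = conn.cursor()
    cur.execute(
        f"SELECT chunk_text FROM docs "
        f"ORDER BY VECTOR_COSINE_SIMILARITY(embedding, {vec}::VECTOR(FLOAT, 1024)) DESC "
        f"LIMIT {int(k)}"
    )
    return [row[0] for row in cur.fetchall()]

def generate_answer(question, chunks):
    # Claude on Bedrock via the Anthropic Messages API; model ID is an assumption
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": "Context:\n" + "\n".join(chunks) + "\n\nQuestion: " + question,
        }],
    }
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps(body),
    )
    return json.loads(resp["body"].read())["content"][0]["text"]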

2. Streaming Responses

The Problem

Waiting 2 seconds for the full LLM response feels slow.

The Solution

Send tokens as they're generated (like ChatGPT typing effect).

Before (blocking):

LLM Lambda → Wait 2s → Return full response

After (streaming):

LLM Lambda → Token 1 (50ms) → Token 2 (50ms) → ...
User sees: "The answer is..." (immediate feedback!)

Lambda Streaming Support

Runtime | Streaming Support
Node.js | ✅ Native support
Python  | ⚠️ Requires Lambda Web Adapter

💡 Note: Lambda streaming is natively supported in Node.js. Python requires the Lambda Web Adapter, which adds complexity. TypeScript is often preferred for simpler implementation and better ecosystem support for streaming.
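
However tokens reach the client, the model side of streaming is one API swap: invoke_model_with_response_stream instead of invoke_model. A boto3 sketch (the model ID is an assumption):

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "What is RAG?"}],
}

# Returns an event stream instead of one blocking response
resp = bedrock.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumption
    body=json.dumps(body),
)

for event in resp["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    # Anthropic Messages streaming emits typed events; text arrives as deltas
    if chunk.get("type") == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="", flush=True)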


3. Lambda Scaling Fundamentals

Default Behavior

  • Lambda auto-scales to match incoming requests
  • Default: 1000 concurrent executions per account, per region
  • Can request increase from AWS

Why Limit Concurrency?

💡 Sticky Analogy: Highway & Toll Booth 🚗

  • Highway (Lambda): Can handle 10,000 cars
  • Toll booth (Snowflake/Bedrock): Only 100 cars can pass at once

If you send 10,000 cars, 9,900 crash at the toll booth! Setting concurrency = 100 protects downstream services.
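
Protecting the toll booth is a single API call. A sketch with boto3: put_function_concurrency sets reserved concurrency, which doubles as a hard cap:

import boto3

lambda_client = boto3.client("lambda")

# Cap the RAG worker at 100 concurrent executions so it can never
# open more connections than Snowflake/Bedrock can absorb
lambda_client.put_function_concurrency(
    FunctionName="rag-worker",
    ReservedConcurrentExecutions=100,
)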

When to Add SQS Queue?

Scenario                      | Use SQS?
Sync responses needed (chat)  | ❌ Direct Lambda
Traffic spikes > Lambda limit | ✅ Buffer with SQS
Retry logic needed            | ✅ DLQ support
Order matters                 | ✅ FIFO queue

💡 Sticky Analogy: Restaurant 🍽️

  • Lambda Alone: Waiter takes order, goes to kitchen, waits, brings food (sync)
  • Lambda + SQS: Waiter takes order, puts slip on queue, kitchen picks up when ready (async)

4. SQS, Retry Logic & DLQ

Retry Logic 🔄

The Problem: Things fail temporarily (Snowflake busy, Bedrock timeout, network hiccup)

💡 Sticky Analogy: Calling a Friend 📱 You call your friend, they don't pick up. Do you:

  • A) Give up immediately? (no retry)
  • B) Try 2-3 more times? (retry logic) ✅

SQS automatically retries failed messages (configurable via maxReceiveCount, anywhere from 1 to 1000 delivery attempts).
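
You don't write the retry loop yourself; the handler simply raises and SQS redelivers. A minimal sketch (process_rag_query is a hypothetical helper; ApproximateReceiveCount is the real SQS attribute that reveals the attempt number):

def process_rag_query(body):
    ...  # hypothetical: embed, search, generate (see section 1)

def handler(event, context):
    for record in event["Records"]:
        # SQS counts deliveries: 1 = first try, 2+ = retries
        attempt = int(record["attributes"]["ApproximateReceiveCount"])
        print(f"message {record['messageId']}, attempt {attempt}")

        # Raising here leaves the message undeleted, so once the visibility
        # timeout expires SQS delivers it again (the automatic "retry").
        # After maxReceiveCount failed deliveries it goes to the DLQ instead.
        process_rag_query(record["body"])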

Dead Letter Queue (DLQ) ☠️

The Problem: What if ALL retries fail?

💡 Sticky Analogy: Lost Mail 📬

  • Without DLQ: Undeliverable mail → thrown away
  • With DLQ: Undeliverable mail → "Return to Sender" pile

DLQ lets you:

  • Debug why things failed
  • Reprocess failed messages after fixing bugs
  • Alert your team ("50 messages failed today!")
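
The first two bullets map to real SQS calls. A sketch with boto3: receive_message to inspect failures, and start_message_move_task (the DLQ redrive API) to replay them after a fix; the queue URL and ARN are placeholders:

import boto3

sqs = boto3.client("sqs")
dlq_url = "https://sqs.us-east-1.amazonaws.com/123456789012/rag-dlq"  # placeholder

# 1. Debug: peek at failed messages without deleting them
failed = sqs.receive_message(
    QueueUrl=dlq_url,
    MaxNumberOfMessages=10,
    VisibilityTimeout=0,  # leave them visible for other inspectors
)
for msg in failed.get("Messages", []):
    print(msg["Body"])

# 2. Reprocess: after fixing the bug, move everything back to the
#    source queue (omitting DestinationArn redrives to the original queue)
sqs.start_message_move_task(
    SourceArn="arn:aws:sqs:us-east-1:123456789012:rag-dlq",  # placeholder
)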

Architecture Flow

1000 requests hit API Gateway
        ↓
All 1000 go into SQS queue
        ↓
Lambda processes them (auto-scales!)
        ↓
990 succeed ✅
        ↓
10 fail → SQS retries automatically
        ↓
7 succeed on retry ✅
        ↓
3 still fail → Go to DLQ
        ↓
Alert: "3 messages in DLQ!"

Visibility Timeout ⏱️

What it is: How long a message stays hidden after Lambda picks it up. If the message isn't deleted before the timeout expires, SQS assumes the Lambda failed and makes it visible again.

💡 Sticky Analogy: Library Book Checkout 📚

  • SQS Message = A book
  • Lambda taking message = You check out the book
  • Visibility Timeout = 14-day loan period
  • If you return it = Message deleted ✅
  • If you don't = Book goes back on shelf for someone else

Critical Rule:

Lambda Timeout < Visibility Timeout

Lambda Duration | Visibility Timeout | Result
30 seconds      | 60 seconds         | ✅ Works fine
60 seconds      | 30 seconds         | ❌ Duplicate processing!

(For SQS event sources, AWS's guidance goes further: set the visibility timeout to at least six times the function timeout.)
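
If a job occasionally runs long, the consumer can renew the loan mid-flight. A sketch using change_message_visibility; the queue URL is a placeholder:

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/rag-requests"  # placeholder

msgs = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in msgs.get("Messages", []):
    # Renew the "library loan": hide this message for 60 more seconds
    # because processing is taking longer than expected
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=msg["ReceiptHandle"],
        VisibilityTimeout=60,
    )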

5. Lambda Configuration for Production

Key Settings for RAG Chat

Setting            | Recommended Value | Why
Batch Size         | 1                 | Each user gets own response
Concurrency        | 100+              | 100 different USERS simultaneously
Visibility Timeout | > Lambda timeout  | Prevent duplicates

💡 Sticky Analogy: Customer Service Call Center ☎️

  • Batch Size = 1: Each agent handles ONE customer at a time
  • Concurrency = 100: You have 100 agents working simultaneously

Dynamic Scaling

Lambda auto-scales based on queue depth:

Queue Depth   | What Happens
10 messages   | ~10 Lambdas spin up
1000 messages | ~100+ Lambdas spin up
0 messages    | Scale to 0 (no cost!)
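
All three settings live on the event source mapping that connects the queue to the function. A boto3 sketch (the ARN is a placeholder; ScalingConfig.MaximumConcurrency caps how far the SQS poller scales the function):

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:rag-requests",  # placeholder
    FunctionName="rag-worker",
    BatchSize=1,                                # one chat request per invocation
    ScalingConfig={"MaximumConcurrency": 100},  # cap concurrent workers at 100
)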

6. Reserved vs Provisioned Concurrency

Unreserved (Default)

All Lambdas share a pool (default 1000 per account).

💡 Sticky Analogy: Shared Office Parking 🅿️

  • Normal day: Marketing uses 20 spots, Sales uses 30, Engineering uses 50
  • Marketing event: Marketing takes 900 spots, others can't park!

Reserved Concurrency

Guarantee capacity for your specific Lambda.

RAG Lambda: Reserved = 100
→ Always has 100 "parking spots" guaranteed
→ Other Lambdas can't steal them

Setting    | Pros                | Cons
Unreserved | Flexible, can burst | Others can steal capacity
Reserved   | Guaranteed capacity | Can't exceed limit

Provisioned Concurrency (No Cold Starts)

The Problem: First request after idle → Lambda takes 2-5 seconds to "wake up"

Provisioned Concurrency = Keep X Lambdas always warm

Setting                | Cost                   | Cold Starts
Reserved only          | Pay per use            | Yes (first requests)
Reserved + Provisioned | Pay 24/7 for warm ones | No
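
One detail the table hides: provisioned concurrency attaches to a published version or alias, never to $LATEST. A boto3 sketch, assuming a hypothetical "live" alias:

import boto3

lambda_client = boto3.client("lambda")

# Keep 50 execution environments initialized ("warm") on the alias
lambda_client.put_provisioned_concurrency_config(
    FunctionName="rag-worker",
    Qualifier="live",  # hypothetical alias; a version or alias is required
    ProvisionedConcurrentExecutions=50,
)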

Scheduled Provisioned Concurrency (Best of Both!)

Warm Lambdas only during peak hours:

5 PM EST: Scale provisioned concurrency to 50
11 PM EST: Scale down to 0

💡 Sticky Analogy: Restaurant Ovens 🍽️ Pre-heat ovens before dinner rush, turn them off after closing.
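
Under the hood, scheduled provisioned concurrency is Application Auto Scaling. A boto3 sketch of the same schedule (the CloudFormation version appears in section 8); the "live" alias is an assumption:

import boto3

aas = boto3.client("application-autoscaling")
resource_id = "function:rag-worker:live"  # hypothetical alias

aas.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=0,
    MaxCapacity=50,
)

# Warm up for the evening rush (5 PM EST = 22:00 UTC)
aas.put_scheduled_action(
    ServiceNamespace="lambda",
    ScheduledActionName="warm-up-5pm",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    Schedule="cron(0 22 ? * MON-FRI *)",
    ScalableTargetAction={"MinCapacity": 50, "MaxCapacity": 50},
)

# Cool down at 11 PM EST (04:00 UTC falls on the next day, hence TUE-SAT)
aas.put_scheduled_action(
    ServiceNamespace="lambda",
    ScheduledActionName="cool-down-11pm",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    Schedule="cron(0 4 ? * TUE-SAT *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 0},
)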


7. EventBridge Scheduling

What is EventBridge?

A scheduler/event router that triggers actions based on time or events.

💡 Sticky Analogy: Office Building Manager 🏢

  • "At 5 PM, turn on lobby lights" → "At 5 PM, warm up 50 Lambdas"
  • "At 11 PM, turn off AC" → "At 11 PM, scale down to 0"

Two Main Uses

Type         | Example
Scheduled    | "Every day at 5 PM EST, do X"
Event-driven | "When S3 file uploaded, do Y"

EventBridge + Auto Scaling Relationship

💡 Sticky Analogy: Thermostat vs Heater 🌡️

  • EventBridge = The thermostat schedule ("Heat at 5 PM")
  • Auto Scaling = The actual heater that warms the room

EventBridge says WHEN, Auto Scaling does the WORK.

Cron Syntax

cron(0 22 ? * MON-FRI *)
     │ │  │ │ │       │
     │ │  │ │ │       └─ Any year
     │ │  │ │ └───────── Mon-Fri only
     │ │  │ └─────────── Any month
     │ │  └───────────── Any day of month
     │ └──────────────── 22:00 UTC (5 PM EST)
     └────────────────── Minute 0
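
The same expression, wired up from Python. A sketch using EventBridge's put_rule and put_targets; note that a rule does nothing until it has at least one target (the target ARN below is a placeholder):

import boto3

events = boto3.client("events")

# Create (or update) the scheduled rule
events.put_rule(
    Name="warm-up-rag-lambdas",
    ScheduleExpression="cron(0 22 ? * MON-FRI *)",  # 5 PM EST, weekdays
    State="ENABLED",
)

# A rule with no targets never invokes anything; attach one
events.put_targets(
    Rule="warm-up-rag-lambdas",
    Targets=[{
        "Id": "warm-up-target",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:scale-up",  # placeholder
    }],
)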

8. CloudFormation Snippets

1. SQS Queue + DLQ

# Dead Letter Queue
RagDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: rag-dlq

# Main Queue
RagQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: rag-requests
    VisibilityTimeout: 60
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt RagDLQ.Arn
      maxReceiveCount: 3  # After 3 failures → DLQ

2. Lambda Function

RagLambda:
  Type: AWS::Lambda::Function
  Properties:
    FunctionName: rag-worker
    Runtime: python3.11
    Handler: index.handler
    Timeout: 50  # Less than visibility timeout!
    ReservedConcurrentExecutions: 100  # Reserved capacity
    Role: !GetAtt RagLambdaRole.Arn  # Required; execution role defined elsewhere
    Code:  # Required; placeholder deployment package
      S3Bucket: my-artifact-bucket
      S3Key: rag-worker.zip

3. Lambda SQS Trigger

LambdaSQSTrigger:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    EventSourceArn: !GetAtt RagQueue.Arn
    FunctionName: !Ref RagLambda
    BatchSize: 1  # One message at a time
    MaximumBatchingWindowInSeconds: 0  # Process immediately

4. EventBridge Scheduled Rules

# Warm up at 5 PM EST (22:00 UTC)
WarmUpRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 22 ? * MON-FRI *)"
    # A rule also needs a Targets property before it invokes anything

# Cool down at 11 PM EST (04:00 UTC falls on the next day, hence TUE-SAT)
CoolDownRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 4 ? * TUE-SAT *)"
    # Same here: add Targets pointing at whatever should run on this schedule

For provisioned concurrency specifically you can skip EventBridge rules entirely: Application Auto Scaling (next snippet) accepts the cron schedule directly.

5. Auto Scaling for Provisioned Concurrency

# Note: scheduled actions are a property of the ScalableTarget itself;
# CloudFormation has no standalone ScheduledAction resource for this.
ScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: lambda
    ScalableDimension: lambda:function:ProvisionedConcurrency
    ResourceId: function:rag-worker:live  # Needs a version or alias ("live" here)
    MinCapacity: 0
    MaxCapacity: 50
    ScheduledActions:
      # Scale UP at 5 PM
      - ScheduledActionName: scale-up-5pm
        Schedule: "cron(0 22 ? * MON-FRI *)"
        ScalableTargetAction:
          MinCapacity: 50  # Warm 50 Lambdas!
          MaxCapacity: 50
      # Scale DOWN at 11 PM
      - ScheduledActionName: scale-down-11pm
        Schedule: "cron(0 4 ? * TUE-SAT *)"
        ScalableTargetAction:
          MinCapacity: 0  # Back to cold
          MaxCapacity: 0

9. Complete Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                   RAG ARCHITECTURE STACK                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐         ┌──────────────┐                  │
│  │ EventBridge  │────────►│ Auto Scaling │                  │
│  │ (Schedule)   │         │ (Warm/Cool)  │                  │
│  │ "WHEN"       │         │ "DO IT"      │                  │
│  └──────────────┘         └──────┬───────┘                  │
│                                  │                          │
│                                  ▼                          │
│  ┌──────────────┐         ┌──────────────┐                  │
│  │ API Gateway  │────────►│  SQS Queue   │──► DLQ           │
│  │ (Entry)      │         │  (Buffer)    │   (Failures)     │
│  └──────────────┘         └──────┬───────┘                  │
│                                  │                          │
│                                  ▼                          │
│                   ┌──────────────────────────┐              │
│                   │ Lambda                   │              │
│                   │ • Reserved: 100          │              │
│                   │ • Provisioned: 50 (peak) │              │
│                   │ • BatchSize: 1           │              │
│                   └──────────────┬───────────┘              │
│                                  │                          │
│                    ┌─────────────┼─────────────┐            │
│                    ▼             ▼             ▼            │
│               Bedrock       Snowflake      Bedrock          │
│              (Embed)       (Vector DB)      (LLM)           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

10. Quick Reference

All Sticky Analogies

Concept                    | Analogy
Lambda alone vs + SQS      | Waiter waits vs puts order slip on queue
SQS Retry                  | Calling friend, trying 2-3 times
DLQ                        | Return-to-sender pile for undeliverable mail
Visibility Timeout         | Library book loan period
Concurrency limit          | Toll booth limiting highway traffic
Unreserved capacity        | Shared office parking
Reserved capacity          | Your own reserved parking spots
Provisioned concurrency    | Pre-heated restaurant ovens
EventBridge + Auto Scaling | Thermostat (when) + heater (do it)

Key Configuration Rules

Rule                         | Why
Timeout < VisibilityTimeout  | Prevents duplicate processing
Reserved = downstream limit  | Protects Snowflake/Bedrock
Provisioned during peak only | Cost optimization
maxReceiveCount = 3          | Retry before giving up

Production Checklist

✅ SQS Queue with DLQ
✅ Lambda with reserved concurrency
✅ Scheduled provisioned concurrency
✅ BatchSize = 1 for chat
✅ Timeout < VisibilityTimeout
✅ Alerting on DLQ

One-Liners

Concept     | Remember It As...
SQS         | Buffer for resilience
DLQ         | Graveyard for debugging
Reserved    | Guaranteed parking spots
Provisioned | Pre-warmed, no cold starts
EventBridge | Cron scheduler for AWS

🎓 Key Takeaways

  1. SQS + DLQ = Resilience & debugging
  2. Reserved concurrency = Protection from noisy neighbors
  3. Scheduled provisioned = Cost-optimized warm Lambdas
  4. Streaming = Better UX (Node.js native, Python needs adapter)
  5. Always set Timeout < VisibilityTimeout