Production RAG Architecture with Lambda Scaling

2025-12-18
AWS · RAG · Lambda · Serverless · Architecture

🏗️ Production RAG Architecture & Lambda Scaling - Complete Summary


📋 Table of Contents

  1. RAG Architecture Overview
  2. Streaming Responses
  3. Lambda Scaling Fundamentals
  4. SQS, Retry Logic & DLQ
  5. Lambda Configuration for Production
  6. Reserved vs Provisioned Concurrency
  7. EventBridge Scheduling
  8. CloudFormation Snippets
  9. Complete Architecture Diagram
  10. Quick Reference

1. RAG Architecture Overview

The Problem

Design a scalable RAG pipeline with:

  • Snowflake as vector database
  • Bedrock (Anthropic) for LLM
  • High throughput + Low latency

The Solution Architecture

User Query
    ↓
┌─────────────────┐
│ API Gateway     │ (Entry point)
└────────┬────────┘
         ↓
┌─────────────────┐
│ SQS Queue       │ (Buffer & resilience)
└────────┬────────┘
         ↓
┌─────────────────┐
│ Lambda          │ → Embed query (Bedrock)
│ (RAG Worker)    │ → Vector search (Snowflake)
│                 │ → Generate answer (Bedrock LLM)
└────────┬────────┘
         ↓
    Response (Streaming)

Latency Breakdown (Typical RAG)

Step                    | Typical Latency
Query Embedding         | 50-100ms
Snowflake Vector Search | 50-150ms
LLM Generation          | 500ms-2000ms ← Bottleneck
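
Concretely, the three worker steps are three calls. A minimal sketch, assuming boto3's bedrock-runtime client and the snowflake-connector-python package; the model IDs and the docs table (chunk_text, embedding columns) are assumptions, while VECTOR_COSINE_SIMILARITY is Snowflake's actual vector function:

import json
import boto3
import snowflake.connector  # pip install snowflake-connector-python

bedrock = boto3.client("bedrock-runtime")
# conn = snowflake.connector.connect(...)  # credentials from Secrets Manager/env

def embed_query(text):
    # Titan v2 returns a 1024-dim embedding; model ID is an assumption
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def vector_search(conn, embedding, k=5):
    # Sketch only: real code should bind parameters instead of inlining the vector
    vec = "[" + ",".join(str(x) for x in embedding) + "]"
    cur = conn.cursor()
    cur.execute(
        f"SELECT chunk_text FROM docs "
        f"ORDER BY VECTOR_COSINE_SIMILARITY(embedding, {vec}::VECTOR(FLOAT, 1024)) DESC "
        f"LIMIT {int(k)}"
    )
    return [row[0] for row in cur.fetchall()]

def generate_answer(question, chunks):
    # Claude on Bedrock via the Anthropic Messages API; model ID is an assumption
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": "Context:\n" + "\n".join(chunks) + "\n\nQuestion: " + question,
        }],
    }
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps(body),
    )
    return json.loads(resp["body"].read())["content"][0]["text"]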

2. Streaming Responses

The Problem

Waiting 2 seconds for the full LLM response feels slow.

The Solution

Send tokens as they're generated (like ChatGPT typing effect).

Before (blocking):

LLM Lambda → Wait 2s → Return full response

After (streaming):

LLM Lambda → Token 1 (50ms) → Token 2 (50ms) → ...
User sees: "The answer is..." (immediate feedback!)

Lambda Streaming Support

Runtime | Streaming Support
Node.js | ✅ Native support
Python  | ⚠️ Requires Lambda Web Adapter

💡 Note: Lambda streaming is natively supported in Node.js. Python requires the Lambda Web Adapter, which adds complexity. TypeScript is often preferred for simpler implementation and better ecosystem support for streaming.
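
However tokens reach the client, the model side of streaming is one API swap: invoke_model_with_response_stream instead of invoke_model. A boto3 sketch (the model ID is an assumption):

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "What is RAG?"}],
}

# Returns an event stream instead of one blocking response
resp = bedrock.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumption
    body=json.dumps(body),
)

for event in resp["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    # Anthropic Messages streaming emits typed events; text arrives as deltas
    if chunk.get("type") == "content_block_delta":
        print(chunk["delta"].get("text", ""), end="", flush=True)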


3. Lambda Scaling Fundamentals

Default Behavior

  • Lambda auto-scales to match incoming requests
  • Default: 1000 concurrent executions per account, per region
  • Can request increase from AWS

Why Limit Concurrency?

💡 Sticky Analogy: Highway & Toll Booth 🚗

  • Highway (Lambda): Can handle 10,000 cars
  • Toll booth (Snowflake/Bedrock): Only 100 cars can pass at once

If you send 10,000 cars, 9,900 crash at the toll booth! Setting concurrency = 100 protects downstream services.
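
Protecting the toll booth is a single API call. A sketch with boto3: put_function_concurrency sets reserved concurrency, which doubles as a hard cap:

import boto3

lambda_client = boto3.client("lambda")

# Cap the RAG worker at 100 concurrent executions so it can never
# open more connections than Snowflake/Bedrock can absorb
lambda_client.put_function_concurrency(
    FunctionName="rag-worker",
    ReservedConcurrentExecutions=100,
)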

When to Add SQS Queue?

Scenario                      | Use SQS?
Sync responses needed (chat)  | ❌ Direct Lambda
Traffic spikes > Lambda limit | ✅ Buffer with SQS
Retry logic needed            | ✅ DLQ support
Order matters                 | ✅ FIFO queue

💡 Sticky Analogy: Restaurant 🍽️

  • Lambda Alone: Waiter takes order, goes to kitchen, waits, brings food (sync)
  • Lambda + SQS: Waiter takes order, puts slip on queue, kitchen picks up when ready (async)

4. SQS, Retry Logic & DLQ

Retry Logic 🔄

The Problem: Things fail temporarily (Snowflake busy, Bedrock timeout, network hiccup)

💡 Sticky Analogy: Calling a Friend 📱 You call your friend, they don't pick up. Do you:

  • A) Give up immediately? (no retry)
  • B) Try 2-3 more times? (retry logic) ✅

SQS automatically retries failed messages (configurable via maxReceiveCount, anywhere from 1 to 1000 delivery attempts).
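
You don't write the retry loop yourself; the handler simply raises and SQS redelivers. A minimal sketch (process_rag_query is a hypothetical helper; ApproximateReceiveCount is the real SQS attribute that reveals the attempt number):

def process_rag_query(body):
    ...  # hypothetical: embed, search, generate (see section 1)

def handler(event, context):
    for record in event["Records"]:
        # SQS counts deliveries: 1 = first try, 2+ = retries
        attempt = int(record["attributes"]["ApproximateReceiveCount"])
        print(f"message {record['messageId']}, attempt {attempt}")

        # Raising here leaves the message undeleted, so once the visibility
        # timeout expires SQS delivers it again (the automatic "retry").
        # After maxReceiveCount failed deliveries it goes to the DLQ instead.
        process_rag_query(record["body"])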

Dead Letter Queue (DLQ) ☠️

The Problem: What if ALL retries fail?

💡 Sticky Analogy: Lost Mail 📬

  • Without DLQ: Undeliverable mail → thrown away
  • With DLQ: Undeliverable mail → "Return to Sender" pile

DLQ lets you:

  • Debug why things failed
  • Reprocess failed messages after fixing bugs
  • Alert your team ("50 messages failed today!")
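
The first two bullets map to real SQS calls. A sketch with boto3: receive_message to inspect failures, and start_message_move_task (the DLQ redrive API) to replay them after a fix; the queue URL and ARN are placeholders:

import boto3

sqs = boto3.client("sqs")
dlq_url = "https://sqs.us-east-1.amazonaws.com/123456789012/rag-dlq"  # placeholder

# 1. Debug: peek at failed messages without deleting them
failed = sqs.receive_message(
    QueueUrl=dlq_url,
    MaxNumberOfMessages=10,
    VisibilityTimeout=0,  # leave them visible for other inspectors
)
for msg in failed.get("Messages", []):
    print(msg["Body"])

# 2. Reprocess: after fixing the bug, move everything back to the
#    source queue (omitting DestinationArn redrives to the original queue)
sqs.start_message_move_task(
    SourceArn="arn:aws:sqs:us-east-1:123456789012:rag-dlq",  # placeholder
)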

Architecture Flow

1000 requests hit API Gateway
        ↓
All 1000 go into SQS queue
        ↓
Lambda processes them (auto-scales!)
        ↓
990 succeed ✅
        ↓
10 fail → SQS retries automatically
        ↓
7 succeed on retry ✅
        ↓
3 still fail → Go to DLQ
        ↓
Alert: "3 messages in DLQ!"

Visibility Timeout ⏱️

What it is: How long a message stays hidden after Lambda picks it up. If the message isn't deleted before the timeout expires, SQS assumes the Lambda failed and makes it visible again.

💡 Sticky Analogy: Library Book Checkout 📚

  • SQS Message = A book
  • Lambda taking message = You check out the book
  • Visibility Timeout = 14-day loan period
  • If you return it = Message deleted ✅
  • If you don't = Book goes back on shelf for someone else

Critical Rule:

Lambda Timeout < Visibility Timeout

Lambda Duration | Visibility Timeout | Result
30 seconds      | 60 seconds         | ✅ Works fine
60 seconds      | 30 seconds         | ❌ Duplicate processing!

(For SQS event sources, AWS's guidance goes further: set the visibility timeout to at least six times the function timeout.)
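
If a job occasionally runs long, the consumer can renew the loan mid-flight. A sketch using change_message_visibility; the queue URL is a placeholder:

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/rag-requests"  # placeholder

msgs = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in msgs.get("Messages", []):
    # Renew the "library loan": hide this message for 60 more seconds
    # because processing is taking longer than expected
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=msg["ReceiptHandle"],
        VisibilityTimeout=60,
    )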

5. Lambda Configuration for Production

Key Settings for RAG Chat

Setting            | Recommended Value | Why
Batch Size         | 1                 | Each user gets own response
Concurrency        | 100+              | 100 different USERS simultaneously
Visibility Timeout | > Lambda timeout  | Prevent duplicates

💡 Sticky Analogy: Customer Service Call Center ☎️

  • Batch Size = 1: Each agent handles ONE customer at a time
  • Concurrency = 100: You have 100 agents working simultaneously

Dynamic Scaling

Lambda auto-scales based on queue depth:

Queue Depth   | What Happens
10 messages   | ~10 Lambdas spin up
1000 messages | ~100+ Lambdas spin up
0 messages    | Scale to 0 (no cost!)
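
All three settings live on the event source mapping that connects the queue to the function. A boto3 sketch (the ARN is a placeholder; ScalingConfig.MaximumConcurrency caps how far the SQS poller scales the function):

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:rag-requests",  # placeholder
    FunctionName="rag-worker",
    BatchSize=1,                                # one chat request per invocation
    ScalingConfig={"MaximumConcurrency": 100},  # cap concurrent workers at 100
)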

6. Reserved vs Provisioned Concurrency

Unreserved (Default)

All Lambdas share a pool (default 1000 per account).

💡 Sticky Analogy: Shared Office Parking 🅿️

  • Normal day: Marketing uses 20 spots, Sales uses 30, Engineering uses 50
  • Marketing event: Marketing takes 900 spots, others can't park!

Reserved Concurrency

Guarantee capacity for your specific Lambda.

RAG Lambda: Reserved = 100
→ Always has 100 "parking spots" guaranteed
→ Other Lambdas can't steal them

Setting    | Pros                | Cons
Unreserved | Flexible, can burst | Others can steal capacity
Reserved   | Guaranteed capacity | Can't exceed limit

Provisioned Concurrency (No Cold Starts)

The Problem: First request after idle → Lambda takes 2-5 seconds to "wake up"

Provisioned Concurrency = Keep X Lambdas always warm

Setting                | Cost                   | Cold Starts
Reserved only          | Pay per use            | Yes (first requests)
Reserved + Provisioned | Pay 24/7 for warm ones | No
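
One detail the table hides: provisioned concurrency attaches to a published version or alias, never to $LATEST. A boto3 sketch, assuming a hypothetical "live" alias:

import boto3

lambda_client = boto3.client("lambda")

# Keep 50 execution environments initialized ("warm") on the alias
lambda_client.put_provisioned_concurrency_config(
    FunctionName="rag-worker",
    Qualifier="live",  # hypothetical alias; a version or alias is required
    ProvisionedConcurrentExecutions=50,
)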

Scheduled Provisioned Concurrency (Best of Both!)

Warm Lambdas only during peak hours:

5 PM EST: Scale provisioned concurrency to 50
11 PM EST: Scale down to 0

💡 Sticky Analogy: Restaurant Ovens 🍽️ Pre-heat ovens before dinner rush, turn them off after closing.
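
Under the hood, scheduled provisioned concurrency is Application Auto Scaling. A boto3 sketch of the same schedule (the CloudFormation version appears in section 8); the "live" alias is an assumption:

import boto3

aas = boto3.client("application-autoscaling")
resource_id = "function:rag-worker:live"  # hypothetical alias

aas.register_scalable_target(
    ServiceNamespace="lambda",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    MinCapacity=0,
    MaxCapacity=50,
)

# Warm up for the evening rush (5 PM EST = 22:00 UTC)
aas.put_scheduled_action(
    ServiceNamespace="lambda",
    ScheduledActionName="warm-up-5pm",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    Schedule="cron(0 22 ? * MON-FRI *)",
    ScalableTargetAction={"MinCapacity": 50, "MaxCapacity": 50},
)

# Cool down at 11 PM EST (04:00 UTC falls on the next day, hence TUE-SAT)
aas.put_scheduled_action(
    ServiceNamespace="lambda",
    ScheduledActionName="cool-down-11pm",
    ResourceId=resource_id,
    ScalableDimension="lambda:function:ProvisionedConcurrency",
    Schedule="cron(0 4 ? * TUE-SAT *)",
    ScalableTargetAction={"MinCapacity": 0, "MaxCapacity": 0},
)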


7. EventBridge Scheduling

What is EventBridge?

A scheduler/event router that triggers actions based on time or events.

💡 Sticky Analogy: Office Building Manager 🏢

  • "At 5 PM, turn on lobby lights" → "At 5 PM, warm up 50 Lambdas"
  • "At 11 PM, turn off AC" → "At 11 PM, scale down to 0"

Two Main Uses

Type         | Example
Scheduled    | "Every day at 5 PM EST, do X"
Event-driven | "When S3 file uploaded, do Y"

EventBridge + Auto Scaling Relationship

💡 Sticky Analogy: Thermostat vs Heater 🌡️

  • EventBridge = The thermostat schedule ("Heat at 5 PM")
  • Auto Scaling = The actual heater that warms the room

EventBridge says WHEN, Auto Scaling does the WORK.

Cron Syntax

cron(0 22 ? * MON-FRI *)
     │ │  │ │ │       │
     │ │  │ │ │       └─ Any year
     │ │  │ │ └───────── Mon-Fri only
     │ │  │ └─────────── Any month
     │ │  └───────────── Any day of month
     │ └──────────────── 22:00 UTC (5 PM EST)
     └────────────────── Minute 0
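
The same expression, wired up from Python. A sketch using EventBridge's put_rule and put_targets; note that a rule does nothing until it has at least one target (the target ARN below is a placeholder):

import boto3

events = boto3.client("events")

# Create (or update) the scheduled rule
events.put_rule(
    Name="warm-up-rag-lambdas",
    ScheduleExpression="cron(0 22 ? * MON-FRI *)",  # 5 PM EST, weekdays
    State="ENABLED",
)

# A rule with no targets never invokes anything; attach one
events.put_targets(
    Rule="warm-up-rag-lambdas",
    Targets=[{
        "Id": "warm-up-target",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:scale-up",  # placeholder
    }],
)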

8. CloudFormation Snippets

1. SQS Queue + DLQ

# Dead Letter Queue
RagDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: rag-dlq

# Main Queue
RagQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: rag-requests
    VisibilityTimeout: 60
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt RagDLQ.Arn
      maxReceiveCount: 3  # After 3 failures → DLQ

2. Lambda Function

RagLambda:
  Type: AWS::Lambda::Function
  Properties:
    FunctionName: rag-worker
    Runtime: python3.11
    Handler: index.handler
    Timeout: 50  # Less than visibility timeout!
    ReservedConcurrentExecutions: 100  # Reserved capacity
    Role: !GetAtt RagLambdaRole.Arn  # Required; execution role defined elsewhere
    Code:  # Required; placeholder deployment package
      S3Bucket: my-artifact-bucket
      S3Key: rag-worker.zip

3. Lambda SQS Trigger

LambdaSQSTrigger:
  Type: AWS::Lambda::EventSourceMapping
  Properties:
    EventSourceArn: !GetAtt RagQueue.Arn
    FunctionName: !Ref RagLambda
    BatchSize: 1  # One message at a time
    MaximumBatchingWindowInSeconds: 0  # Process immediately

4. EventBridge Scheduled Rules

# Warm up at 5 PM EST (22:00 UTC)
WarmUpRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 22 ? * MON-FRI *)"
    # A rule also needs a Targets property before it invokes anything

# Cool down at 11 PM EST (04:00 UTC falls on the next day, hence TUE-SAT)
CoolDownRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 4 ? * TUE-SAT *)"
    # Same here: add Targets pointing at whatever should run on this schedule

For provisioned concurrency specifically you can skip EventBridge rules entirely: Application Auto Scaling (next snippet) accepts the cron schedule directly.

5. Auto Scaling for Provisioned Concurrency

# Note: scheduled actions are a property of the ScalableTarget itself;
# CloudFormation has no standalone ScheduledAction resource for this.
ScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: lambda
    ScalableDimension: lambda:function:ProvisionedConcurrency
    ResourceId: function:rag-worker:live  # Needs a version or alias ("live" here)
    MinCapacity: 0
    MaxCapacity: 50
    ScheduledActions:
      # Scale UP at 5 PM
      - ScheduledActionName: scale-up-5pm
        Schedule: "cron(0 22 ? * MON-FRI *)"
        ScalableTargetAction:
          MinCapacity: 50  # Warm 50 Lambdas!
          MaxCapacity: 50
      # Scale DOWN at 11 PM
      - ScheduledActionName: scale-down-11pm
        Schedule: "cron(0 4 ? * TUE-SAT *)"
        ScalableTargetAction:
          MinCapacity: 0  # Back to cold
          MaxCapacity: 0

9. Complete Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                   RAG ARCHITECTURE STACK                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐         ┌──────────────┐                  │
│  │ EventBridge  │────────►│ Auto Scaling │                  │
│  │ (Schedule)   │         │ (Warm/Cool)  │                  │
│  │ "WHEN"       │         │ "DO IT"      │                  │
│  └──────────────┘         └──────┬───────┘                  │
│                                  │                          │
│                                  ▼                          │
│  ┌──────────────┐         ┌──────────────┐                  │
│  │ API Gateway  │────────►│  SQS Queue   │──► DLQ           │
│  │ (Entry)      │         │  (Buffer)    │   (Failures)     │
│  └──────────────┘         └──────┬───────┘                  │
│                                  │                          │
│                                  ▼                          │
│                   ┌──────────────────────────┐              │
│                   │ Lambda                   │              │
│                   │ • Reserved: 100          │              │
│                   │ • Provisioned: 50 (peak) │              │
│                   │ • BatchSize: 1           │              │
│                   └──────────────┬───────────┘              │
│                                  │                          │
│                    ┌─────────────┼─────────────┐            │
│                    ▼             ▼             ▼            │
│               Bedrock       Snowflake      Bedrock          │
│              (Embed)       (Vector DB)      (LLM)           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

10. Quick Reference

All Sticky Analogies

Concept                    | Analogy
Lambda alone vs + SQS      | Waiter waits vs puts order slip on queue
SQS Retry                  | Calling friend, trying 2-3 times
DLQ                        | Return-to-sender pile for undeliverable mail
Visibility Timeout         | Library book loan period
Concurrency limit          | Toll booth limiting highway traffic
Unreserved capacity        | Shared office parking
Reserved capacity          | Your own reserved parking spots
Provisioned concurrency    | Pre-heated restaurant ovens
EventBridge + Auto Scaling | Thermostat (when) + heater (do it)

Key Configuration Rules

Rule                         | Why
Timeout < VisibilityTimeout  | Prevents duplicate processing
Reserved = downstream limit  | Protects Snowflake/Bedrock
Provisioned during peak only | Cost optimization
maxReceiveCount = 3          | Retry before giving up

Production Checklist

✅ SQS Queue with DLQ
✅ Lambda with reserved concurrency
✅ Scheduled provisioned concurrency
✅ BatchSize = 1 for chat
✅ Timeout < VisibilityTimeout
✅ Alerting on DLQ

One-Liners

Concept     | Remember It As...
SQS         | Buffer for resilience
DLQ         | Graveyard for debugging
Reserved    | Guaranteed parking spots
Provisioned | Pre-warmed, no cold starts
EventBridge | Cron scheduler for AWS

🎓 Key Takeaways

  1. SQS + DLQ = Resilience & debugging
  2. Reserved concurrency = Protection from noisy neighbors
  3. Scheduled provisioned = Cost-optimized warm Lambdas
  4. Streaming = Better UX (Node.js native, Python needs adapter)
  5. Always set Timeout < VisibilityTimeout