Getting Started with SageMaker Serverless Endpoints

December 19, 2025
AWS · SageMaker · Machine Learning · MLOps · HuggingFace

🧠 SageMaker Serverless Exploration - Complete Summary

Total Cost: $0.00 🎉

📋 Table of Contents

  1. Architecture Overview
  2. IAM Role Setup
  3. SageMaker SDK Setup
  4. Deploying a Serverless Endpoint
  5. Testing the Endpoint
  6. Cleanup & Cost Management
  7. Production Workflows
  8. High-Performance Options
  9. CPU vs GPU Selection
  10. Quick Reference

💡 Sticky Analogy: Food Truck Service

Think of SageMaker Serverless as ordering a food truck on-demand:

  • IAM Role = Your ID badge proving you're allowed to order
  • HuggingFaceModel = The menu item you're ordering
  • ServerlessInferenceConfig = Delivery preferences (memory, concurrency)
  • model.deploy() = Actually placing the order
  • Endpoint = The food truck arrives and flips the "OPEN" sign

🔐 IAM Role Setup

Why We Need It

SageMaker needs permission to access S3, ECR, and other AWS services on your behalf.

💡 Analogy: Like giving a delivery driver your house key to drop off packages while you're away.

What We Created

  • Role Name: SageMakerExecutionRole
  • ARN: arn:aws:iam::609662024349:role/SageMakerExecutionRole
  • Trust Policy: Allows sagemaker.amazonaws.com to assume the role
  • Permission Policy: AmazonSageMakerFullAccess

Trust Policy JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
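
If you prefer the CLI to the console, a minimal sketch of creating the same role looks like this (assuming the trust policy above is saved locally as trust-policy.json):

# Create the execution role using the trust policy above
aws iam create-role \
    --role-name SageMakerExecutionRole \
    --assume-role-policy-document file://trust-policy.json

# Attach the managed permission policy
aws iam attach-role-policy \
    --role-name SageMakerExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess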

📦 SageMaker SDK Setup

Version Compatibility Issue

Problem: ModuleNotFoundError: No module named 'sagemaker.huggingface'

Root Cause: SageMaker v3.x restructured modules - HuggingFace integration was removed/moved.

💡 Analogy: App Store Update

Like buying a new iPhone and finding your favorite app hasn't been updated for the new iOS yet. Rolling back to v2 is like using the "classic" version that still has everything built-in.

Solution

pip3 install "sagemaker>=2.0,<3.0"

Version Comparison

Version | HuggingFaceModel | Notes
v3.x    | ❌ Not bundled   | Modular architecture
v2.x    | ✅ Included      | Use this for HuggingFace
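
A quick sanity check that you landed on v2.x and that the HuggingFace integration imports cleanly (a tiny sketch, nothing more):

# check-sdk.py - confirm SDK version and HuggingFace integration
import sagemaker
print(sagemaker.__version__)  # should print 2.x.y

from sagemaker.huggingface import HuggingFaceModel  # raises ModuleNotFoundError on v3.x
print("HuggingFaceModel import OK")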

🚀 Deploying a Serverless Endpoint

Initial Approach (Failed)

Using S3 path directly:

model = HuggingFaceModel(
    model_data="s3://huggingface-sagemaker-models/...",  # ❌ Access denied
    ...
)

Error: ValidationException: Could not access model data at s3://...

💡 Analogy: Supplier vs Warehouse

Instead of giving the delivery truck a specific warehouse address that might be outdated, tell them "order directly from the supplier" (HuggingFace Hub) - always fresh and accessible!

Working Solution

Using HuggingFace Hub directly via environment variable:

# sagemaker-test.py - Working deployment script
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

role = "arn:aws:iam::609662024349:role/SageMakerExecutionRole"

# Use HuggingFace Hub directly instead of S3
model = HuggingFaceModel(
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    role=role,
    env={"HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english"}
)

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=1
)

predictor = model.deploy(serverless_inference_config=serverless_config)
print(f"Endpoint name: {predictor.endpoint_name}")

Deployment Progress

----!
Endpoint name: huggingface-pytorch-inference-2025-12-23-13-07-31-668

What the symbols mean:

  • Each - = Health check in progress
  • ! = Endpoint is ready!

💡 Analogy: The food truck is driving to the location, setting up the kitchen, firing up the grill, and flipping the "OPEN" sign.
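
If you'd rather see an explicit status than count dashes, the CLI reports the same transition (Creating → InService):

# Poll the endpoint status yourself (same endpoint name as printed above)
aws sagemaker describe-endpoint \
    --endpoint-name huggingface-pytorch-inference-2025-12-23-13-07-31-668 \
    --query EndpointStatus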


🧪 Testing the Endpoint

Shell Quoting Lesson Learned

Problem: Inline Python via SSH causes quote escaping nightmares.

๐Ÿ’ก Analogy: Noisy Drive-Through

Instead of shouting a complicated order through a noisy speaker (nested shell quotes), write it on paper first (file), then hand it through the window!

Solution: File Approach

# test-endpoint.py - Inference script
import boto3
import json

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

response = runtime.invoke_endpoint(
    EndpointName="huggingface-pytorch-inference-2025-12-23-13-07-31-668",
    ContentType="application/json",
    Body=json.dumps({"inputs": "I love learning AWS!"})
)

print(json.loads(response["Body"].read().decode()))

Test Results

Input                  | Label    | Score
"I love learning AWS!" | POSITIVE | 99.95%

Raw output: [{"label": "POSITIVE", "score": 0.9995132684707642}]

💡 The model is like a mood detector - it reads the emotional tone of text and tells you whether it's positive or negative, with a confidence percentage.
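
If the predictor object from the deploy script is still in scope, the SDK can make the same call without touching boto3 - a minimal sketch, assuming you're in that same Python session:

# Same inference call via the SDK predictor (JSON serialization by default)
result = predictor.predict({"inputs": "I love learning AWS!"})
print(result)  # [{'label': 'POSITIVE', 'score': ...}]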


🧹 Cleanup & Cost Management

Why Cleanup Matters

💡 Analogy: Closing the Food Truck

The endpoint is like a food truck parked with the "OPEN" sign on. A real-time endpoint charges you just for being ready to serve; a serverless one scales to zero when idle, but leaving unused resources around is still bad hygiene. Deleting = packing up and leaving!

Cleanup Commands

# 1. Delete endpoint (stops billing)
aws sagemaker delete-endpoint --endpoint-name <endpoint-name>

# 2. Delete endpoint config (free, but keeps things clean)
aws sagemaker delete-endpoint-config --endpoint-config-name <config-name>

# 3. Delete model (free, but keeps things clean)
aws sagemaker delete-model --model-name <model-name>

# 4. Verify everything is gone
aws sagemaker list-endpoints           # Should be empty
aws sagemaker list-endpoint-configs    # Should be empty
aws sagemaker list-models              # Should be empty
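
The same cleanup is also available from Python if the predictor is still around - a small sketch using the SDK's delete helpers (by default delete_endpoint removes the endpoint config as well):

# SDK equivalent of the CLI cleanup above
predictor.delete_endpoint()  # deletes endpoint + endpoint config, stops billing
predictor.delete_model()     # deletes the model registration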

What Are These Resources?

Resource        | What It Is                             | Cost       | Analogy
Endpoint        | Running inference service              | 💰 Charges | The food truck serving
Endpoint Config | Blueprint for endpoint setup           | Free       | Recipe card
Model           | Registration record pointing to model  | Free       | Catalog entry

AWS Billing Note

AWS billing has a 6-24 hour delay. Charges may not appear immediately, but for a few test calls, expect fractions of a penny.


๐Ÿญ Production Workflows

Learning vs Production

Stage Source Code
Learning HuggingFace Hub env={"HF_MODEL_ID": "..."}
Production Your S3 bucket model_data="s3://your-bucket/model.tar.gz"

Production Flow

Train model locally/SageMaker
        ↓
Save/export model (model.tar.gz)
        ↓
Upload to YOUR S3 bucket
        ↓
Deploy from S3

💡 Analogy: Restaurant vs Home Cooking

  • Today: Ordered pre-made dish from restaurant (HuggingFace Hub)
  • Production: Cook your own recipe, package it, store in your pantry (S3), serve from there
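
A hedged sketch of that flow - the bucket name, local model directory, and S3 key here are hypothetical placeholders, not ones used in this walkthrough:

# 1. Package the trained model in the layout the HuggingFace container expects
tar -czf model.tar.gz -C my-trained-model .

# 2. Upload to YOUR S3 bucket
aws s3 cp model.tar.gz s3://your-bucket/models/sentiment/model.tar.gz

Then point model_data at that object instead of setting HF_MODEL_ID:

# 3. Deploy from S3 (same class, role, and serverless config as before)
model = HuggingFaceModel(
    model_data="s3://your-bucket/models/sentiment/model.tar.gz",
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    role=role,
)
predictor = model.deploy(serverless_inference_config=serverless_config)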

🚀 High-Performance Options

Endpoint Type Comparison

Type       | Behavior                           | Best For
Serverless | Spins up on-demand, scales to zero | Low traffic, cost-sensitive, dev/test
Real-time  | Instance runs 24/7                 | High throughput, low latency, production
Async      | Queue-based, for long jobs         | Large payloads, batch processing
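
Async wasn't part of this exploration, but for completeness, a rough sketch of the queue-based option (the S3 output path is a hypothetical placeholder):

# Async endpoint: requests are queued and results are written to S3
# (assumes the HuggingFaceModel object from earlier)
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://your-bucket/async-results/"  # hypothetical bucket
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    async_inference_config=async_config
)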

Serverless vs Real-time Trade-offs

               | Serverless           | Real-time
Cold start     | 10-30 sec first call | None (always warm)
Latency        | Higher               | Lower (~ms)
Cost when idle | $0                   | Paying 24/7
High traffic   | ❌                   | ✅

💡 Analogy:

  • Serverless = Food truck that parks only when you call (cheap but slow to arrive)
  • Real-time = Restaurant that's always open (instant service but paying rent 24/7)

High Throughput + Low Latency Solution

Real-time Endpoints with Auto-Scaling

💡 Analogy: Fleet of Food Trucks

Instead of one truck that shows up when called (serverless), you have a fleet that automatically dispatches more trucks during lunch rush and sends them home when quiet.

# 1. Deploy real-time (not serverless)
predictor = model.deploy(
    initial_instance_count=2,      # Start with 2 instances
    instance_type="ml.m5.large"    # Always-on instance type
)

# 2. Add auto-scaling
import boto3

endpoint_name = predictor.endpoint_name
client = boto3.client("application-autoscaling")

# Register scalable target
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10
)

# Add scaling policy (scale based on invocations)
client.put_scaling_policy(
    PolicyName="scale-on-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300
    }
)

High-Performance Checklist 📋

When asked "How do you handle throughput & latency in SageMaker?", know these:

Concept            | Remember It As...
Endpoint Type      | How eager is your service? (Always ready / Wake on call / Queue it)
Instance Selection | Brains (CPU) vs Muscle (GPU) - match worker to job
Scaling Strategy   | When to hire/fire more workers
Cooldown Periods   | Don't panic-hire or panic-fire
Model Optimization | Make the model faster, not just more hardware

💡 Analogy: Restaurant Staffing

Running a high-performance ML service is like managing a restaurant - decide if you're 24/7 or pop-up (endpoint), hire cooks vs dishwashers (instance), know when to call in extra staff (scaling), don't overreact to one busy hour (cooldown), and train your staff to work faster (optimization).


🧠 CPU vs GPU Selection

The Confusion

"Why do we need GPU for inference? I thought GPU was only for training."

The Answer

It depends on model size and throughput, not just training vs inference.

Scenario                            | CPU           | GPU
Training                            | ❌ (too slow) | ✅ Always
Inference - Small model             | ✅            | Overkill
Inference - Large model (BERT, GPT) | ❌ (too slow) | ✅
Inference - High batch volume       | ❌            | ✅

Quick Decision Guide

Use CPU                      | Use GPU
Traditional ML (XGBoost, RF) | Deep Learning (Transformers, CNNs)
Small models                 | Large models (100M+ params)
Low inference volume         | High batch throughput
Cost-sensitive               | Latency-critical

Simple Rule: If it's a neural network AND (large OR fast) → GPU
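
In SageMaker this mostly comes down to the instance_type passed to deploy() - a sketch reusing the HuggingFaceModel from earlier (ml.g4dn.xlarge is one common GPU inference type; pick whatever fits your model and budget). Note that serverless endpoints run on CPU, so choosing GPU implies a real-time or async endpoint:

# CPU instance - fine for small models or low traffic
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# GPU instance - for large transformers or high batch throughput
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")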

💡 Analogy: Pizza Kitchen

Even after you've learned to cook (training), making 100 pizzas at once (inference) still needs industrial ovens (GPU). But making one sandwich? A regular kitchen (CPU) works fine!


📚 Quick Reference

Complete Command Sequence

# 1. Install SDK (use v2 for HuggingFace)
pip3 install "sagemaker>=2.0,<3.0"

# 2. Deploy (see Python script above)
python3 sagemaker-test.py

# 3. Test
python3 test-endpoint.py

# 4. Cleanup
aws sagemaker delete-endpoint --endpoint-name <name>
aws sagemaker delete-endpoint-config --endpoint-config-name <name>
aws sagemaker delete-model --model-name <name>

# 5. Verify
aws sagemaker list-endpoints
aws sagemaker list-endpoint-configs
aws sagemaker list-models

All Sticky Analogies

Concept                  | Analogy
SageMaker Serverless     | Food truck that arrives on-demand
IAM Role                 | ID badge / house key for delivery driver
HF Hub vs S3             | Ordering from supplier vs specific warehouse
SDK v3 vs v2             | New iPhone missing your favorite app
Shell quoting            | Noisy drive-through vs written order
Endpoint deletion        | Closing the food truck
Real-time + Auto-scaling | Fleet of food trucks
CPU vs GPU               | Brains vs Muscle / Regular vs Industrial kitchen
Cooldown periods         | Don't panic-hire or panic-fire

Key Takeaways

  1. Always use SageMaker SDK v2.x for HuggingFace models
  2. Use env={"HF_MODEL_ID": ...} instead of S3 paths for learning
  3. Always clean up endpoints after testing
  4. Serverless = cheap but slow / Real-time = fast but expensive
  5. GPU for inference only for large models or high throughput

✅ What We Accomplished

Step                                       | Status
Created IAM Role (SageMakerExecutionRole)  | ✅
Installed SageMaker SDK (v2)               | ✅
Deployed serverless HuggingFace model      | ✅
Tested sentiment analysis                  | ✅
Cleaned up all resources                   | ✅
Total cost                                 | $0.00 🎉