Getting Started with SageMaker Serverless Endpoints
SageMaker Serverless Exploration - Complete Summary
Total Cost: $0.00
Table of Contents
- Architecture Overview
- IAM Role Setup
- SageMaker SDK Setup
- Deploying a Serverless Endpoint
- Testing the Endpoint
- Cleanup & Cost Management
- Production Workflows
- High-Performance Options
- CPU vs GPU Selection
- Quick Reference
Sticky Analogy: Food Truck Service
Think of SageMaker Serverless as ordering a food truck on-demand:
- IAM Role = Your ID badge proving you're allowed to order
- HuggingFaceModel = The menu item you're ordering
- ServerlessInferenceConfig = Delivery preferences (memory, concurrency)
- model.deploy() = Actually placing the order
- Endpoint = The food truck arrives and flips the "OPEN" sign
IAM Role Setup
Why We Need It
SageMaker needs permission to access S3, ECR, and other AWS services on your behalf.
Analogy: Like giving a delivery driver your house key to drop off packages while you're away.
What We Created
- Role Name: SageMakerExecutionRole
- ARN: arn:aws:iam::609662024349:role/SageMakerExecutionRole
- Trust Policy: Allows sagemaker.amazonaws.com to assume the role
- Permission Policy: AmazonSageMakerFullAccess
Trust Policy JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
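However the role was originally created, a minimal boto3 sketch of the same setup (assuming your credentials are allowed to create IAM roles) could look like this:
# create_role.py - hypothetical helper mirroring the role setup above
import json
import boto3

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam = boto3.client("iam")

# Create the role with the trust policy so SageMaker can assume it
iam.create_role(
    RoleName="SageMakerExecutionRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the managed permission policy used in this walkthrough
iam.attach_role_policy(
    RoleName="SageMakerExecutionRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
)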
SageMaker SDK Setup
Version Compatibility Issue
Problem: ModuleNotFoundError: No module named 'sagemaker.huggingface'
Root Cause: SageMaker v3.x restructured modules - HuggingFace integration was removed/moved.
Analogy: App Store Update
Like buying a new iPhone and finding your favorite app hasn't been updated for the new iOS yet. Rolling back to v2 is like using the "classic" version that still has everything built-in.
Solution
pip3 install "sagemaker>=2.0,<3.0"
Version Comparison
| Version | HuggingFaceModel | Notes |
|---|---|---|
| v3.x | ❌ Not bundled | Modular architecture |
| v2.x | ✅ Included | Use this for HuggingFace |
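A quick way to confirm which major version actually got installed (not part of the original walkthrough, just a sanity check):
# check_version.py - sanity-check the installed SDK before deploying
import sagemaker
print(sagemaker.__version__)  # should start with "2." for this walkthrough

# On v3.x this is the import that fails with ModuleNotFoundError
from sagemaker.huggingface import HuggingFaceModel
print("HuggingFaceModel import OK")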
Deploying a Serverless Endpoint
Initial Approach (Failed)
Using S3 path directly:
model = HuggingFaceModel(
    model_data="s3://huggingface-sagemaker-models/...",  # ❌ Access denied
    ...
)
Error: ValidationException: Could not access model data at s3://...
Analogy: Supplier vs Warehouse
Instead of giving the delivery truck a specific warehouse address that might be outdated, tell them "order directly from the supplier" (HuggingFace Hub) - always fresh and accessible!
Working Solution
Using HuggingFace Hub directly via environment variable:
# sagemaker-test.py - Working deployment script
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig
role = "arn:aws:iam::609662024349:role/SageMakerExecutionRole"
# Use HuggingFace Hub directly instead of S3
model = HuggingFaceModel(
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    role=role,
    env={"HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english"}
)
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=1
)
predictor = model.deploy(serverless_inference_config=serverless_config)
print(f"Endpoint name: {predictor.endpoint_name}")
Deployment Progress
----!
Endpoint name: huggingface-pytorch-inference-2025-12-23-13-07-31-668
What the symbols mean:
- Each - = Health check in progress
- ! = Endpoint is ready!
Analogy: The food truck is driving to the location, setting up the kitchen, firing up the grill, and flipping the "OPEN" sign.
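If you'd rather ask the truck directly than watch the dashes, the endpoint status can be polled with boto3 (a small sketch; the endpoint name is the one printed above):
# check_status.py - ask SageMaker whether the endpoint is ready yet
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
resp = sm.describe_endpoint(
    EndpointName="huggingface-pytorch-inference-2025-12-23-13-07-31-668"
)
print(resp["EndpointStatus"])  # "Creating" while deploying, "InService" when ready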
Testing the Endpoint
Shell Quoting Lesson Learned
Problem: Inline Python via SSH causes quote escaping nightmares.
Analogy: Noisy Drive-Through
Instead of shouting a complicated order through a noisy speaker (nested shell quotes), write it on paper first (file), then hand it through the window!
Solution: File Approach
# test-endpoint.py - Inference script
import boto3
import json
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
response = runtime.invoke_endpoint(
    EndpointName="huggingface-pytorch-inference-2025-12-23-13-07-31-668",
    ContentType="application/json",
    Body=json.dumps({"inputs": "I love learning AWS!"})
)
print(json.loads(response["Body"].read().decode()))
Test Results
| Input | Label | Score |
|---|---|---|
| "I love learning AWS!" | POSITIVE | 99.95% |
[{"label": "POSITIVE", "score": 0.9995132684707642}]
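If the predictor object from the deployment script is still in scope, the SDK can also call the endpoint directly (a sketch, relying on the HuggingFace predictor's default JSON serialization):
# Alternative to the boto3 call above: invoke via the SDK predictor
result = predictor.predict({"inputs": "I love learning AWS!"})
print(result)  # e.g. [{"label": "POSITIVE", "score": 0.99...}]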
The model is like a mood detector - it reads the emotional tone of text and tells you whether it's positive or negative, with a confidence percentage.
Cleanup & Cost Management
Why Cleanup Matters
Analogy: Closing the Food Truck
The endpoint is like a food truck parked with the "OPEN" sign on. Even if no customers come, there's a small cost for being ready to serve. Deleting = packing up and leaving!
Cleanup Commands
# 1. Delete endpoint (stops billing)
aws sagemaker delete-endpoint --endpoint-name <endpoint-name>
# 2. Delete endpoint config (free, but keeps things clean)
aws sagemaker delete-endpoint-config --endpoint-config-name <config-name>
# 3. Delete model (free, but keeps things clean)
aws sagemaker delete-model --model-name <model-name>
# 4. Verify everything is gone
aws sagemaker list-endpoints # Should be empty
aws sagemaker list-endpoint-configs # Should be empty
aws sagemaker list-models # Should be empty
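If the predictor from the deployment script is still around, the same cleanup can be done from Python instead of the CLI (a sketch using the v2 SDK helpers):
# Cleanup via the SDK instead of the AWS CLI
predictor.delete_endpoint(delete_endpoint_config=True)  # stops billing and removes the config
predictor.delete_model()                                # removes the model registration record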
What Are These Resources?
| Resource | What It Is | Cost | Analogy |
|---|---|---|---|
| Endpoint | Running inference service | Charges | The food truck serving |
| Endpoint Config | Blueprint for endpoint setup | Free | Recipe card |
| Model | Registration record pointing to model | Free | Catalog entry |
AWS Billing Note
AWS billing has a 6-24 hour delay. Charges may not appear immediately, but for a few test calls, expect fractions of a penny.
Production Workflows
Learning vs Production
| Stage | Source | Code |
|---|---|---|
| Learning | HuggingFace Hub | env={"HF_MODEL_ID": "..."} |
| Production | Your S3 bucket | model_data="s3://your-bucket/model.tar.gz" |
Production Flow
Train model locally/SageMaker
    ↓
Save/export model (model.tar.gz)
    ↓
Upload to YOUR S3 bucket
    ↓
Deploy from S3
Analogy: Restaurant vs Home Cooking
- Today: Ordered pre-made dish from restaurant (HuggingFace Hub)
- Production: Cook your own recipe, package it, store in your pantry (S3), serve from there
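Deploying from your own bucket looks almost identical to the Hub-based script above; a minimal sketch, with the bucket path being the placeholder from the table:
# production_deploy.py - sketch of deploying your packaged model from S3
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

role = "arn:aws:iam::609662024349:role/SageMakerExecutionRole"

model = HuggingFaceModel(
    model_data="s3://your-bucket/model.tar.gz",  # your own model.tar.gz, not the HF Hub
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    role=role
)

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=1
    )
)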
High-Performance Options
Endpoint Type Comparison
| Type | Behavior | Best For |
|---|---|---|
| Serverless | Spins up on-demand, scales to zero | Low traffic, cost-sensitive, dev/test |
| Real-time | Instance runs 24/7 | High throughput, low latency, production |
| Async | Queue-based, for long jobs | Large payloads, batch processing |
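Serverless and real-time deployments are shown elsewhere in these notes; for the async option, a minimal sketch might look like this (the output path is a placeholder bucket you would supply):
# Sketch: deploy as an async endpoint for long-running jobs and large payloads
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://your-bucket/async-results/",  # where responses are written
    max_concurrent_invocations_per_instance=4
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    async_inference_config=async_config
)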
Serverless vs Real-time Trade-offs
| | Serverless | Real-time |
|---|---|---|
| Cold start | 10-30 sec first call | None (always warm) |
| Latency | Higher | Lower (~ms) |
| Cost when idle | $0 | Paying 24/7 |
| High traffic | ❌ | ✅ |
Analogy:
- Serverless = Food truck that parks only when you call (cheap but slow to arrive)
- Real-time = Restaurant that's always open (instant service but paying rent 24/7)
High Throughput + Low Latency Solution
Real-time Endpoints with Auto-Scaling
Analogy: Fleet of Food Trucks
Instead of one truck that shows up when called (serverless), you have a fleet that automatically dispatches more trucks during lunch rush and sends them home when quiet.
# 1. Deploy real-time (not serverless)
predictor = model.deploy(
    initial_instance_count=2,      # Start with 2 instances
    instance_type="ml.m5.large"    # Always-on instance type
)
endpoint_name = predictor.endpoint_name  # needed for the scaling calls below

# 2. Add auto-scaling
import boto3
client = boto3.client("application-autoscaling")

# Register scalable target
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10
)

# Add scaling policy (scale based on invocations)
client.put_scaling_policy(
    PolicyName="scale-on-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300
    }
)
High-Performance Checklist
When asked "How do you handle throughput & latency in SageMaker?", know these:
| Concept | Remember It As... |
|---|---|
| Endpoint Type | How eager is your service? (Always ready / Wake on call / Queue it) |
| Instance Selection | Brains (CPU) vs Muscle (GPU) - match worker to job |
| Scaling Strategy | When to hire/fire more workers |
| Cooldown Periods | Don't panic-hire or panic-fire |
| Model Optimization | Make the model faster, not just more hardware |
Analogy: Restaurant Staffing
Running a high-performance ML service is like managing a restaurant - decide if you're 24/7 or pop-up (endpoint), hire cooks vs dishwashers (instance), know when to call in extra staff (scaling), don't overreact to one busy hour (cooldown), and train your staff to work faster (optimization).
CPU vs GPU Selection
The Confusion
"Why do we need GPU for inference? I thought GPU was only for training."
The Answer
It depends on model size and throughput, not just training vs inference.
| Scenario | CPU | GPU |
|---|---|---|
| Training | ❌ (too slow) | ✅ Always |
| Inference - Small model | ✅ | Overkill |
| Inference - Large model (BERT, GPT) | ❌ (too slow) | ✅ |
| Inference - High batch volume | ❌ | ✅ |
Quick Decision Guide
| Use CPU | Use GPU |
|---|---|
| Traditional ML (XGBoost, RF) | Deep Learning (Transformers, CNNs) |
| Small models | Large models (100M+ params) |
| Low inference volume | High batch throughput |
| Cost-sensitive | Latency-critical |
Simple Rule: If it's a neural network AND (large OR fast) → GPU
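In SageMaker that rule boils down to which instance_type you pass to deploy(); a sketch of both paths (the instance types here are common examples, not a recommendation from the original notes):
# Pick one of the two depending on the decision guide above

# CPU path: traditional ML, small models, low inference volume
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large"     # CPU instance
)

# GPU path: large transformers or high batch throughput
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge"  # single-GPU instance
)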
Analogy: Pizza Kitchen
Even after you've learned to cook (training), making 100 pizzas at once (inference) still needs industrial ovens (GPU). But making one sandwich? A regular kitchen (CPU) works fine!
Quick Reference
Complete Command Sequence
# 1. Install SDK (use v2 for HuggingFace)
pip3 install "sagemaker>=2.0,<3.0"
# 2. Deploy (see Python script above)
python3 sagemaker-test.py
# 3. Test
python3 test-endpoint.py
# 4. Cleanup
aws sagemaker delete-endpoint --endpoint-name <name>
aws sagemaker delete-endpoint-config --endpoint-config-name <name>
aws sagemaker delete-model --model-name <name>
# 5. Verify
aws sagemaker list-endpoints
aws sagemaker list-endpoint-configs
aws sagemaker list-models
All Sticky Analogies
| Concept | Analogy |
|---|---|
| SageMaker Serverless | Food truck that arrives on-demand |
| IAM Role | ID badge / house key for delivery driver |
| HF Hub vs S3 | Ordering from supplier vs specific warehouse |
| SDK v3 vs v2 | New iPhone missing your favorite app |
| Shell quoting | Noisy drive-through vs written order |
| Endpoint deletion | Closing the food truck |
| Real-time + Auto-scaling | Fleet of food trucks |
| CPU vs GPU | Brains vs Muscle / Regular vs Industrial kitchen |
| Cooldown periods | Don't panic-hire or panic-fire |
Key Takeaways
- Always use SageMaker SDK v2.x for HuggingFace models
- Use env={"HF_MODEL_ID": ...} instead of S3 paths for learning
- Always clean up endpoints after testing
- Serverless = cheap but slow / Real-time = fast but expensive
- GPU for inference only for large models or high throughput
What We Accomplished
| Step | Status |
|---|---|
| Created IAM Role (SageMakerExecutionRole) | ✅ |
| Installed SageMaker SDK (v2) | ✅ |
| Deployed serverless HuggingFace model | ✅ |
| Tested sentiment analysis | ✅ |
| Cleaned up all resources | ✅ |
| Total cost | $0.00 |