Getting Started with SageMaker Serverless Endpoints

December 19, 2025
AWS · SageMaker · Machine Learning · MLOps · HuggingFace

🧠 SageMaker Serverless Exploration - Complete Summary

Total Cost: $0.00 🎉

📋 Table of Contents

  1. Architecture Overview
  2. IAM Role Setup
  3. SageMaker SDK Setup
  4. Deploying a Serverless Endpoint
  5. Testing the Endpoint
  6. Cleanup & Cost Management
  7. Production Workflows
  8. High-Performance Options
  9. CPU vs GPU Selection
  10. Quick Reference

💡 Sticky Analogy: Food Truck Service

Think of SageMaker Serverless as ordering a food truck on-demand:

  • IAM Role = Your ID badge proving you're allowed to order
  • HuggingFaceModel = The menu item you're ordering
  • ServerlessInferenceConfig = Delivery preferences (memory, concurrency)
  • model.deploy() = Actually placing the order
  • Endpoint = The food truck arrives and flips the "OPEN" sign

🔐 IAM Role Setup

Why We Need It

SageMaker needs permission to access S3, ECR, and other AWS services on your behalf.

💡 Analogy: Like giving a delivery driver your house key to drop off packages while you're away.

What We Created

  • Role Name: SageMakerExecutionRole
  • ARN: arn:aws:iam::609662024349:role/SageMakerExecutionRole
  • Trust Policy: Allows sagemaker.amazonaws.com to assume the role
  • Permission Policy: AmazonSageMakerFullAccess

Trust Policy JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
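
If you prefer the CLI to the console, a minimal sketch of creating the same role looks like this (assuming the trust policy above is saved locally as trust-policy.json):

# Create the execution role using the trust policy above
aws iam create-role \
    --role-name SageMakerExecutionRole \
    --assume-role-policy-document file://trust-policy.json

# Attach the managed permission policy
aws iam attach-role-policy \
    --role-name SageMakerExecutionRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess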

📦 SageMaker SDK Setup

Version Compatibility Issue

Problem: ModuleNotFoundError: No module named 'sagemaker.huggingface'

Root Cause: SageMaker v3.x restructured modules - HuggingFace integration was removed/moved.

💡 Analogy: App Store Update

Like buying a new iPhone and finding your favorite app hasn't been updated for the new iOS yet. Rolling back to v2 is like using the "classic" version that still has everything built-in.

Solution

pip3 install "sagemaker>=2.0,<3.0"

Version Comparison

Version | HuggingFaceModel | Notes
v3.x    | ❌ Not bundled   | Modular architecture
v2.x    | ✅ Included      | Use this for HuggingFace
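
A quick sanity check that you landed on v2.x and that the HuggingFace integration imports cleanly (a tiny sketch, nothing more):

# check-sdk.py - confirm SDK version and HuggingFace integration
import sagemaker
print(sagemaker.__version__)  # should print 2.x.y

from sagemaker.huggingface import HuggingFaceModel  # raises ModuleNotFoundError on v3.x
print("HuggingFaceModel import OK")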

🚀 Deploying a Serverless Endpoint

Initial Approach (Failed)

Using S3 path directly:

model = HuggingFaceModel(
    model_data="s3://huggingface-sagemaker-models/...",  # ❌ Access denied
    ...
)

Error: ValidationException: Could not access model data at s3://...

💡 Analogy: Supplier vs Warehouse

Instead of giving the delivery truck a specific warehouse address that might be outdated, tell them "order directly from the supplier" (HuggingFace Hub) - always fresh and accessible!

Working Solution

Using HuggingFace Hub directly via environment variable:

# sagemaker-test.py - Working deployment script
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

role = "arn:aws:iam::609662024349:role/SageMakerExecutionRole"

# Use HuggingFace Hub directly instead of S3
model = HuggingFaceModel(
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    role=role,
    env={"HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english"}
)

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=1
)

predictor = model.deploy(serverless_inference_config=serverless_config)
print(f"Endpoint name: {predictor.endpoint_name}")

Deployment Progress

----!
Endpoint name: huggingface-pytorch-inference-2025-12-23-13-07-31-668

What the symbols mean:

  • Each - = Health check in progress
  • ! = Endpoint is ready!

💡 Analogy: The food truck is driving to the location, setting up the kitchen, firing up the grill, and flipping the "OPEN" sign.
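
If you'd rather see an explicit status than count dashes, the CLI reports the same transition (Creating → InService):

# Poll the endpoint status yourself (same endpoint name as printed above)
aws sagemaker describe-endpoint \
    --endpoint-name huggingface-pytorch-inference-2025-12-23-13-07-31-668 \
    --query EndpointStatus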


🧪 Testing the Endpoint

Shell Quoting Lesson Learned

Problem: Inline Python via SSH causes quote escaping nightmares.

๐Ÿ’ก Analogy: Noisy Drive-Through

Instead of shouting a complicated order through a noisy speaker (nested shell quotes), write it on paper first (file), then hand it through the window!

Solution: File Approach

# test-endpoint.py - Inference script
import boto3
import json

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

response = runtime.invoke_endpoint(
    EndpointName="huggingface-pytorch-inference-2025-12-23-13-07-31-668",
    ContentType="application/json",
    Body=json.dumps({"inputs": "I love learning AWS!"})
)

print(json.loads(response["Body"].read().decode()))

Test Results

Input                  | Label    | Score
"I love learning AWS!" | POSITIVE | 99.95%

Raw output: [{"label": "POSITIVE", "score": 0.9995132684707642}]

💡 The model is like a mood detector - it reads the emotional tone of text and tells you whether it's positive or negative, with a confidence percentage.
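
If the predictor object from the deploy script is still in scope, the SDK can make the same call without touching boto3 - a minimal sketch, assuming you're in that same Python session:

# Same inference call via the SDK predictor (JSON serialization by default)
result = predictor.predict({"inputs": "I love learning AWS!"})
print(result)  # [{'label': 'POSITIVE', 'score': ...}]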


🧹 Cleanup & Cost Management

Why Cleanup Matters

💡 Analogy: Closing the Food Truck

The endpoint is like a food truck parked with the "OPEN" sign on. A real-time endpoint charges you just for being ready to serve; a serverless one scales to zero when idle, but leaving unused resources around is still bad hygiene. Deleting = packing up and leaving!

Cleanup Commands

# 1. Delete endpoint (stops billing)
aws sagemaker delete-endpoint --endpoint-name <endpoint-name>

# 2. Delete endpoint config (free, but keeps things clean)
aws sagemaker delete-endpoint-config --endpoint-config-name <config-name>

# 3. Delete model (free, but keeps things clean)
aws sagemaker delete-model --model-name <model-name>

# 4. Verify everything is gone
aws sagemaker list-endpoints           # Should be empty
aws sagemaker list-endpoint-configs    # Should be empty
aws sagemaker list-models              # Should be empty
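
The same cleanup is also available from Python if the predictor is still around - a small sketch using the SDK's delete helpers (by default delete_endpoint removes the endpoint config as well):

# SDK equivalent of the CLI cleanup above
predictor.delete_endpoint()  # deletes endpoint + endpoint config, stops billing
predictor.delete_model()     # deletes the model registration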

What Are These Resources?

Resource        | What It Is                             | Cost       | Analogy
Endpoint        | Running inference service              | 💰 Charges | The food truck serving
Endpoint Config | Blueprint for endpoint setup           | Free       | Recipe card
Model           | Registration record pointing to model  | Free       | Catalog entry

AWS Billing Note

AWS billing has a 6-24 hour delay. Charges may not appear immediately, but for a few test calls, expect fractions of a penny.


๐Ÿญ Production Workflows

Learning vs Production

Stage Source Code
Learning HuggingFace Hub env={"HF_MODEL_ID": "..."}
Production Your S3 bucket model_data="s3://your-bucket/model.tar.gz"

Production Flow

Train model locally/SageMaker
        ↓
Save/export model (model.tar.gz)
        ↓
Upload to YOUR S3 bucket
        ↓
Deploy from S3

💡 Analogy: Restaurant vs Home Cooking

  • Today: Ordered pre-made dish from restaurant (HuggingFace Hub)
  • Production: Cook your own recipe, package it, store in your pantry (S3), serve from there
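
A hedged sketch of that flow - the bucket name, local model directory, and S3 key here are hypothetical placeholders, not ones used in this walkthrough:

# 1. Package the trained model in the layout the HuggingFace container expects
tar -czf model.tar.gz -C my-trained-model .

# 2. Upload to YOUR S3 bucket
aws s3 cp model.tar.gz s3://your-bucket/models/sentiment/model.tar.gz

Then point model_data at that object instead of setting HF_MODEL_ID:

# 3. Deploy from S3 (same class, role, and serverless config as before)
model = HuggingFaceModel(
    model_data="s3://your-bucket/models/sentiment/model.tar.gz",
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    role=role,
)
predictor = model.deploy(serverless_inference_config=serverless_config)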

🚀 High-Performance Options

Endpoint Type Comparison

Type       | Behavior                           | Best For
Serverless | Spins up on-demand, scales to zero | Low traffic, cost-sensitive, dev/test
Real-time  | Instance runs 24/7                 | High throughput, low latency, production
Async      | Queue-based, for long jobs         | Large payloads, batch processing
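
Async wasn't part of this exploration, but for completeness, a rough sketch of the queue-based option (the S3 output path is a hypothetical placeholder):

# Async endpoint: requests are queued and results are written to S3
# (assumes the HuggingFaceModel object from earlier)
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://your-bucket/async-results/"  # hypothetical bucket
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    async_inference_config=async_config
)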

Serverless vs Real-time Trade-offs

               | Serverless           | Real-time
Cold start     | 10-30 sec first call | None (always warm)
Latency        | Higher               | Lower (~ms)
Cost when idle | $0                   | Paying 24/7
High traffic   | ❌                   | ✅

💡 Analogy:

  • Serverless = Food truck that parks only when you call (cheap but slow to arrive)
  • Real-time = Restaurant that's always open (instant service but paying rent 24/7)

High Throughput + Low Latency Solution

Real-time Endpoints with Auto-Scaling

💡 Analogy: Fleet of Food Trucks

Instead of one truck that shows up when called (serverless), you have a fleet that automatically dispatches more trucks during lunch rush and sends them home when quiet.

# 1. Deploy real-time (not serverless)
predictor = model.deploy(
    initial_instance_count=2,      # Start with 2 instances
    instance_type="ml.m5.large"    # Always-on instance type
)

# 2. Add auto-scaling
import boto3

endpoint_name = predictor.endpoint_name
client = boto3.client("application-autoscaling")

# Register scalable target
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10
)

# Add scaling policy (scale based on invocations)
client.put_scaling_policy(
    PolicyName="scale-on-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300
    }
)

High-Performance Checklist 📋

When asked "How do you handle throughput & latency in SageMaker?", know these:

Concept            | Remember It As...
Endpoint Type      | How eager is your service? (Always ready / Wake on call / Queue it)
Instance Selection | Brains (CPU) vs Muscle (GPU) - match worker to job
Scaling Strategy   | When to hire/fire more workers
Cooldown Periods   | Don't panic-hire or panic-fire
Model Optimization | Make the model faster, not just more hardware

💡 Analogy: Restaurant Staffing

Running a high-performance ML service is like managing a restaurant - decide if you're 24/7 or pop-up (endpoint), hire cooks vs dishwashers (instance), know when to call in extra staff (scaling), don't overreact to one busy hour (cooldown), and train your staff to work faster (optimization).


🧠 CPU vs GPU Selection

The Confusion

"Why do we need GPU for inference? I thought GPU was only for training."

The Answer

It depends on model size and throughput, not just training vs inference.

Scenario                            | CPU           | GPU
Training                            | ❌ (too slow) | ✅ Always
Inference - Small model             | ✅            | Overkill
Inference - Large model (BERT, GPT) | ❌ (too slow) | ✅
Inference - High batch volume       | ❌            | ✅

Quick Decision Guide

Use CPU                      | Use GPU
Traditional ML (XGBoost, RF) | Deep Learning (Transformers, CNNs)
Small models                 | Large models (100M+ params)
Low inference volume         | High batch throughput
Cost-sensitive               | Latency-critical

Simple Rule: If it's a neural network AND (large OR fast) → GPU
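
In SageMaker this mostly comes down to the instance_type passed to deploy() - a sketch reusing the HuggingFaceModel from earlier (ml.g4dn.xlarge is one common GPU inference type; pick whatever fits your model and budget). Note that serverless endpoints run on CPU, so choosing GPU implies a real-time or async endpoint:

# CPU instance - fine for small models or low traffic
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# GPU instance - for large transformers or high batch throughput
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")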

💡 Analogy: Pizza Kitchen

Even after you've learned to cook (training), making 100 pizzas at once (inference) still needs industrial ovens (GPU). But making one sandwich? A regular kitchen (CPU) works fine!


📚 Quick Reference

Complete Command Sequence

# 1. Install SDK (use v2 for HuggingFace)
pip3 install "sagemaker>=2.0,<3.0"

# 2. Deploy (see Python script above)
python3 sagemaker-test.py

# 3. Test
python3 test-endpoint.py

# 4. Cleanup
aws sagemaker delete-endpoint --endpoint-name <name>
aws sagemaker delete-endpoint-config --endpoint-config-name <name>
aws sagemaker delete-model --model-name <name>

# 5. Verify
aws sagemaker list-endpoints
aws sagemaker list-endpoint-configs
aws sagemaker list-models

All Sticky Analogies

Concept                  | Analogy
SageMaker Serverless     | Food truck that arrives on-demand
IAM Role                 | ID badge / house key for delivery driver
HF Hub vs S3             | Ordering from supplier vs specific warehouse
SDK v3 vs v2             | New iPhone missing your favorite app
Shell quoting            | Noisy drive-through vs written order
Endpoint deletion        | Closing the food truck
Real-time + Auto-scaling | Fleet of food trucks
CPU vs GPU               | Brains vs Muscle / Regular vs Industrial kitchen
Cooldown periods         | Don't panic-hire or panic-fire

Key Takeaways

  1. Always use SageMaker SDK v2.x for HuggingFace models
  2. Use env={"HF_MODEL_ID": ...} instead of S3 paths for learning
  3. Always clean up endpoints after testing
  4. Serverless = cheap but slow / Real-time = fast but expensive
  5. GPU for inference only for large models or high throughput

✅ What We Accomplished

Step                                       | Status
Created IAM Role (SageMakerExecutionRole)  | ✅
Installed SageMaker SDK (v2)               | ✅
Deployed serverless HuggingFace model      | ✅
Tested sentiment analysis                  | ✅
Cleaned up all resources                   | ✅
Total cost                                 | $0.00 🎉