# ML architecture on AWS

Use this when you need production ML on AWS.

### End-to-end ML pipeline on AWS

When I design an end-to-end ML pipeline on AWS, I start by mapping the complete data flow from raw ingestion through to production inference. The architecture typically begins with data landing in S3, which serves as the central storage layer due to its durability and cost-effectiveness. For streaming data, I implement Kinesis Data Streams or Kinesis Data Firehose, depending on throughput requirements and whether I need real-time processing capabilities.

The data processing layer uses a combination of AWS Glue for ETL jobs when dealing with structured transformations, and EMR with Spark for more complex feature engineering at scale. For clients with existing Databricks investments, I integrate Databricks workflows that leverage their unified analytics platform for distributed data processing. I structure this processing to write intermediate results back to S3 in optimized formats like Parquet or ORC, which significantly improves read performance during training while reducing storage costs. The key insight here is that proper data partitioning at this stage dramatically impacts downstream performance.
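The partitioning point deserves a concrete shape. A minimal sketch of a Hive-style partitioned S3 layout (the `dt=`/`region=` partition columns and the `features/v1` prefix are illustrative; real pipelines partition on whatever columns downstream queries filter by most often):

```python
from datetime import date

def partitioned_key(prefix: str, dt: date, region: str, filename: str) -> str:
    """Build a Hive-style partitioned S3 key so Glue, Athena, and Spark
    can prune partitions instead of scanning the whole prefix."""
    return f"{prefix}/dt={dt.isoformat()}/region={region}/{filename}"

key = partitioned_key("features/v1", date(2024, 6, 1), "eu-west-1", "part-0000.parquet")
# "features/v1/dt=2024-06-01/region=eu-west-1/part-0000.parquet"
```

Readers that understand this layout (Spark, Athena, Glue crawlers) skip entire partitions when a query filters on `dt` or `region`, which is where the training-time read speedup comes from.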

For training orchestration, I leverage SageMaker Training Jobs when the workload fits standard patterns, but I am not dogmatic about this choice. Sometimes raw EC2 instances with spot pricing make more economic sense, particularly for experimental workloads where interruption tolerance is acceptable. I implement comprehensive experiment tracking using SageMaker Experiments or MLflow, ensuring that every training run logs hyperparameters, metrics, and artifacts in a queryable format.
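For reference, this is roughly what a Training Job request looks like as a boto3 payload. A sketch with placeholder names, bucket, role, and hyperparameters (none of these values are from a real account); passing the dict to `boto3.client("sagemaker").create_training_job(**request)` launches the job:

```python
# Sketch of a create_training_job request; all ARNs, URIs, and names below
# are placeholders, not real resources.
request = {
    "TrainingJobName": "churn-xgb-20240601-001",
    "AlgorithmSpecification": {
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "InputDataConfig": [{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-ml-bucket/processed/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://my-ml-bucket/artifacts/"},
    "ResourceConfig": {"InstanceType": "ml.m5.2xlarge",
                       "InstanceCount": 1,
                       "VolumeSizeInGB": 50},
    "StoppingCondition": {"MaxRuntimeInSeconds": 24 * 3600},
    # The API requires hyperparameter values to be strings.
    "HyperParameters": {"max_depth": "6", "eta": "0.2"},
}
```

The same dict, checked into version control alongside the training code, doubles as part of the experiment record.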

Model versioning happens through SageMaker Model Registry, where I maintain not just the model artifacts but also approval workflows and deployment metadata. The inference layer typically uses SageMaker Real-time Endpoints for synchronous predictions with autoscaling enabled, though I frequently implement Lambda for simple models where cold start latency is acceptable. For high-throughput batch inference, I use SageMaker Batch Transform Jobs or coordinate EC2 spot instances through AWS Batch, depending on specific latency and cost requirements.

Monitoring runs across the entire pipeline using CloudWatch metrics, with custom dashboards tracking both infrastructure health and model performance metrics like prediction latency, throughput, and accuracy. I set up SNS notifications for anomaly detection, ensuring that any pipeline degradation triggers immediate investigation. The entire architecture is codified using CloudFormation or Terraform, making it reproducible and version-controlled.
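As an example of wiring latency alerting to SNS, here is a sketch of a CloudWatch alarm on the endpoint's p99 model latency (the topic ARN and endpoint name are placeholders); the dict goes to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`:

```python
# Sketch of a p99 latency alarm; SageMaker reports ModelLatency in
# microseconds, so 100_000 corresponds to a 100 ms threshold.
alarm = {
    "AlarmName": "inference-p99-latency-high",
    "Namespace": "AWS/SageMaker",
    "MetricName": "ModelLatency",
    "Dimensions": [{"Name": "EndpointName", "Value": "prod-endpoint"},
                   {"Name": "VariantName", "Value": "AllTraffic"}],
    "ExtendedStatistic": "p99",          # percentile, not just the average
    "Period": 60,
    "EvaluationPeriods": 3,              # require 3 consecutive bad minutes
    "Threshold": 100_000.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ml-oncall"],
}
```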

### SageMaker Training vs Processing vs EC2

SageMaker Training Jobs provide a fully managed environment optimized specifically for model training. When I use them, I benefit from built-in support for distributed training, automatic model artifact uploading to S3, and seamless integration with SageMaker Experiments for tracking. The pricing model charges per second of compute time, which works well for predictable training workloads. I choose Training Jobs when I need straightforward distributed training across multiple instances, when the workload fits standard framework containers like PyTorch or TensorFlow, or when integration with the broader SageMaker ecosystem justifies the slight cost premium over raw compute.

SageMaker Processing Jobs serve a different purpose entirely. These are designed for data preprocessing, feature engineering, and model evaluation rather than training. I use Processing Jobs when I need to run data transformation scripts at scale before training begins, or when performing batch scoring for model validation. The main advantage here is the ability to spin up large clusters temporarily just for processing, then terminate them immediately, avoiding any idle resource costs.

EC2 instances give me maximum flexibility but require more infrastructure management. I choose EC2 when I need custom configurations that do not fit SageMaker's patterns, when I am running long-lived training experiments that benefit from persistent instances, or when cost optimization through spot instances is the primary concern. With EC2, I can implement custom training loops, experiment with cutting-edge libraries that might not yet have SageMaker container support, or run complex multi-stage pipelines that do not map cleanly to SageMaker's job-based model.

The decision matrix comes down to several factors. For standard training workloads under 24 hours with well-supported frameworks, SageMaker Training Jobs typically win on operational simplicity. For large-scale data processing that needs to scale out temporarily, Processing Jobs make the most sense. For experimental research, long-running training that exceeds several days, or workloads requiring deep customization, EC2 instances with spot pricing often provide the best combination of flexibility and cost efficiency. In practice, most of my production pipelines use a hybrid approach, leveraging each service where it provides maximum value.
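The matrix above can be captured as a toy routing function. A sketch only, with thresholds taken from the paragraph; real decisions also weigh team skills, budget, and existing tooling:

```python
def choose_compute(task: str, hours: float, custom_stack: bool) -> str:
    """Toy encoding of the decision matrix: processing workloads go to
    Processing Jobs; long or heavily customized training goes to EC2
    spot; everything else defaults to SageMaker Training Jobs."""
    if task == "processing":
        return "sagemaker-processing"
    if custom_stack or hours > 24:
        return "ec2-spot"
    return "sagemaker-training"

assert choose_compute("training", 6, False) == "sagemaker-training"
assert choose_compute("training", 72, False) == "ec2-spot"
assert choose_compute("processing", 2, False) == "sagemaker-processing"
```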

### Multi-region inference with failover

Setting up multi-region inference requires careful planning around data residency, latency requirements, and consistency models. I begin by deploying the same model to SageMaker endpoints in at least two geographically separated regions, ensuring that both deployments use identical model artifacts and container configurations. The model artifacts themselves are replicated across regions using S3 cross-region replication, which provides automatic and transparent data synchronization.

The traffic routing layer sits behind Route 53, where I configure health checks that actively monitor each regional endpoint by sending test inference requests every few seconds. These health checks verify not just that the endpoint responds, but that it responds with acceptable latency and valid predictions. If a health check fails in one region, Route 53 automatically redirects traffic to healthy regions based on configurable routing policies. I typically use latency-based routing under normal conditions to send users to their closest region, with automatic failover to geographically distant regions only when necessary.
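For the failover half of that setup, this is roughly what the Route 53 record pair looks like (domain, endpoints, and health-check IDs are placeholders); each record is submitted inside a `ChangeBatch` via `boto3.client("route53").change_resource_record_sets`:

```python
# Sketch of a PRIMARY/SECONDARY failover pair; all names and IDs below
# are placeholders. A low TTL keeps failover propagation fast.
primary = {
    "Name": "inference.example.com.",
    "Type": "CNAME",
    "SetIdentifier": "us-east-1",
    "Failover": "PRIMARY",
    "TTL": 30,
    "HealthCheckId": "hc-primary-placeholder",
    "ResourceRecords": [{"Value": "api.us-east-1.example.com"}],
}
secondary = {**primary,
             "SetIdentifier": "eu-west-1",
             "Failover": "SECONDARY",
             "HealthCheckId": "hc-secondary-placeholder",
             "ResourceRecords": [{"Value": "api.eu-west-1.example.com"}]}
```

Latency-based routing for normal operation uses the same record structure with a `Region` field in place of `Failover`.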

For stateful components like feature stores or real-time feature computation, I use DynamoDB Global Tables, which provide active-active replication with conflict resolution. For clients with existing database infrastructure, I integrate with PostgreSQL using read replicas across regions, MySQL with master-master replication for specific use cases, or MongoDB with replica sets for document-based feature storage. The database choice depends on access patterns, consistency requirements, and existing infrastructure investments. This ensures that any region can serve inference requests using locally available feature data, eliminating cross-region dependencies that could introduce latency or failure points. When real-time feature computation is required, I deploy Lambda functions in each region that can compute features locally before invoking the inference endpoint.

The monitoring and alerting system must operate across regions. I aggregate CloudWatch metrics from all regions into a central monitoring account, using cross-account metric publishing. This provides a unified view of system health while allowing region-specific alarms to trigger appropriate responses. The failover system includes automated runbooks in Systems Manager that can quickly redirect traffic, scale up capacity in healthy regions, or trigger incident response procedures.

Testing the failover system is critical. I regularly conduct chaos engineering exercises where I deliberately fail entire regional deployments to verify that traffic shifts smoothly, that performance remains acceptable under degraded conditions, and that monitoring systems correctly detect and alert on the failure. These tests also validate that cost management remains effective even when operating in degraded mode with reduced geographic distribution.

### Blue-green deployments for SageMaker endpoints

Blue-green deployment for ML models requires maintaining two complete production environments and shifting traffic between them in a controlled manner. In SageMaker, I implement this using endpoint configurations and traffic splitting capabilities. The process begins with the current production model running on a SageMaker endpoint, which represents the blue environment serving 100% of production traffic.

When I am ready to deploy a new model version, I create a new endpoint configuration that includes both the existing blue variant and the new green variant. Initially, the green variant receives zero traffic while I verify its functionality through direct invocation, using the InvokeEndpoint API's TargetVariant parameter to send test requests straight to the green variant in the production environment. This allows me to validate that the new model loads correctly, responds within acceptable latency bounds, and produces reasonable predictions without affecting any live traffic.

The gradual traffic shift happens through endpoint configuration updates. I start by routing a small percentage of traffic to the green variant, typically 5% initially, while carefully monitoring prediction quality, latency, and error rates through CloudWatch dashboards. If metrics remain healthy after a predetermined observation period, I incrementally increase the traffic percentage to the green variant in steps of 10-20%, allowing time for metric validation at each step.
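The stepped shift can be driven by a small helper that emits the weight schedule. A sketch with illustrative percentages; each step would then be applied with the SageMaker `update_endpoint_weights_and_capacities` call, followed by the observation period:

```python
def shift_schedule(start: int = 5, step: int = 20) -> list[int]:
    """Canary-style weight schedule for the green variant: a small
    initial slice, then fixed increments capped at 100%."""
    weights = [start]
    while weights[-1] < 100:
        weights.append(min(weights[-1] + step, 100))
    return weights

print(shift_schedule())  # [5, 25, 45, 65, 85, 100]
```

Each emitted weight becomes the green variant's `DesiredWeight` (with blue receiving the remainder), so rollback at any point is just re-submitting the previous pair of weights.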

During this process, I monitor not just infrastructure metrics but also business metrics when available. Latency distributions, prediction confidence scores, and any business-specific KPIs all feed into the decision to continue or rollback. The advantage of this approach is that rollback remains trivial at any point by simply adjusting the traffic split back to 100% blue. The old model version stays warm and ready to handle traffic throughout the entire deployment process.

Once the green variant handles 100% of traffic successfully for a validation period, typically 24-48 hours in production, I consider the deployment complete. However, I do not immediately delete the blue variant. Instead, I maintain it in a scaled-down configuration for at least a week, providing an instant rollback path if any delayed issues emerge. Only after this extended stability period do I fully decommission the old model version.

The entire deployment process is automated through CI/CD pipelines using AWS CodePipeline and CodeDeploy, with manual approval gates at critical percentage thresholds. This automation ensures consistency across deployments while maintaining human oversight at decision points where business judgment matters.

### Sub-100ms real-time inference

Achieving sub-100ms latency requires aggressive optimization at every layer of the inference stack. I start by selecting instance types with the right balance of CPU, memory, and network bandwidth for the specific workload. For transformer-based models, I typically use GPU instances like g4dn or p4d, though for smaller models, CPU instances with AVX-512 support often provide better cost-performance at lower scale.

Model optimization begins before deployment. I apply quantization techniques, often using int8 or mixed precision representations that maintain accuracy while dramatically reducing memory bandwidth requirements. For many models, I can achieve 2-4x speedup through quantization alone. I also implement model distillation when appropriate, training smaller student models that approximate larger teacher models with significantly reduced computational requirements.
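To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in pure Python. Real toolchains (PyTorch, TensorRT, ONNX Runtime) do this per-channel with calibration data, but the core mapping is the same:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: map [-max|w|, +max|w|] onto
    [-127, 127] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from int8 codes."""
    return [x * scale for x in q]

q, s = quantize_int8([0.4, -1.0, 0.25])
# q == [51, -127, 32]; each weight now needs 1 byte instead of 4
```

The 4x reduction in bytes moved per weight is where the memory-bandwidth savings, and much of the speedup, comes from.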

The inference runtime matters enormously. For GPU inference, I frequently use vLLM, which provides exceptional throughput through continuous batching and PagedAttention. vLLM consistently outperforms standard framework inference by 10-20x on throughput while maintaining low latency. For traditional models, I use TensorRT when working with NVIDIA hardware, as it provides substantial performance improvements through layer fusion and kernel auto-tuning. For CPU inference, I leverage ONNX Runtime with appropriate optimization flags, which consistently outperforms standard framework inference paths. The choice of runtime can easily make the difference between meeting and missing latency targets.

The serving layer typically uses FastAPI for building production inference APIs due to its async capabilities, automatic OpenAPI documentation, and excellent performance characteristics. FastAPI's dependency injection system simplifies implementing auth, rate limiting, and request validation while maintaining clean, maintainable code. The framework's native async support enables handling high concurrency without threading overhead.

Batching strategy requires careful tuning. While larger batches improve throughput, they increase latency for any individual request waiting in the batch. I implement dynamic batching with maximum wait times configured based on the latency budget, typically setting maximum batch wait times to 20-30ms when targeting sub-100ms total latency. This requires careful load testing to find the sweet spot where throughput maximizes without violating latency constraints.
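The dynamic-batching loop itself is small. A stdlib sketch (the 25 ms default mirrors the budget above: batching spends at most a quarter of a 100 ms budget, leaving the rest for inference and transport):

```python
import queue
import time

def collect_batch(q, max_size: int = 8, max_wait_s: float = 0.025):
    """Dynamic batching: block for the first request, then wait at most
    max_wait_s for more, capping the batch at max_size. Returns early
    if the queue goes quiet before the deadline."""
    batch = [q.get()]                       # block until one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Tuning `max_size` and `max_wait_s` against load-test traces is how you find the throughput/latency sweet spot the paragraph describes.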

Network optimization cannot be overlooked. I deploy endpoints in the same VPC as calling services to avoid internet gateway latency, use VPC endpoints for AWS service calls to reduce network hops, and enable Enhanced Networking on EC2 instances for lower latency and higher packet-per-second performance. For extremely latency-sensitive applications, I use placement groups to ensure physical proximity of instances, though this sacrifices some fault tolerance.

Caching provides another critical optimization layer. I implement feature caching in ElastiCache when features can be precomputed or reused across predictions, and I cache model outputs when appropriate for deterministic models. This can eliminate the inference path entirely for repeated requests, though cache invalidation strategies must align with model update cadences.

Monitoring sub-100ms latency requires high-resolution metrics. I use CloudWatch high-resolution metrics at 1-second granularity, track latency percentiles not just averages, and implement detailed logging that captures the breakdown of latency across each pipeline stage. This visibility allows rapid diagnosis when latency degrades, identifying whether the issue lies in model inference, feature retrieval, network transport, or other components.

### AWS Lambda for ML inference

Lambda provides an attractive serverless option for ML inference when workloads fit within its constraints. I have deployed numerous models on Lambda, particularly for sporadic inference workloads where maintaining always-on infrastructure would be wasteful. The cold start challenge is real but manageable through several strategies. I use provisioned concurrency for latency-critical applications, effectively pre-warming Lambda instances to eliminate cold starts entirely at the cost of continuous billing for the provisioned capacity.

The memory and timeout limitations require careful consideration. Lambda functions max out at 10GB of memory and 15 minutes of execution time. For many inference workloads, particularly with smaller models under 1-2GB, these limits prove sufficient. I work within the memory constraint by using heavily quantized models and optimized inference runtimes. When models exceed Lambda's capacity, I implement model splitting strategies where I break larger models into components that can execute in separate Lambda functions, though this introduces orchestration complexity.

Package size restrictions present another challenge. The 50MB compressed deployment package limit and 250MB uncompressed size mean that model artifacts and dependencies must fit within tight bounds. I handle this by storing model weights in S3 and downloading them during initialization, caching them in /tmp for reuse across warm invocations. For larger models, I use Lambda layers to package dependencies separately, maximizing the available space for model artifacts.
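The warm-invocation caching pattern looks like this. A sketch where `loader` is a stand-in for the real download-from-S3-then-deserialize step (a hypothetical helper, not an AWS API):

```python
_MODEL = None  # module-level: survives warm invocations in the same sandbox

def get_model(loader, path: str = "/tmp/model.bin"):
    """Load the model once per Lambda sandbox. Cold starts pay the
    download and deserialization (abstracted behind `loader`); warm
    invocations reuse the cached module-level object."""
    global _MODEL
    if _MODEL is None:
        _MODEL = loader(path)
    return _MODEL
```

In the real handler, `loader` would fetch the weights from S3 into `/tmp` (which is itself limited, 512 MB by default) before deserializing, and the handler body calls `get_model` on every invocation, relying on the cache for warm ones.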

Lambda particularly excels for asynchronous inference workloads. I frequently use Lambda with SQS queues for batch prediction tasks where immediate response is not required. The SQS queue buffers requests, Lambda scales automatically to process them in parallel, and results write back to S3 or DynamoDB. This pattern handles variable workloads elegantly without requiring capacity planning or infrastructure management.

The cost model favors Lambda for low-traffic inference. When request volumes remain under a few hundred per day, Lambda's pay-per-invocation pricing beats maintaining even a small SageMaker endpoint. However, the cost equation flips quickly as traffic increases. I typically find that beyond roughly 100,000 invocations per month, dedicated inference infrastructure becomes more economical, though the exact crossover point depends on the specific workload and latency requirements.
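The crossover can be estimated with a back-of-the-envelope cost model. A sketch assuming us-east-1 list prices at the time of writing (verify against the current pricing page) and ignoring the free tier and provisioned concurrency, both of which shift the answer:

```python
def lambda_monthly_cost(invocations: int, ms_per_call: float, gb_memory: float,
                        price_per_gb_s: float = 0.0000166667,
                        price_per_req: float = 0.0000002) -> float:
    """Rough Lambda bill: GB-seconds of compute plus request charges."""
    gb_seconds = invocations * (ms_per_call / 1000.0) * gb_memory
    return gb_seconds * price_per_gb_s + invocations * price_per_req

def endpoint_monthly_cost(hourly_rate: float = 0.115) -> float:
    """Always-on endpoint cost for a hypothetical small-instance rate."""
    return hourly_rate * 24 * 30

cost = lambda_monthly_cost(100_000, ms_per_call=200, gb_memory=2)
# roughly $0.69/month under these assumptions
```

Dividing the endpoint's monthly cost by the marginal per-invocation cost gives the break-even invocation count for a given duration and memory size; provisioned concurrency for cold-start elimination adds a continuous charge that pulls the crossover much lower.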

### Continuous retraining pipeline

Continuous model retraining requires orchestrating data ingestion, preprocessing, training, evaluation, and deployment in an automated pipeline that responds to new data availability. I structure this using Step Functions to coordinate the workflow, as it provides visual monitoring, error handling, and the ability to implement complex conditional logic.

The pipeline begins with EventBridge rules that trigger when new data arrives in S3. These events kick off the Step Functions workflow, passing relevant metadata about the newly available data. The first stage validates data quality using Lambda functions that check schema compliance, value distributions, and data volume thresholds. Only when validation passes does the pipeline proceed to preprocessing.

Preprocessing runs as a SageMaker Processing Job that reads raw data from S3, applies feature engineering transformations, performs train-test splitting, and writes processed datasets back to S3 in optimized formats. I maintain versioning for processed datasets, ensuring reproducibility and enabling comparison of models trained on different data versions. The processing job also computes and logs data statistics that inform subsequent training decisions.

Training itself uses SageMaker Training Jobs configured through parameters determined by the Step Functions workflow. I implement hyperparameter optimization through integration with SageMaker Automatic Model Tuning when the retraining cycle allows sufficient time, though for faster iteration I often use previously optimized hyperparameters with minor adjustments based on data drift metrics. The training job saves model artifacts to S3 with comprehensive metadata about the training data version, hyperparameters, and resulting metrics.

Model evaluation happens automatically after training completes. I run a SageMaker Processing Job that loads both the new model and the current production model, evaluates them on a holdout test set, and compares their performance across multiple metrics. The evaluation job writes detailed comparison reports to S3 and publishes summary metrics to CloudWatch. The Step Functions workflow implements conditional logic that determines whether the new model should proceed to deployment based on these comparison metrics.
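That conditional gate is a Choice state in the Amazon States Language. A sketch with illustrative state names and metric paths; the dict slots into the state machine definition alongside the training and evaluation steps:

```python
# Sketch of the deployment gate: promote only if the candidate beats the
# production model on the comparison metric written by the evaluation job.
gate = {
    "Type": "Choice",
    "Choices": [{
        "Variable": "$.evaluation.new_model_auc",
        "NumericGreaterThanPath": "$.evaluation.prod_model_auc",
        "Next": "RegisterModel",
    }],
    "Default": "NotifyRejected",
}
```

Multi-metric criteria become additional entries under `Choices` (or an `And` combinator), keeping the promotion logic declarative and visible in the Step Functions console.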

When the new model meets deployment criteria, the workflow triggers model registration in SageMaker Model Registry with approval status pending. If I have configured automatic deployment for this pipeline, the workflow proceeds to update the production endpoint using the blue-green deployment strategy described earlier. If manual approval is required, the workflow sends an SNS notification and pauses until an operator approves or rejects the deployment.

The entire pipeline logs comprehensive audit trails of each execution, maintaining records of which data trained which models, what evaluation metrics resulted, and what deployment decisions were made. This audit trail proves invaluable for debugging when model performance degrades and for regulatory compliance in domains requiring explainability of model updates.

### Model versioning and experiment tracking

Model versioning and experiment tracking form the foundation of reproducible machine learning operations. I use SageMaker Model Registry as the central repository for model versions, treating it as the source of truth for what models exist, their deployment status, and their approval workflow state. Every model that trains, whether through manual experimentation or automated retraining, gets registered with comprehensive metadata.

The registration process captures not just the model artifacts themselves but also the complete context needed to reproduce or understand the model. This includes the training data version, the exact code version used for training, all hyperparameters, training metrics, evaluation metrics, and links to the experiment tracking system where full training logs reside. I implement this registration automatically through training job completion hooks, ensuring that manual registration never gets forgotten.

For experiment tracking during development, I use MLflow deployed on EC2 with backing storage in RDS and S3. MLflow provides the flexibility to track arbitrary metrics, parameters, and artifacts while offering a user-friendly interface for comparing experiments. Development workflows typically use Jupyter notebooks for rapid prototyping and exploratory analysis, with production code refactored into Python modules and scripts. I integrate MLflow with SageMaker Training Jobs through custom training scripts that log to both systems, providing unified tracking across local development and cloud training.

Version control extends beyond models to include datasets, code, and infrastructure. I maintain dataset versions in S3 with appropriate metadata tags, use Git for all code versioning with tags marking production releases, and version infrastructure as code through Terraform or CloudFormation tracked in Git. The combination of these versioning systems provides complete reproducibility. Given any model version, I can trace back to the exact dataset version, code commit, and infrastructure configuration that produced it.

The challenge comes in managing the proliferation of versions over time. I implement retention policies that automatically clean up old experiment artifacts while preserving metadata about their existence. For models, I maintain indefinitely the currently deployed production model plus the previous two versions to enable quick rollback. Older versions get archived to Glacier unless they represent significant milestones worth preserving for historical analysis.

Integration between versioning systems happens through consistent use of identifiers. Every training run gets assigned a unique identifier that appears in MLflow experiments, SageMaker model names, Git tags, and CloudWatch log groups. This identifier provides the thread connecting all artifacts related to a particular experiment, making it possible to reconstruct the complete picture of any model version even months or years after creation.
