# Debugging and problem solving

### Reducing inference latency in production

I encountered a latency challenge with a production text classification system serving recommendations in a content platform. The system needed to classify thousands of documents per second with p95 latency under 50ms, but was consistently exceeding 100ms, causing noticeable delays in user experience.

My initial investigation focused on profiling the inference pipeline to identify bottlenecks. I instrumented the code to measure time spent in different stages including input preprocessing, model inference, and output formatting. The profiling revealed that model inference itself consumed about 40ms, but preprocessing added another 50ms, and there was an additional 20ms overhead from Python's GIL contention in the serving framework.
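The per-stage instrumentation can be sketched as a small timing context manager; the stage names and the nearest-rank p95 helper are illustrative, not the production code:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_timings = defaultdict(list)

@contextmanager
def timed(stage):
    """Record wall-clock time spent in a pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage].append((time.perf_counter() - start) * 1000.0)

def p95(samples):
    """Nearest-rank p95 over the recorded samples."""
    ordered = sorted(samples)
    return ordered[max(0, int(len(ordered) * 0.95) - 1)]

# Usage: wrap each stage of a request, e.g.
#   with timed("preprocess"): features = preprocess(doc)
#   with timed("inference"):  logits = model(features)
```

Aggregating these per-stage samples is what surfaced the 40ms/50ms/20ms split described above.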

The preprocessing bottleneck stemmed from tokenization and feature extraction implemented in pure Python. I rewrote these using vectorized NumPy code, with custom Cython for the hottest paths. This reduced preprocessing time from 50ms to about 8ms, more than a sixfold improvement.
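As a hedged illustration of the NumPy half of that rewrite (the Cython portions are omitted, and the hashed character n-gram feature is an example, not the production feature set), compare a per-n-gram Python loop with a windowed, vectorized variant:

```python
import numpy as np

def char_ngram_hashes_loop(text, n=3, buckets=2**20):
    # Pure-Python baseline: one hash per n-gram, built in a loop.
    return [hash(text[i:i + n]) % buckets for i in range(len(text) - n + 1)]

def char_ngram_hashes_vectorized(text, n=3, buckets=2**20):
    # Vectorized variant: view the bytes as overlapping windows and
    # compute a polynomial rolling hash column-by-column in NumPy.
    data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    if data.size < n:
        return np.empty(0, dtype=np.int64)
    windows = np.lib.stride_tricks.sliding_window_view(data, n).astype(np.int64)
    base = 257
    h = np.zeros(windows.shape[0], dtype=np.int64)
    for j in range(n):  # n is tiny, so this short loop is cheap
        h = (h * base + windows[:, j]) % buckets
    return h
```

The vectorized version does the per-character work inside NumPy's C loops instead of the Python interpreter, which is where the bulk of the speedup comes from.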

For the model inference optimization, I experimented with several approaches. First, I quantized the model from FP32 to INT8 using post-training quantization. This reduced model size by 75% and inference time by about 40%, bringing average inference from 40ms down to 24ms. The accuracy degradation was minimal, less than 0.5%, which proved acceptable for this application.
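The arithmetic behind post-training INT8 quantization can be shown in a few lines; this is a minimal per-tensor symmetric sketch, not the framework tooling actually used:

```python
import numpy as np

def quantize_int8(w):
    """Map FP32 weights to INT8 with a per-tensor symmetric scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Storing one byte per weight instead of four is where the 75% size reduction comes from; the rounding error per weight is bounded by half the scale, which is why accuracy loss stayed small.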

I also implemented dynamic batching with a 5ms timeout. Rather than processing requests individually, the serving framework accumulated requests into small batches, processing them together on GPU. This improved throughput substantially but introduced a latency-throughput tradeoff: an individual request could wait up to the 5ms timeout before its batch was dispatched.
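The batching policy itself is simple; this single-threaded sketch (class and parameter names are illustrative) shows the two dispatch conditions, batch full or oldest request timed out:

```python
import time
from collections import deque

class DynamicBatcher:
    def __init__(self, max_batch=32, timeout_ms=5.0, clock=time.perf_counter):
        self.max_batch = max_batch
        self.timeout_s = timeout_ms / 1000.0
        self.clock = clock
        self.pending = deque()  # (arrival_time, request)

    def submit(self, request):
        self.pending.append((self.clock(), request))

    def ready_batch(self):
        """Return a batch if full, or if the oldest request timed out."""
        if not self.pending:
            return None
        full = len(self.pending) >= self.max_batch
        expired = self.clock() - self.pending[0][0] >= self.timeout_s
        if not (full or expired):
            return None
        return [self.pending.popleft()[1]
                for _ in range(min(self.max_batch, len(self.pending)))]
```

A real serving loop would run this against a thread-safe queue; the injectable `clock` is here only to make the policy easy to reason about and test.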

The Python GIL contention required rethinking the serving architecture. I migrated from a Python-based serving framework to Triton Inference Server, which uses C++ for request handling and avoids GIL contention. This change eliminated about 15ms of overhead from the request processing pipeline.

ONNX Runtime provided another significant optimization. I exported the PyTorch model to ONNX format and used ONNX Runtime for inference. Its graph-level optimizations, including operator fusion and layout transformation, reduced inference time by another 30% compared to native PyTorch.

The final optimization involved infrastructure changes. I moved from CPU-based inference on general-purpose instances to GPU instances with shared serving across multiple models. The GPU's massive parallelism handled the classification model trivially, with inference time dropping to under 5ms.

After all optimizations, end-to-end latency dropped from over 100ms to a consistent 18-22ms at p95, comfortably under the 50ms target.

### Debugging a fine-tuned model in production

A particularly challenging debugging experience involved a fine-tuned summarization model that suddenly began producing degraded outputs in production after several months of stable operation. The model would occasionally generate summaries that were incoherent, repetitive, or completely off-topic, though most outputs remained acceptable.

The intermittent nature made debugging difficult. I could not consistently reproduce the issue in testing environments. Roughly 2-3% of production requests showed degradation, but the same inputs processed again often produced correct outputs.

I began by analyzing the problematic outputs systematically. I implemented extensive logging to capture full inputs, outputs, and intermediate states for requests identified as problematic through user reports or automated quality scoring. Analysis revealed that problematic outputs tended to be longer than typical summaries and showed repetitive patterns.
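One cheap automated quality signal for the repetitive pattern described above is the fraction of duplicated n-grams in an output; this detector and its threshold are illustrative, not the production scorer:

```python
def repeated_ngram_ratio(text, n=4):
    """Fraction of whitespace-token n-grams that are duplicates."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def looks_degenerate(summary, threshold=0.2):
    """Flag loopy, repetitive generations for logging and review."""
    return repeated_ngram_ratio(summary) > threshold
```

Running a check like this on every response makes the 2-3% of degraded outputs visible without waiting for user reports.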

The repetition pattern suggested the model was getting stuck in generation loops. I examined the generation parameters and discovered that the production deployment was using top-p sampling with p=0.92. Lowering it to p=0.85 dramatically cut the occurrence of repetitive outputs.
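To see why lowering top-p helps, here is a minimal nucleus filter over a single probability vector: a smaller p admits fewer low-probability tokens into the sampling set, which are exactly the tokens that feed degenerate loops. The example distribution is made up:

```python
import numpy as np

def nucleus_filter(probs, top_p):
    """Zero out tokens outside the smallest set whose cumulative mass
    reaches top_p, then renormalize. `probs` is a 1-D probability vector."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Keep tokens up to and including the first one crossing top_p.
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```

With `probs = [0.5, 0.3, 0.1, 0.06, 0.04]`, p=0.92 keeps four tokens while p=0.85 keeps three, so the tail token is never sampled.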

Further investigation revealed that the problematic cases correlated with unusual input characteristics. Inputs with very long documents or documents containing specific formatting artifacts were more likely to trigger issues.

I implemented improved input validation and sanitization. The preprocessing pipeline now removes problematic formatting, truncates excessively long inputs more gracefully, and normalizes whitespace and special characters more aggressively.
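A sketch of that hardened preprocessing follows; the specific rules, the token limit, and the sentence-boundary truncation heuristic are illustrative assumptions, not the exact production pipeline:

```python
import re
import unicodedata

MAX_INPUT_TOKENS = 4096  # assumed model input limit

def sanitize(text, max_tokens=MAX_INPUT_TOKENS):
    # Normalize Unicode and strip control characters (keep newline/tab).
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text
                   if unicodedata.category(ch)[0] != "C" or ch in "\n\t")
    # Collapse whitespace runs that formatting artifacts tend to leave.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Truncate gracefully: cut at the last sentence end before the limit.
    tokens = text.split()
    if len(tokens) > max_tokens:
        clipped = " ".join(tokens[:max_tokens])
        end = clipped.rfind(". ")
        text = clipped[:end + 1] if end > 0 else clipped
    return text.strip()
```

Cutting at a sentence boundary rather than mid-sentence is what "truncates more gracefully" means in practice: the model never sees a dangling fragment at the end of its input.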

The remaining issues traced to model configuration drift. Comparing the production deployment configuration against the training configuration revealed several discrepancies introduced during deployment optimizations.

Restoring strict alignment between training and inference configurations eliminated most remaining issues. I implemented automated configuration validation that verifies production deployments match training configurations exactly unless differences are explicitly approved and documented.
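The core of such a validation check is a keyed diff between the two configurations with an allow-list for approved differences; the key names and allow-list mechanism here are illustrative:

```python
def config_drift(train_cfg, prod_cfg, approved=()):
    """Return {key: (train_value, prod_value)} for unapproved mismatches."""
    drift = {}
    for key in set(train_cfg) | set(prod_cfg):
        if key in approved:
            continue
        if train_cfg.get(key) != prod_cfg.get(key):
            drift[key] = (train_cfg.get(key), prod_cfg.get(key))
    return drift

# A CI gate can then simply fail the deployment when config_drift(...)
# returns a non-empty dict.
```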

The final issue traced to a subtle concurrency bug in the custom serving code: multiple threads were sharing mutable model state that should have been thread-local.

Fixing the thread safety issue eliminated the last class of problematic outputs.
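The shape of that fix can be sketched with `threading.local`: read-only weights stay shared, while per-request mutable state (caches, decode buffers) gets one instance per thread. The class and attribute names are hypothetical:

```python
import threading

class SummarizerWorker:
    def __init__(self, model):
        self.model = model               # read-only weights: safe to share
        self._local = threading.local()  # mutable state: one per thread

    @property
    def state(self):
        # Each thread lazily gets its own cache dict, so concurrent
        # requests can no longer corrupt one another's generation state.
        if not hasattr(self._local, "cache"):
            self._local.cache = {}       # e.g. decode buffers
        return self._local.cache
```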

### Migrating ML systems to AWS

I have guided several ML system migrations to AWS from various starting points. One significant project involved migrating an on-premise GPU cluster running model training and inference to AWS infrastructure.

The primary technical challenge was data migration. The on-premise system had accumulated roughly 200TB of training data, model checkpoints, and experiment artifacts over the years. I used AWS Snowball devices to physically ship the bulk data to AWS.

Dependency management proved more complex than anticipated. The on-premise system had evolved organically over years, accumulating customized library versions, patches, and configurations that were poorly documented. I ultimately built Docker containers that captured the environment.

Cost optimization required careful architecture decisions. I redesigned workloads to leverage spot instances for training, reserved instances for steady-state inference, and Lambda for lightweight inference workloads.

Network configuration was a challenge for a team unfamiliar with VPC networking. I implemented network segmentation with separate subnets for different workload types.

Migrating monitoring and observability was a substantial effort. I moved monitoring to CloudWatch and rebuilt the dashboards on its metrics.

Identity and access management required redesigning authentication and authorization. I implemented IAM roles and policies, service accounts for applications, and MFA for human users.

Phased migration minimized disruption. We maintained parallel operation of on-premise and AWS systems during transition.

### Large cost optimization program

The most complex cost optimization project I led involved a machine learning platform supporting dozens of research teams with varied workloads. Monthly AWS spending had grown to roughly 180,000 USD.

I began with cost attribution analysis. I implemented comprehensive tagging policies requiring all resources to be tagged with cost center, project, and environment.
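Enforcing such a policy reduces to scanning resources for missing tag keys; the required keys match the three described above, while the resource-dict shape is illustrative:

```python
REQUIRED_TAGS = {"cost-center", "project", "environment"}

def untagged_resources(resources):
    """Return ids of resources missing any required tag key.
    `resources` is a list of {"id": ..., "tags": {key: value}} dicts."""
    return [r["id"] for r in resources
            if not REQUIRED_TAGS <= set(r.get("tags", {}))]
```

Run on a periodic inventory export, a check like this makes unattributed spend visible and lets enforcement (alerts, or blocking untagged launches) follow.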

The largest single optimization involved redesigning the training infrastructure. I migrated training to a hybrid model using spot instances through AWS Batch.

Inference optimization involved consolidating endpoints using multi-model endpoints and migrating low-traffic inference to Lambda.

Storage optimization used S3 lifecycle policies to tier aging data into cheaper storage classes. Instance right-sizing and Savings Plans improved baseline costs.
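A representative lifecycle rule looks like the following, in the shape boto3's `put_bucket_lifecycle_configuration` expects; the prefix, day counts, and storage classes are example values, not the actual policy:

```python
# Example lifecycle configuration for experiment artifacts: move to
# infrequent access after 30 days, Glacier after 90, delete after a year.
lifecycle_rules = {
    "Rules": [
        {
            "ID": "tier-experiment-artifacts",
            "Filter": {"Prefix": "experiments/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}
```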

These combined optimizations reduced monthly spending from 180,000 USD to 98,000 USD, roughly a 45% reduction.
