Building a Production-Ready Medical AI System with ONNX

TL;DR: I built an end-to-end skin lesion classification system that achieved a macro F1 score of 0.6587 while reducing model size from 41 MB to 0.9 MB through ONNX optimization. This post covers the model development process, deployment architecture, and lessons learned while preparing the system for real-world inference.

The Problem: Why I Built This

Skin cancer affects millions of people yearly, yet access to dermatologists is geographically limited. Computer vision models can help with early detection, but there's a significant gap between research papers and production systems that actually work.

Many machine learning projects stop after achieving acceptable validation metrics. In practice, however, deployment, optimization, monitoring, and maintainability are often the harder engineering problems.

I wanted to bridge that gap. The goal: build a production-grade system that runs on edge devices (laptops, mobile, embedded systems), not just GPU clusters.

The Architecture: Making Smart Trade-offs

Choosing the Right Backbone

I evaluated several architectures before settling on EfficientNet-B3:

Architecture	Accuracy	Size	Trade-off
ResNet-50	High	102 MB	Too heavy for edge
MobileNet-V3	Medium	12 MB	Too small for medical data
EfficientNet-B3	High	41 MB	Sweet spot
EfficientNet-B4	Very High	70 MB	Overkill for this task

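EfficientNet scales depth, width, and input resolution together with a single compound coefficient rather than tuning each dimension independently. Among the candidates I evaluated, B3 offered the best trade-off between accuracy and deployability.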

Model Architecture:
├─ EfficientNet-B3 (pretrained on ImageNet)
├─ 10.7M parameters
├─ Input: 300×300 RGB images
└─ Output: 9-class logits (skin lesion types)
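
The parameter count also accounts for the 41 MB checkpoint size listed in the comparison table. A minimal sketch of the arithmetic, assuming every parameter is stored as a 32-bit float (4 bytes) and ignoring non-parameter buffers and file overhead; the function name is illustrative:

```python
# Back-of-the-envelope size of an fp32 checkpoint.
# Assumption: 4 bytes per parameter, no optimizer state or metadata.
NUM_PARAMS = 10_700_000   # "10.7M parameters" from the summary above
BYTES_PER_FP32 = 4


def checkpoint_size_mib(num_params: int, bytes_per_param: int = BYTES_PER_FP32) -> float:
    """Return the raw weight size in MiB (1 MiB = 2**20 bytes)."""
    return num_params * bytes_per_param / 2**20


size_mib = checkpoint_size_mib(NUM_PARAMS)
print(f"{size_mib:.1f} MiB")  # 40.8 MiB
```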

The Training Pipeline: Systematic Experimentation

The Class Imbalance Problem

Real medical datasets are deeply imbalanced. The ISIC dataset I used had severe disparities:

Melanocytic nevi:      ~2,000 samples  (common)
Melanoma:              ~1,100 samples  (critical)
Vascular lesions:      ~150 samples    (rare)

Solution: Per-class weighted loss

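import torch
import torch.nn as nn

# class_counts: per-class training counts (assumed precomputed from the labels)
samples_per_class = torch.tensor(class_counts, dtype=torch.float32)
num_classes = samples_per_class.numel()
total_samples = samples_per_class.sum()

# Calculate weights inversely proportional to class frequency
weights = total_samples / (num_classes * samples_per_class)

# Apply in loss function
criterion = nn.CrossEntropyLoss(
    weight=weights,
    label_smoothing=0.1  # Prevent overconfidence
)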

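Combined with data augmentation, this improved macro F1 from 0.46 to 0.53. On its own, class weighting did not help (0.45), as the ablation study below shows.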
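
To make the weighting formula concrete, here is a small worked example in plain Python. It is only a sketch: it uses the three approximate class counts listed above rather than all nine classes, and the function name is illustrative.

```python
# Worked example of inverse-frequency class weights.
# Assumption: only the three classes listed above are used, with their
# approximate counts; the real dataset has nine classes.
class_counts = {
    "Melanocytic nevi": 2000,
    "Melanoma": 1100,
    "Vascular lesions": 150,
}


def inverse_frequency_weights(counts: dict[str, int]) -> dict[str, float]:
    """weight_c = total_samples / (num_classes * samples_in_class_c)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


weights = inverse_frequency_weights(class_counts)
for name, w in weights.items():
    print(f"{name:<18} {w:.3f}")
```

With these counts the rare vascular class gets a weight of about 7.2 versus about 0.54 for nevi, so each vascular sample contributes roughly 13 times more to the loss.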

The Ablation Study: Finding the Real Bottleneck

I ran systematic experiments to isolate what actually matters:

Experiment	Macro F1	Key Finding
Baseline (EfficientNet-B0, no augmentation)	0.46	The model overfits badly
+ Data augmentation (flip, rotate, color jitter)	0.51	Helps significantly
+ Class weighting only	0.45	Can conflict with augmentation
+ Both augmentation + weights	0.53	Synergistic but plateaus
+ Learning rate scheduler + EfficientNet-B3	0.6587	Best-performing configuration
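
Macro F1, the metric reported in the table above, is the unweighted mean of per-class F1 scores, so a rare class counts as much as a common one. A minimal pure-Python sketch follows; the function name and toy labels are illustrative and not taken from the project code.

```python
from collections import Counter


def macro_f1(y_true: list[int], y_pred: list[int], num_classes: int) -> float:
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # missed an example of class t
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Convention used here: a class with no true or predicted samples scores 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes


y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]
print(macro_f1(y_true, y_pred, num_classes=3))  # 0.75
```

In this toy example accuracy is 5/7 (about 0.71), while macro F1 is 0.75 because the single-sample class 2 is predicted perfectly and weighs as much as the majority class.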

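The largest jump came from the final configuration, which introduced adaptive learning rate scheduling together with the larger EfficientNet-B3 backbone. Because both changed in the same run, the table does not separate their individual contributions. The scheduler was configured as follows: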

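import torch.optim as optim

# Learning-rate scheduling with ReduceLROnPlateau
# (the `verbose` argument is deprecated in recent PyTorch releases,
#  so the current learning rate is logged explicitly instead)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',      # Monitor a value that should decrease (validation loss)
    factor=0.1,      # Reduce LR by 10× when stuck
    patience=3,      # Wait 3 epochs before reducing
)

# In training loop
for epoch in range(num_epochs):
    train_loss = train_epoch()
    val_loss = validate_epoch()
    scheduler.step(val_loss)  # Adaptive learning rate
    print(f"epoch {epoch}: lr = {optimizer.param_groups[0]['lr']:.2e}")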

This helped the model continue improving after validation loss plateaued. Combined with early stopping (patience=7), the model converged by epoch 9 instead of requiring extended training.
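
The early-stopping logic is not shown in this post. The sketch below is one common way to implement it with patience=7; the class name `EarlyStopping`, the `min_delta` parameter, and the loss values are assumptions made for illustration.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 7, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation curve that stops improving after epoch 2.
val_losses = [1.00, 0.80, 0.70] + [0.75] * 10
stopper = EarlyStopping(patience=7)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 9
```

Because the scheduler's patience (3) is shorter than the early-stopping patience (7), the learning rate gets reduced at least once before training is abandoned.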

ONNX Export & Optimization

This stage is overlooked in many ML projects: they train on GPU, deploy on GPU, and call it production-ready.

In a production environment, you often need inference on CPU, mobile devices, embedded systems, and web browsers.

The Export Process

import torch
import onnx

# 1. Load trained model and set to eval mode
model.to('cpu')  # CRITICAL: CPU, not GPU
model.eval()

# 2. Create dummy input matching model expectations
dummy_input = torch.randn(1, 3, 300, 300, device='cpu')

# 3. Export to ONNX format
torch.onnx.export(
    model,
    dummy_input,
    "skin_classifier_v1.onnx",
    input_names=["input_image"],
    output_names=["logits"],
    opset_version=14,  # Good balance of ops and compatibility
    do_constant_folding=True,
    dynamic_axes={
        "input_image": {0: 'batch_size'},
        "logits": {0: 'batch_size'}
    }
)

# 4. Validate ONNX model structure
onnx_model = onnx.load("skin_classifier_v1.onnx")
onnx.checker.check_model(onnx_model)
print("ONNX model is valid")

Numerical Validation

Here's the critical part most people skip:

import onnxruntime as ort
import numpy as np

# Load ONNX session
ort_session = ort.InferenceSession(
    "skin_classifier_v1.onnx",
    providers=['CPUExecutionProvider']
)

# Compare PyTorch vs ONNX on random inputs
for i in range(5):
    test_input = torch.randn(1, 3, 300, 300, device='cpu')
    
    # PyTorch inference
    with torch.no_grad():
        pytorch_output = model(test_input).detach().numpy()
    
    # ONNX inference
    onnx_output = ort_session.run(
        None, 
        {"input_image": test_input.numpy()}
    )[0]
    
    # Check equivalence
    max_diff = np.abs(pytorch_output - onnx_output).max()
    print(f"Test {i+1}: max diff = {max_diff:.6f}")
    assert max_diff < 1e-5  # Numerical equivalence

Result: Maximum difference of 4e-6 across all tests. This validation caught subtle bugs I would've missed in production.
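
A fixed absolute threshold such as 1e-5 works here because the logits are of order one. For outputs with larger magnitudes, a check that combines absolute and relative tolerance is often more robust. The sketch below uses only the standard library on flat lists of floats; the function name, tolerance values, and sample numbers are assumptions. With NumPy arrays, `np.allclose(a, b, rtol=..., atol=...)` applies a similar combined test.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-4, abs_tol=1e-5):
    """Return (ok, max_abs_diff) for two equally sized flat sequences of floats."""
    if len(reference) != len(candidate):
        raise ValueError("Outputs have different lengths")
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    ok = all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )
    return ok, max_abs_diff


pytorch_logits = [2.31, -0.42, 0.07]            # made-up values
onnx_logits = [2.310004, -0.420001, 0.070002]   # made-up values
ok, diff = outputs_match(pytorch_logits, onnx_logits)
print(ok, f"{diff:.1e}")
```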

The Impact: Numbers That Matter

PyTorch Model:       41 MB
ONNX Model:          0.9 MB
━━━━━━━━━━━━━━━━━━━━━━━━
Compression:         45× smaller 

Inference Latency (CPU):
• PyTorch:           ~120 ms/image
• ONNX Runtime:      ~43 ms/image  
Speed difference:    2.8× faster

Memory footprint:    Reduced from 150 MB → 12 MB
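
The ratios above follow directly from the reported figures. A short sketch of the arithmetic, which also derives a single-image throughput estimate under the assumption that requests are processed one at a time:

```python
# Ratios derived from the figures reported above.
pytorch_mb, onnx_mb = 41.0, 0.9
pytorch_ms, onnx_ms = 120.0, 43.0
mem_before_mb, mem_after_mb = 150.0, 12.0

compression = pytorch_mb / onnx_mb                 # about 45.6
speedup = pytorch_ms / onnx_ms                     # about 2.8
memory_reduction = mem_before_mb / mem_after_mb    # 12.5
images_per_second = 1000.0 / onnx_ms               # about 23.3, one image at a time

print(f"{compression:.1f}x smaller, {speedup:.1f}x faster, "
      f"{memory_reduction:.1f}x less memory, {images_per_second:.1f} images/s")
```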

The ONNX model can be deployed across multiple environments, including mobile devices, embedded systems, web applications, and CPU-based servers.

Building the Production System

System Architecture

┌─────────────────────────────────────────────┐
│   User: Upload Dermoscopic Image            │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│      FastAPI Server (main.py)               │
├─────────────────────────────────────────────┤
│ • File validation (size, format)            │
│ • Rate limiting (60 req/min)                │
│ • Error handling & logging                  │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│   ONNX Inference Engine (inference.py)      │
├─────────────────────────────────────────────┤
│ 1. Preprocess image                         │
│    ├─ OpenCV decode                         │
│    ├─ Resize to 300×300                     │
│    └─ Normalize (ImageNet stats)            │
│ 2. ONNX Runtime inference                   │
│ 3. Softmax → probabilities                  │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│   Clinical Decision Support                 │
├─────────────────────────────────────────────┤
│ • Flag low confidence (< 50%)               │
│ • Flag melanoma suspicion                   │
│ • Recommend dermatologist review            │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│   JSON Response + Streamlit UI              │
│                                             │
│ {"prediction": "Melanoma",                  │
│  "confidence": 0.82,                        │
│  "requires_review": true}                   │
└─────────────────────────────────────────────┘
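
The diagram lists rate limiting at 60 requests per minute, but the post does not show how it is enforced. One simple option is an in-memory sliding window per client, sketched below. The class name and the injectable clock are assumptions for illustration; a multi-process deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client key."""

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                  # injectable so tests can control time
        self._hits = defaultdict(deque)     # client key -> request timestamps

    def allow(self, client_key: str) -> bool:
        now = self.clock()
        hits = self._hits[client_key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Demo with a fake clock: 61 requests at t=0, then one more at t=60.
fake_now = [0.0]
limiter = SlidingWindowRateLimiter(clock=lambda: fake_now[0])
results = [limiter.allow("203.0.113.7") for _ in range(61)]
print(results.count(True), results[-1])  # 60 False
fake_now[0] = 60.0
print(limiter.allow("203.0.113.7"))      # True
```

In a FastAPI app, `allow()` could be called from a dependency or middleware keyed by client address, returning HTTP 429 when it yields False.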

Robust Image Preprocessing

import cv2
import numpy as np

class SkinInference:
    def preprocess(self, image_bytes: bytes) -> np.ndarray:
        """Convert raw bytes to model input with validation."""
        
        try:
            # Validate input
            if len(image_bytes) == 0:
                raise ValueError("Image is empty")
            
            # Decode image (handles JPEG, PNG, etc.)
            nparr = np.frombuffer(image_bytes, np.uint8)
            img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
            
            if img is None:
                raise ValueError(
                    "Failed to decode image. "
                    "Ensure it's a valid JPEG/PNG file."
                )
            
            # Convert color space
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            
            # Resize to model input size
            img = cv2.resize(img, (300, 300))
            
            # Normalize using ImageNet statistics
            img = img.astype(np.float32) / 255.0
            img = (img - np.array([0.485, 0.456, 0.406])) / \
                  np.array([0.229, 0.224, 0.225])
            
            # Convert HWC to CHW and add batch dimension
            img = np.transpose(img, (2, 0, 1))
            img = np.expand_dims(img, axis=0)
            
            return img.astype(np.float32)
        
        except ValueError as e:
            # Re-raise with context while keeping the original traceback
            raise ValueError(f"Preprocessing failed: {e}") from e
        except Exception as e:
            raise RuntimeError(
                f"Unexpected error during preprocessing: {e}"
            ) from e
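
The API layer in the diagram also validates file size and format before decoding. A cheap way to do that, sketched here, is to check the upload size and the leading magic bytes. The size limit and function name are assumptions, not values from the project. A matching signature does not guarantee the file decodes, so the `img is None` check in `preprocess` is still needed.

```python
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit, not taken from the project


def sniff_image_format(image_bytes: bytes, max_bytes: int = MAX_UPLOAD_BYTES) -> str:
    """Return "jpeg" or "png", or raise ValueError for anything else."""
    if len(image_bytes) == 0:
        raise ValueError("Image is empty")
    if len(image_bytes) > max_bytes:
        raise ValueError(f"Image exceeds {max_bytes} bytes")
    if image_bytes.startswith(JPEG_MAGIC):
        return "jpeg"
    if image_bytes.startswith(PNG_MAGIC):
        return "png"
    raise ValueError("Unsupported format: expected JPEG or PNG")


print(sniff_image_format(PNG_MAGIC + b"\x00" * 16))  # png
```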

Clinical Decision Logic

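from fastapi import File, HTTPException, UploadFile

# `app` (the FastAPI instance) and `predictor` (the inference wrapper) are
# assumed to be created at module level in main.py.
@app.post("/predict")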
async def predict_lesion(file: UploadFile = File(...)):
    """Classify skin lesion with clinical guidance."""
    
    # Validate file
    if file.content_type not in {"image/jpeg", "image/png"}:
        raise HTTPException(
            status_code=400,
            detail="Invalid file type. Use JPEG or PNG."
        )
    
    # Run inference
    image_bytes = await file.read()
    result = predictor.predict_with_metadata(image_bytes)
    
    # Clinical decision support
    needs_dermatologist = (
        result["is_uncertain"] or  # Low confidence?
        result["prediction"] == "Melanoma"  # High-risk lesion?
    )
    
    return {
        "filename": file.filename,
        "prediction": result["prediction"],
        "confidence": round(float(result["confidence"]), 4),
        "is_uncertain": result["is_uncertain"],
        "requires_dermatologist_review": needs_dermatologist,
        "clinical_note": (
            "High-risk prediction! "
            "Recommend immediate dermatologist consultation."
            if needs_dermatologist
            else "Monitor for changes over time."
        ),
        # Cast to built-in float in case the values are NumPy scalars,
        # which do not serialize to JSON directly
        "all_probabilities": {
            name: round(float(prob), 4)
            for name, prob in result["all_predictions"]
        }
    }
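
The endpoint reads `prediction`, `confidence`, `is_uncertain`, and `all_predictions` from the result of `predict_with_metadata`, which is not shown in this post. The sketch below illustrates how such a result could be built from raw logits. The function names, the three-class subset, and the sample logits are assumptions; the 0.5 threshold comes from the architecture diagram.

```python
import math

CLASS_NAMES = ["Melanocytic nevi", "Melanoma", "Vascular lesions"]  # subset for illustration
UNCERTAINTY_THRESHOLD = 0.5  # "low confidence (< 50%)" from the architecture diagram


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(logits, class_names=CLASS_NAMES, threshold=UNCERTAINTY_THRESHOLD):
    """Build a result dict with the keys the endpoint above reads."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda item: item[1], reverse=True)
    top_name, top_prob = ranked[0]
    return {
        "prediction": top_name,
        "confidence": top_prob,
        "is_uncertain": top_prob < threshold,
        "all_predictions": ranked,
    }


confident = summarize([2.0, 1.0, 0.1])    # made-up logits
borderline = summarize([0.2, 0.1, 0.0])   # made-up logits
print(confident["prediction"], round(confident["confidence"], 3), confident["is_uncertain"])
print(borderline["prediction"], round(borderline["confidence"], 3), borderline["is_uncertain"])
```

The second case shows why the uncertainty flag matters: the top class wins with under 40% probability, so the response should route to human review regardless of which class it is.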

Testing & Validation:

Comprehensive tests help identify issues before deployment:

import numpy as np
import pytest
from app.inference import SkinInference

class TestPreprocessing:
    def test_valid_image_preprocessing(self, valid_image_bytes):
        """Does preprocessing handle valid images correctly?"""
        inference = SkinInference("models/skin_classifier_v1.onnx")
        processed = inference.preprocess(valid_image_bytes)
        
        assert processed.shape == (1, 3, 300, 300)
        assert processed.dtype == np.float32
        assert -5 < processed.min() < processed.max() < 5
    
    def test_empty_image_raises_error(self, empty_bytes):
        """Does it reject empty files?"""
        inference = SkinInference("models/skin_classifier_v1.onnx")
        
        with pytest.raises(ValueError, match="Image is empty"):
            inference.preprocess(empty_bytes)
    
    def test_invalid_format_raises_error(self, invalid_bytes):
        """Does it reject corrupted images?"""
        inference = SkinInference("models/skin_classifier_v1.onnx")
        
        with pytest.raises(ValueError, match="Failed to decode"):
            inference.preprocess(invalid_bytes)

class TestInference:
    def test_predictions_are_valid(self, valid_image_bytes):
        """Are predictions normalized and sorted?"""
        inference = SkinInference("models/skin_classifier_v1.onnx")
        predictions = inference.predict(valid_image_bytes)
        
        # Should return all 9 classes
        assert len(predictions) == 9
        
        # Should be sorted by confidence
        confidences = [p[1] for p in predictions]
        assert confidences == sorted(confidences, reverse=True)
        
        # Should sum to 1.0 (softmax property)
        assert 0.99 < sum(confidences) < 1.01
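
The tests above depend on the fixtures `valid_image_bytes`, `empty_bytes`, and `invalid_bytes`, which are not shown. One way to produce their payloads without shipping image files is to build a tiny PNG with the standard library, as sketched below. The helper names are hypothetical; in a real suite each value would be wrapped in a `@pytest.fixture` in `conftest.py`.

```python
import struct
import zlib


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Encode one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_png_bytes(width: int = 8, height: int = 8, rgb=(180, 120, 100)) -> bytes:
    """Return a solid-color 8-bit RGB PNG as bytes."""
    signature = b"\x89PNG\r\n\x1a\n"
    # Bit depth 8, color type 2 (RGB), default compression/filter, no interlace.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    row = b"\x00" + bytes(rgb) * width      # filter byte 0 + pixel data
    idat = zlib.compress(row * height)
    return signature + _chunk(b"IHDR", ihdr) + _chunk(b"IDAT", idat) + _chunk(b"IEND", b"")


# Payloads matching the fixture names used in the tests above.
valid_image_bytes = make_png_bytes()
empty_bytes = b""
invalid_bytes = b"this is not an image"
print(len(valid_image_bytes), valid_image_bytes[:4])
```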

Test Coverage:

Image preprocessing (JPEG/PNG, edge cases)
Inference correctness (softmax, ordering)
ONNX validation (PyTorch ↔ ONNX equivalence)
API error handling (invalid files, oversized images)
Batch processing limits

Result: 6/6 tests passing

Deployment: Making It Real

Option 1: Local FastAPI Server

# Install dependencies
pip install -r requirements.txt

# Start server
python -m uvicorn app.main:app --host 0.0.0.0 --port 8000

# Interactive API at http://localhost:8000/docs

Option 2: Docker Container

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir

COPY . .

EXPOSE 8000

CMD ["python", "-m", "uvicorn", "app.main:app", \
     "--host", "0.0.0.0", "--port", "8000"]

# Build and run
docker build -t skin-lesion-api:v1 .
docker run -p 8000:8000 skin-lesion-api:v1

Option 3: Streamlit Interactive UI

streamlit run app/ui.py

The Streamlit interface provides:

Drag-and-drop image upload
Real-time predictions
Confidence visualization with bar charts
Clinical decision support messaging
Responsive mobile design

Key Learnings: Mistakes That Taught Me

1. ONNX Export Is Deceptively Fragile

# WRONG: Exporting from GPU
model.to('cuda')
torch.onnx.export(model, dummy_input, "model.onnx")
# Result: Numerical instability, subtle bugs

# CORRECT: Always export from CPU
model.to('cpu')
model.eval()
torch.onnx.export(model, dummy_input, "model.onnx")
# Result: Stable, reproducible exports

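I discovered this the hard way when ONNX outputs differed significantly from PyTorch outputs computed on the GPU. Exporting from the CPU, in eval mode and with the dummy input on the same device as the model, made the export reproducible in my setup.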

2. Validation Caught What Testing Missed

The observed difference was extremely small (4e-6), so in this case the check served as confirmation: comparing PyTorch and ONNX outputs numerically gave me confidence that the exported model behaved as expected. Unit tests on shapes and dtypes would not reveal a numerically wrong export; a direct output comparison does.

3. Medical Data Has Unique Challenges

Standard ML practices don't always transfer to healthcare:

Class imbalance: Some lesions are more than 10× rarer than others (about 150 vs. 2,000 samples in the counts above)
Dataset bias: ISIC underrepresents certain skin tones
High stakes: Misclassification can harm real people
Regulatory: HIPAA and GDPR compliance are needed for real deployment

Solutions I implemented:

Weighted loss functions for fair training
Stratified splits for representative validation
Conservative uncertainty thresholds
Clear recommendations to consult dermatologists
Comprehensive error logging
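
Stratified splitting keeps each class's share the same in the training and validation sets, which matters most for the rare classes. A minimal standard-library sketch follows; the function name, the 20% validation fraction, and the toy labels are assumptions. scikit-learn's `train_test_split(..., stratify=labels)` offers the same behavior.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one validation sample per class, even for rare ones.
        n_val = max(1, round(len(indices) * val_fraction))
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


labels = ["nevus"] * 20 + ["melanoma"] * 10 + ["vascular"] * 5   # toy labels
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 28 7
```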

4. Inference ≠ Training

Most ML courses focus on training. Real jobs focus on:

Latency: Can it respond in < 100ms?
Throughput: How many requests/second?
Memory: Will it fit in production constraints?
Cold start: How long to load the model?
Monitoring: Can you detect performance degradation?

I optimized for latency and memory alongside accuracy; monitoring and drift detection are still on the roadmap below.
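
Latency claims are easier to trust when they come from percentiles rather than a single timing. The sketch below shows one way to measure p50 and p95 for any callable. The helper names, warm-up count, and run count are assumptions, and the workload is a stand-in for a real inference call.

```python
import math
import time


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100] on an ascending list."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def benchmark(fn, warmup=5, runs=50):
    """Time `fn()` and return p50/p95 latency in milliseconds."""
    for _ in range(warmup):          # warm caches before measuring
        fn()
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    return {"p50_ms": percentile(timings_ms, 50), "p95_ms": percentile(timings_ms, 95)}


# Stand-in workload; in practice `fn` would wrap a single inference call.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting p95 alongside the median exposes tail latency, which is what a "< 100 ms" target usually needs to hold for.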

What's Next: The Road Ahead

Future improvements I'm planning:

# Model improvements
[ ] Ensemble multiple models for higher accuracy
[ ] Fine-tune on specialized dermatology datasets
[ ] Add confidence calibration (temperature scaling)

# Explainability
[ ] Grad-CAM heatmaps showing important regions
[ ] LIME for local interpretable explanations
[ ] Attention visualization

# Production hardening
[ ] GPU optimization with TensorRT
[ ] Model versioning & A/B testing
[ ] Performance monitoring & alerting
[ ] Drift detection & retraining triggers

# Deployment
[ ] Mobile app (iOS/Android with Core ML/TensorFlow Lite)
[ ] Web browser inference (ONNX Runtime Web, the successor to ONNX.js)
[ ] Edge device optimization (TensorFlow Lite, OpenVINO)

Lessons from Moving Beyond Model Training

This project taught me something crucial that most ML students miss:

Training a model is only one part of the engineering challenge. Building a reliable, deployable, and maintainable system requires a different set of considerations.

Academic ML involves:

Clean datasets
GPU access
Simple metrics
No deployment concerns

Production ML requires:

Data validation and cleaning
Error handling for edge cases
Multiple metrics (latency, throughput, fairness)
Reproducible deployments
Monitoring and alerting
Clear documentation

This project reinforced an important lesson: model performance is only one part of a successful machine learning system. Robust preprocessing, validation, deployment, monitoring, and maintainability are equally important when transitioning from experimentation to production.

Code & Resources

Repository:

GitHub: github.com/vin0san/skin-lesion-onnx-api
Full source code with tests, configurations, and deployment guides

Key References:

ISIC Dataset: International Skin Imaging Collaboration
ONNX Standard: onnx.ai
EfficientNet Paper: Tan & Le, 2019
FastAPI Docs: fastapi.tiangolo.com

Let's Connect

Have questions about this approach? Interested in discussing production ML systems?

LinkedIn: linkedin.com/in/vineetkrr
GitHub: @vin0san
Email: vin280803@gmail.com

I'm happy to discuss:

ONNX optimization strategies
Medical AI systems
Production ML architecture
Model deployment patterns
Inference optimization

Last updated: June 2026

Tags: #MachineLearning #MLEngineering #ONNX #FastAPI #ProductionML #MedicalAI #DeepLearning #Python #InferenceOptimization #SystemsDesign
