feat(deploy): Optional Azure ML mirror for OSMO training runs #668

@katriendg

Component

OSMO Control Plane

Problem Statement

OSMO-submitted training runs (IL and RL) write checkpoints and TensorBoard logs to the cluster's PVC, but there is no first-class path to land them in an Azure ML workspace as governed model versions. Operators who want a versioned, shareable home for trained policies, or who want to feed AzureML-based deployment gating, must either rely on the in-process registration paths inside training (which already double-register in IL and are coupled to training success) or hand-roll an upload after the fact. There is also no way to retro-mirror an older run whose registration failed or never happened.

Proposed Solution

Add an opt-in Azure ML mirror that lives off the training critical path:

  • Deploy-time step in infrastructure/setup/04-deploy-osmo-backend.sh that injects AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, and AZUREML_WORKSPACE_NAME into the existing default_user OSMO pod template; pass --skip-azureml-pod-template to turn it off. The step auto-skips when no AzureML workspace appears in the Terraform outputs.
  • Standalone mirror script at training/utils/aml_mirror.py: framework-agnostic, authenticates via DefaultAzureCredential (Workload Identity), uploads TensorBoard logs plus the filtered final checkpoint, and registers a new model version via mlflow.register_model().
  • Replay workflow at workflows/osmo/replay-azureml.yaml plus submission helper training/utils/replay-azureml.sh so any completed run can be replayed by RUN_ID without modifying training workflows.
  • Documentation as a single "Azure ML Mirror (Optional)" section in infrastructure/setup/README.md covering When/Prerequisites/Enabling/Using/Disabling/Troubleshooting.

No existing IL or RL training workflow YAML is modified. Mirror failure cannot break a training run.
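
A minimal sketch of how the mirror script might read the injected environment variables and keep failures contained. The three variable names come from the pod template described above; `WorkspaceRef`, `resolve_workspace`, `run_mirror`, and the `mirror_fn` callback are hypothetical names used only for illustration. The actual upload and the `mlflow.register_model()` call are assumed to live inside the callback and are not shown.

```python
import sys
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# Variable names taken from the proposed default_user pod template.
REQUIRED_VARS = (
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZUREML_WORKSPACE_NAME",
)


@dataclass(frozen=True)
class WorkspaceRef:
    """Hypothetical container for the three workspace coordinates."""
    subscription_id: str
    resource_group: str
    workspace_name: str


def resolve_workspace(env: Mapping[str, str]) -> Optional[WorkspaceRef]:
    """Return the workspace coordinates, or None if any variable is unset or blank."""
    values = [env.get(name, "").strip() for name in REQUIRED_VARS]
    if not all(values):
        return None
    return WorkspaceRef(*values)


def run_mirror(env: Mapping[str, str],
               mirror_fn: Callable[[WorkspaceRef], None]) -> int:
    """Run mirror_fn against the configured workspace and return an exit code.

    A missing configuration is treated as "mirror not enabled" (exit 0).
    Any exception from mirror_fn is reported and mapped to exit 1 instead
    of propagating.
    """
    workspace = resolve_workspace(env)
    if workspace is None:
        print("Azure ML mirror not configured; skipping.", file=sys.stderr)
        return 0
    try:
        mirror_fn(workspace)
    except Exception as exc:  # deliberately broad: the mirror is best-effort
        print(f"Azure ML mirror failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

In a real script, `run_mirror(os.environ, mirror_fn)` would be the entry point. Because the mirror runs as its own replay workflow, a non-zero exit here marks only the mirror job as failed, not the training run it mirrors.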

Alternatives Considered

  • Inline mirror in every training workflow YAML — rejected: would create a 4th AzureML registration path on top of the three existing in-process paths in IL, producing ~12 model versions per 5-checkpoint run, and would force every workflow author to opt out.
  • New azureml_config pod template added to override_pod_template — deferred: OSMO container env-merge semantics across multiple override_pod_template entries are not verified in this codebase. Extending default_user directly sidesteps the unknown.
  • Bake aml_mirror.py into a utility container image — deferred: cleaner long-term but requires an image build pipeline. Use OSMO files: distribution first.
  • New Terraform should_enable_azureml_mirror variable — rejected: no should_* precedent in deploy scripts; tf_get auto-detection on the existing AzureML output is sufficient and matches repo convention.
  • Add osmo_aml_storage role assignment in Terraform — rejected: AzureML workspace's backing storage IS the main storage account; the existing Storage Blob Data Contributor role already covers it.

Additional Context

Prerequisites already in place:

  • OSMO managed identity already has AzureML Data Scientist and Storage Blob Data Contributor on the workspace's backing storage (verified in infrastructure/terraform/modules/platform/role-assignments.tf).
  • azureml_workspace.value.{name,id,workspace_id} already exposed at root Terraform output level.
  • Federated credential for system:serviceaccount:<workflows-ns>:osmo-workflow already present.
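
The auto-skip check against the Terraform outputs could be sketched roughly as follows, assuming the caller already holds the text produced by `terraform output -json`. `workspace_name_from_tf_outputs` is a hypothetical helper and is not the repo's `tf_get`; only the `azureml_workspace.value.name` path is taken from the list above.

```python
import json
from typing import Optional


def workspace_name_from_tf_outputs(outputs_json: str) -> Optional[str]:
    """Return azureml_workspace.value.name from `terraform output -json` text.

    Returns None when the output is absent, null, or has no usable name,
    which a deploy script could treat as "skip the Azure ML pod template".
    """
    outputs = json.loads(outputs_json)
    if not isinstance(outputs, dict):
        return None
    entry = outputs.get("azureml_workspace")
    value = entry.get("value") if isinstance(entry, dict) else None
    name = value.get("name") if isinstance(value, dict) else None
    return name if isinstance(name, str) and name else None
```

A deploy script following this approach would inject the three environment variables only when the helper returns a name, and would leave the default_user pod template untouched otherwise.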

Metadata

Labels

enhancement: New feature or improvement request
