Component
OSMO Control Plane
Problem Statement
OSMO-submitted training runs (IL and RL) write checkpoints and tensorboard logs to the cluster's PVC, but there is no first-class path to land them in an Azure ML workspace as governed model versions. Operators who want a versioned, shareable home for trained policies — or want to feed AzureML-based deployment gating — have to either rely on the in-process registration paths inside training (which already double-register in IL and are coupled to training success) or hand-roll an upload after the fact. There is also no way to retro-mirror an older run whose registration failed or never happened.
Proposed Solution
Add an opt-in Azure ML mirror that lives off the training critical path:
- Deploy-time toggle in `infrastructure/setup/04-deploy-osmo-backend.sh` (`--skip-azureml-pod-template`) that injects `AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, and `AZUREML_WORKSPACE_NAME` into the existing `default_user` OSMO pod template. Auto-skips when no AzureML workspace appears in Terraform outputs.
- Standalone mirror script at `training/utils/aml_mirror.py` — framework-agnostic, authenticates via `DefaultAzureCredential` (Workload Identity), uploads tensorboard logs plus the filtered final checkpoint, and registers a new model version via `mlflow.register_model()` (see the sketch after this list).
- Replay workflow at `workflows/osmo/replay-azureml.yaml` plus submission helper `training/utils/replay-azureml.sh`, so any completed run can be replayed by `RUN_ID` without modifying training workflows.
- Documentation as a single "Azure ML Mirror (Optional)" section in `infrastructure/setup/README.md` covering When/Prerequisites/Enabling/Using/Disabling/Troubleshooting.
No existing IL or RL training workflow YAML is modified. Mirror failure cannot break a training run.
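A minimal sketch of the mirror's core, assuming the env vars injected by the pod template; the function name `mirror_run`, the run-name prefix, and the `policy-<run_id>` model name are illustrative, and the real `aml_mirror.py` may structure this differently:

```python
import os

import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Env vars injected into the default_user pod template at deploy time.
ml_client = MLClient(
    DefaultAzureCredential(),  # resolves via Workload Identity inside the pod
    subscription_id=os.environ["AZURE_SUBSCRIPTION_ID"],
    resource_group_name=os.environ["AZURE_RESOURCE_GROUP"],
    workspace_name=os.environ["AZUREML_WORKSPACE_NAME"],
)

# Point MLflow at the workspace's tracking server (needs the azureml-mlflow plugin).
mlflow.set_tracking_uri(
    ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
)


def mirror_run(run_id: str, tb_dir: str, final_ckpt: str) -> None:
    """Upload tensorboard logs plus the final checkpoint, then register a version."""
    with mlflow.start_run(run_name=f"osmo-mirror-{run_id}") as run:
        mlflow.log_artifacts(tb_dir, artifact_path="tensorboard")
        mlflow.log_artifact(final_ckpt, artifact_path="model")
        # Each call creates a new version under the same registered model name.
        mlflow.register_model(f"runs:/{run.info.run_id}/model", f"policy-{run_id}")
```

Because the script is a standalone entrypoint invoked from the replay workflow, a failure here surfaces only in the mirror pod, never in the training job.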
Alternatives Considered
- Inline mirror in every training workflow YAML — rejected: would create a 4th AzureML registration path on top of the three existing in-process paths in IL, producing ~12 model versions per 5-checkpoint run, and would force every workflow author to opt out.
- New `azureml_config` pod template added to `override_pod_template` — deferred: OSMO container env-merge semantics across multiple `override_pod_template` entries are not verified in this codebase. Extending `default_user` directly sidesteps the unknown.
- Bake `aml_mirror.py` into a utility container image — deferred: cleaner long-term but requires an image build pipeline. Use OSMO `files:` distribution first.
- New Terraform `should_enable_azureml_mirror` variable — rejected: no `should_*` precedent in deploy scripts; `tf_get` auto-detection on the existing AzureML output is sufficient and matches repo convention (sketched after this list).
- Add an `osmo_aml_storage` role assignment in Terraform — rejected: the AzureML workspace's backing storage IS the main storage account; the existing `Storage Blob Data Contributor` role already covers it.
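The auto-detection amounts to roughly the following check (a sketch, assuming `terraform output -json` is run against the root module; `tf_get` is the repo's existing helper and may differ in detail):

```python
import json
import subprocess


def azureml_workspace_name(tf_dir: str) -> str | None:
    """Return the AzureML workspace name from root Terraform outputs, or None."""
    outputs = json.loads(
        subprocess.check_output(["terraform", "output", "-json"], cwd=tf_dir)
    )
    # Root output shape per Additional Context:
    # azureml_workspace.value.{name,id,workspace_id}
    value = outputs.get("azureml_workspace", {}).get("value") or {}
    return value.get("name") or None


# Deploy-script behavior: if this returns None, skip the pod-template patch entirely.
```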
Additional Context
Prerequisites already in place:
- OSMO managed identity already has `AzureML Data Scientist` and `Storage Blob Data Contributor` on the workspace's backing storage (verified in `infrastructure/terraform/modules/platform/role-assignments.tf`).
- `azureml_workspace.value.{name,id,workspace_id}` already exposed at the root Terraform output level.
- Federated credential for `system:serviceaccount:<workflows-ns>:osmo-workflow` already present.
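A quick in-cluster smoke test for the federated-credential path, assuming it runs in a pod under the `osmo-workflow` service account (the token scope here is illustrative):

```python
from azure.identity import DefaultAzureCredential

# In a pod running as system:serviceaccount:<workflows-ns>:osmo-workflow,
# DefaultAzureCredential picks up Workload Identity via the projected SA token.
credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default")
print(f"token acquired, expires_on={token.expires_on}")
```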