feat(deploy): Optional Azure ML mirror for OSMO training runs #668

### Component

OSMO Control Plane

### Problem Statement

OSMO-submitted training runs (IL and RL) write checkpoints and TensorBoard logs to the cluster's PVC, but there is no first-class path to land them in an Azure ML workspace as governed model versions. Operators who want a versioned, shareable home for trained policies, or who want to feed Azure ML-based deployment gating, must either rely on the in-process registration paths inside training (which already double-register in IL and are coupled to training success) or hand-roll an upload after the fact. There is also no way to retroactively mirror an older run whose registration failed or never happened.


### Proposed Solution

Add an opt-in Azure ML mirror that lives **off the training critical path**:

* **Deploy-time toggle** in `infrastructure/setup/04-deploy-osmo-backend.sh`: the script injects `AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, and `AZUREML_WORKSPACE_NAME` into the existing `default_user` OSMO pod template unless `--skip-azureml-pod-template` is passed. It also auto-skips when no AzureML workspace appears in the Terraform outputs.
* **Standalone mirror script** at `training/utils/aml_mirror.py`: framework-agnostic, authenticates via `DefaultAzureCredential` (Workload Identity), uploads the TensorBoard logs and the filtered final checkpoint, then registers a new model version via `mlflow.register_model()`.
* **Replay workflow** at `workflows/osmo/replay-azureml.yaml` plus submission helper `training/utils/replay-azureml.sh` so any completed run can be replayed by `RUN_ID` without modifying training workflows.
* **Documentation** as a single "Azure ML Mirror (Optional)" section in `infrastructure/setup/README.md` covering When/Prerequisites/Enabling/Using/Disabling/Troubleshooting.

No existing IL or RL training workflow YAML is modified. Mirror failure cannot break a training run.
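As a rough illustration of the mirror script described above, the following is a minimal sketch, not the actual `aml_mirror.py`. It assumes a hypothetical run layout (`<run_dir>/tensorboard` and `<run_dir>/checkpoints/*.pt`), a hypothetical "highest step number wins" checkpoint filter, and that the `azure-ai-ml`, `azure-identity`, `mlflow`, and `azureml-mlflow` packages are installed in the pod. The Azure imports are deferred so that a pod without the injected environment variables skips cleanly.

```python
import os
import re
from dataclasses import dataclass
from pathlib import Path

# Variables the deploy script injects into the default_user pod template.
REQUIRED_ENV = ("AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP", "AZUREML_WORKSPACE_NAME")

# Trailing step number in names such as model_1500.pt (assumed naming scheme).
_STEP = re.compile(r"(\d+)(?=\.[^.]+$)")


@dataclass(frozen=True)
class WorkspaceConfig:
    subscription_id: str
    resource_group: str
    workspace_name: str


def load_workspace_config(env=None):
    """Return the workspace config, or None when any variable is missing."""
    env = os.environ if env is None else env
    if any(not env.get(key) for key in REQUIRED_ENV):
        return None
    return WorkspaceConfig(*(env[key] for key in REQUIRED_ENV))


def pick_final_checkpoint(paths):
    """Hypothetical filter: keep the checkpoint with the highest step number."""
    best, best_step = None, -1
    for raw in paths:
        path = Path(raw)
        match = _STEP.search(path.name)
        if match and int(match.group(1)) > best_step:
            best, best_step = path, int(match.group(1))
    return best


def mirror(run_dir, model_name, env=None):
    """Upload logs and the final checkpoint, then register a model version."""
    cfg = load_workspace_config(env)
    if cfg is None:
        print("Azure ML variables not set; skipping mirror.")
        return 0
    ckpt = pick_final_checkpoint(sorted((Path(run_dir) / "checkpoints").glob("*.pt")))
    if ckpt is None:
        print("No checkpoint found; nothing to mirror.")
        return 0

    # Deferred imports: only needed once we know a workspace is configured.
    import mlflow
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    client = MLClient(
        DefaultAzureCredential(),
        cfg.subscription_id,
        cfg.resource_group,
        cfg.workspace_name,
    )
    # The azureml-mlflow plugin is required for MLflow to use this tracking URI.
    mlflow.set_tracking_uri(client.workspaces.get(cfg.workspace_name).mlflow_tracking_uri)
    with mlflow.start_run() as run:
        mlflow.log_artifacts(str(Path(run_dir) / "tensorboard"), artifact_path="tensorboard")
        mlflow.log_artifact(str(ckpt), artifact_path="checkpoints")
        mlflow.register_model(f"runs:/{run.info.run_id}/checkpoints/{ckpt.name}", model_name)
    return 0
```

Because the mirror runs as its own step rather than inside the training process, an exception raised here would fail only the mirror or replay job, which is consistent with keeping it off the training critical path.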

### Alternatives Considered

* **Inline mirror in every training workflow YAML** — rejected: would create a 4th AzureML registration path on top of the three existing in-process paths in IL, producing ~12 model versions per 5-checkpoint run, and would force every workflow author to opt out.
* **New `azureml_config` pod template added to `override_pod_template`** — deferred: OSMO container env-merge semantics across multiple `override_pod_template` entries are not verified in this codebase. Extending `default_user` directly sidesteps the unknown.
* **Bake `aml_mirror.py` into a utility container image** — deferred: cleaner long-term but requires an image build pipeline. Use OSMO `files:` distribution first.
* **New Terraform `should_enable_azureml_mirror` variable** — rejected: no `should_*` precedent in deploy scripts; `tf_get` auto-detection on the existing AzureML output is sufficient and matches repo convention.
* **Add `osmo_aml_storage` role assignment in Terraform** — rejected: AzureML workspace's backing storage IS the main storage account; the existing `Storage Blob Data Contributor` role already covers it.
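The auto-detection on the existing AzureML output can be pictured with the sketch below. It is written in Python for illustration only; the real logic lives in the shell deploy script and uses `tf_get`. The function name `azureml_env_from_outputs` is hypothetical, and deriving the subscription and resource group from the workspace's ARM resource ID is an assumption: the deploy script may read them from other outputs.

```python
def azureml_env_from_outputs(tf_outputs, skip=False):
    """Return the env vars to inject, or {} when the mirror should be skipped.

    tf_outputs is the parsed result of `terraform output -json`.
    """
    if skip:  # corresponds to passing --skip-azureml-pod-template
        return {}
    workspace = (tf_outputs.get("azureml_workspace") or {}).get("value") or {}
    name, arm_id = workspace.get("name"), workspace.get("id")
    if not name or not arm_id:
        return {}  # no AzureML workspace in the outputs: auto-skip

    # ARM resource IDs alternate key/value segments:
    # /subscriptions/<sub>/resourceGroups/<rg>/providers/<ns>/workspaces/<name>
    parts = arm_id.strip("/").split("/")
    segments = {parts[i].lower(): parts[i + 1] for i in range(0, len(parts) - 1, 2)}
    subscription = segments.get("subscriptions")
    resource_group = segments.get("resourcegroups")
    if not subscription or not resource_group:
        return {}
    return {
        "AZURE_SUBSCRIPTION_ID": subscription,
        "AZURE_RESOURCE_GROUP": resource_group,
        "AZUREML_WORKSPACE_NAME": name,
    }
```

Returning an empty mapping for every "skip" case keeps the caller simple: it injects whatever it receives, and an empty result leaves the `default_user` pod template untouched.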

### Additional Context

**Prerequisites already in place:**

* OSMO managed identity already has `AzureML Data Scientist` and `Storage Blob Data Contributor` on the workspace's backing storage (verified in `infrastructure/terraform/modules/platform/role-assignments.tf`).
* `azureml_workspace.value.{name,id,workspace_id}` already exposed at root Terraform output level.
* Federated credential for `system:serviceaccount:<workflows-ns>:osmo-workflow` already present.
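When troubleshooting authentication, a small preflight check along these lines may help confirm the Workload Identity prerequisites inside a workflow pod. This is a sketch with hypothetical helper names. The three environment variables are the ones the Azure Workload Identity webhook normally injects into pods, and the subject format mirrors the federated credential listed above.

```python
import os

# Injected by the Azure Workload Identity mutating webhook.
WORKLOAD_IDENTITY_ENV = (
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_FEDERATED_TOKEN_FILE",
)


def federated_subject(namespace, service_account="osmo-workflow"):
    """Build the subject the federated credential is expected to match."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def missing_workload_identity_env(env=None):
    """List the Workload Identity variables absent from the environment."""
    env = os.environ if env is None else env
    return [key for key in WORKLOAD_IDENTITY_ENV if not env.get(key)]
```

An empty list from `missing_workload_identity_env()` suggests `DefaultAzureCredential` can attempt the Workload Identity flow; a non-empty list usually means the pod is not running under the expected service account or is not labeled for the webhook.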

