feat!(infrastructure): support configurable Azure ML compute clusters #687

Open
fbeltrao wants to merge 3 commits into microsoft:main from fbeltrao:feat/384-customize-aml-compute-cluster

Conversation

Contributor

@fbeltrao fbeltrao commented May 12, 2026

Description

Partially addresses #384 (flexible AML compute clusters)

Replace the single Azure ML compute-cluster interface with a flexible aml_compute_clusters map. The Terraform root module now passes a map of named compute cluster definitions into module.platform, and the platform module creates zero, one, or many Azure ML compute clusters with for_each instead of a single optional resource.
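As a rough sketch of that pattern (resource and attribute names here are illustrative assumptions, not the module's actual code), the platform module iterates the map with for_each:

```hcl
# Hypothetical sketch of the for_each pattern; names are assumptions.
resource "azurerm_machine_learning_compute_cluster" "gpu" {
  for_each = var.aml_compute_clusters

  name                          = each.key
  machine_learning_workspace_id = azurerm_machine_learning_workspace.main.id
  vm_size                       = each.value.vm_size
  vm_priority                   = each.value.vm_priority

  scale_settings {
    min_node_count                       = each.value.min_node_count
    max_node_count                       = each.value.max_node_count
    scale_down_nodes_after_idle_duration = each.value.scale_down_after_idle
  }
}
```

An empty map produces zero resources, which replaces the old should_deploy_aml_compute boolean.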

Each compute cluster now supports independent settings for VM size, priority, autoscaling bounds, idle scale-down, subnet attachment, public IP exposure, SSH public access, identity type, and optional location. The module continues to block subnet attachment when AML managed network isolation is enabled, but now applies that validation to every configured cluster instead of one shared object.
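The per-cluster check could take the shape of a lifecycle precondition evaluated for every map entry; this is a sketch under assumed variable names, and the module's actual check may differ:

```hcl
# Illustrative only: inside the per-cluster resource body, a precondition
# can block subnet attachment when managed network isolation is enabled.
# Variable names here are assumptions.
lifecycle {
  precondition {
    condition     = !(var.aml_managed_network_isolation_enabled && each.value.subnet_id != null)
    error_message = "Cluster ${each.key}: subnet attachment is not supported when AML managed network isolation is enabled."
  }
}
```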

This PR also migrates outputs from a singular aml_compute_cluster value to an aml_compute_clusters map, updates example tfvars and Terraform reference docs to the new configuration shape, and expands Terraform tests to cover defaults, conditionals, validation, and outputs for the multi-cluster deployment path.

This change is breaking for Terraform consumers because the previous should_deploy_aml_compute and aml_compute_config inputs are replaced by aml_compute_clusters, and callers must update any code that reads the old singular output.

Migration Guide

Existing deployments that used should_deploy_aml_compute = true and aml_compute_config must migrate to aml_compute_clusters before applying this change.

# Before
should_deploy_aml_compute = true
aml_compute_config = {
  vm_size               = "Standard_NC4as_T4_v3"
  vm_priority           = "LowPriority"
  min_node_count        = 0
  max_node_count        = 1
  scale_down_after_idle = "PT5M"
  cluster_name          = "gpu-cluster"
}

# After
aml_compute_clusters = {
  "gpu-cluster" = {
    vm_size               = "Standard_NC4as_T4_v3"
    vm_priority           = "LowPriority"
    min_node_count        = 0
    max_node_count        = 1
    scale_down_after_idle = "PT5M"
    identity_type         = "SystemAssigned"
  }
}

Set identity_type = "SystemAssigned" to preserve the previous default compute-cluster identity behavior. Omit identity_type to use the new default, UserAssigned, with the platform managed identity.
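Because the variable is a map, several clusters with independent settings can be declared side by side; for example (values are illustrative):

```hcl
aml_compute_clusters = {
  "gpu-cluster" = {
    vm_size               = "Standard_NC4as_T4_v3"
    vm_priority           = "LowPriority"
    min_node_count        = 0
    max_node_count        = 1
    scale_down_after_idle = "PT5M"
    identity_type         = "SystemAssigned" # preserve the previous default
  }
  "cpu-cluster" = {
    vm_size               = "Standard_DS3_v2"
    vm_priority           = "Dedicated"
    min_node_count        = 0
    max_node_count        = 4
    scale_down_after_idle = "PT15M"
    # identity_type omitted: uses the new UserAssigned default
  }
}
```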

Move existing Terraform state before applying to avoid destroying and recreating the cluster:

terraform state mv \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu[0]' \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]'
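On Terraform 1.1+, a moved block is a declarative alternative to running terraform state mv by hand; note that it must live inside the platform module (moved-block addresses are module-relative) and only works when the map key is known in advance (here the key gpu-cluster is assumed):

```hcl
# Declarative alternative to `terraform state mv`, declared inside the
# platform module. Only viable if the cluster's map key is fixed.
moved {
  from = azurerm_machine_learning_compute_cluster.gpu[0]
  to   = azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]
}
```

The block can be removed once all deployments have applied the migration.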

Update output consumers from terraform output aml_compute_cluster to the new keyed map output:

terraform output -json aml_compute_clusters | jq '."gpu-cluster"'

Type of Change

  • 🐛 Bug fix (non-breaking change fixing an issue)
  • ✨ New feature (non-breaking change adding functionality)
  • 💥 Breaking change (fix or feature causing existing functionality to change)
  • 📚 Documentation update
  • 🏗️ Infrastructure change (Terraform/IaC)
  • ♻️ Refactoring (no functional changes)

Component(s) Affected

  • infrastructure/terraform/prerequisites/ - Azure subscription setup
  • infrastructure/terraform/ - Terraform infrastructure
  • infrastructure/setup/ - OSMO control plane / Helm
  • workflows/ - Training and evaluation workflows
  • training/ - Training pipelines and scripts
  • docs/ - Documentation

Testing Performed

  • Terraform plan reviewed (no unexpected changes)
  • Terraform apply tested in dev environment
  • Training scripts tested locally with Isaac Sim
  • OSMO workflow submitted successfully
  • Smoke tests passed (smoke_test_azure.py)

Notes:

  • The branch includes Terraform test updates for multi-cluster defaults, conditionals, validation, and outputs.
  • The terraform plan and terraform apply items are left unchecked because those results had not been verified when this description was prepared.

Documentation Impact

  • No documentation changes needed
  • Documentation updated in this PR
  • Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

  • Linked to issue being fixed
  • Regression test included, OR
  • Justification for no regression test:

Checklist

BREAKING CHANGE: Replaces should_deploy_aml_compute, aml_compute_config, and the singular aml_compute_cluster output with the aml_compute_clusters map. Existing deployments must update variables, move Terraform state from gpu[0] to the keyed cluster address, and set identity_type = "SystemAssigned" to preserve the previous compute-cluster identity behavior.

@fbeltrao fbeltrao requested a review from a team as a code owner May 12, 2026 08:33

codecov-commenter commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.65%. Comparing base (c7c3aca) to head (7dabd84).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

❌ Your project status has failed because the head coverage (64.51%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #687   +/-   ##
=======================================
  Coverage   88.65%   88.65%           
=======================================
  Files         252      252           
  Lines       18018    18018           
  Branches     2451     2451           
=======================================
  Hits        15974    15974           
  Misses       1577     1577           
  Partials      467      467           
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| pester | 83.16% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-data-pipeline | 100.00% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-dataviewer | 93.60% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-dm-tools | 100.00% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-evaluation | 99.51% <ø> (ø) | |
| pytest-fuzz | 4.89% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-inference | 100.00% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-training | 93.32% <ø> (ø) | Carried forward from 7e73d3e |
| vitest | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-app | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-components | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-features | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-lib | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-state | 86.30% <ø> (ø) | Carried forward from 7e73d3e |

*This pull request uses carry forward flags.

fbeltrao added 2 commits May 12, 2026 09:13
- replace single compute configuration with a map for multiple clusters
- update validation rules for cluster names and properties
- modify outputs to reflect new cluster structure
- enhance tests to cover new cluster configurations

🔧 - Generated by Copilot
…ple cluster

- Remove override_during = plan from conditionals.tftest.hcl mock providers
  to match bare mock_provider style used in defaults, outputs, validation tests
- Change aml_compute_clusters example to empty map with cluster entry commented
  out for safe copy-paste behavior matching prior should_deploy_aml_compute style

🤖 - Generated by Copilot
@fbeltrao fbeltrao force-pushed the feat/384-customize-aml-compute-cluster branch from 4b68d39 to 7e73d3e Compare May 12, 2026 09:13
Collaborator

@katriendg katriendg left a comment


Thanks for the thorough work here — the for_each refactoring, validation rules, and test coverage are all well-structured. A few items to address before merge, mostly around the breaking change lifecycle and migration safety.


1. Breaking change commit convention

The PR correctly marks 💥 Breaking change in the description, but release-please needs either a BREAKING CHANGE: footer in the squash-merge commit or a feat! type suffix to generate the ⚠ BREAKING CHANGES section in CHANGELOG.md and trigger the correct version bump.

Before merge, please add a BREAKING CHANGE: block to the PR body so it flows into the squash-merge commit:

BREAKING CHANGE: The `should_deploy_aml_compute` boolean and `aml_compute_config` object variables
are replaced by `aml_compute_clusters`, a map of named cluster definitions. The `aml_compute_cluster`
output (singular) is replaced by `aml_compute_clusters` (map keyed by cluster name). Existing
deployments require variable migration and `terraform state mv` to avoid cluster recreation.

When merging, use feat!(infrastructure): support configurable Azure ML compute clusters as the squash-merge commit subject, or ensure the BREAKING CHANGE: footer from the PR body is preserved.


2. Migration guide for existing deployments

Users with existing should_deploy_aml_compute = true and aml_compute_config in their terraform.tfvars will hit immediate Terraform errors after pulling this change. The PR should include migration guidance covering:

Variable replacement in terraform.tfvars:

# Before
should_deploy_aml_compute = true
aml_compute_config = {
  vm_size               = "Standard_NC4as_T4_v3"
  vm_priority           = "LowPriority"
  min_node_count        = 0
  max_node_count        = 1
  scale_down_after_idle = "PT5M"
  cluster_name          = "gpu-cluster"
}

# After
aml_compute_clusters = {
  "gpu-cluster" = {
    vm_size               = "Standard_NC4as_T4_v3"
    vm_priority           = "LowPriority"
    min_node_count        = 0
    max_node_count        = 1
    scale_down_after_idle = "PT5M"
  }
}

Terraform state migration (critical — prevents cluster destruction):

terraform state mv \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu[0]' \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]'

Without this command, terraform plan will show a destroy + create, causing downtime and potential training job failures.

Output consumption update:

# Before
terraform output aml_compute_cluster

# After
terraform output -json aml_compute_clusters | jq '."gpu-cluster"'

Consider adding this guidance to the PR description, the infrastructure docs, or both.


3. State migration documentation convention

The count → for_each change means gpu[0] and gpu["key"] are different Terraform state addresses. Consider adding a > [!WARNING] callout to the infrastructure reference docs noting the one-time state migration requirement. This establishes a documentation convention for future count-to-for_each migrations.

node_public_ip_enabled = coalesce(cluster.node_public_ip_enabled, false)
ssh_public_access_enabled = coalesce(cluster.ssh_public_access_enabled, false)
identity_type = coalesce(cluster.identity_type, "UserAssigned")
identity_ids = coalesce(cluster.identity_type, "UserAssigned") == "UserAssigned" ? [local.aml_user_assigned_identity_id] : null
Collaborator


Identity type default changed from SystemAssigned to UserAssigned

The old resource hardcoded identity { type = "SystemAssigned" }. This now defaults to "UserAssigned" with the platform managed identity when identity_type is omitted. This is an additional semantic breaking change worth calling out in the PR description and migration guide.

Existing clusters with SystemAssigned identity will see their identity type changed on next apply, which may trigger resource recreation depending on Azure provider behavior.

Suggestion for migration guide:

If your existing cluster uses SystemAssigned identity (the previous default), add
`identity_type = "SystemAssigned"` to your cluster definition to preserve current behavior.
The new default is UserAssigned with the platform managed identity.

Contributor Author


I called out the identity default change in the PR migration guidance and documented the preservation path for existing deployments. The migration notes now explicitly say to set identity_type to SystemAssigned to keep prior behavior, and clarify that omitting it uses the new UserAssigned default with the platform managed identity.

])
error_message = "aml_compute_clusters identity_type values must be either SystemAssigned or UserAssigned."
}
}
Collaborator


Consider adding ISO 8601 validation for scale_down_after_idle

The other fields in this variable have thorough validation — scale_down_after_idle accepts any string without checks. A malformed value (e.g., "5 minutes" instead of "PT5M") would only fail at Azure API apply time with a cryptic error.

Suggested change
}
error_message = "aml_compute_clusters identity_type values must be either SystemAssigned or UserAssigned."
}
validation {
condition = alltrue([
for _, cluster in var.aml_compute_clusters : can(regex("^PT\\d+[HMS](\\d+[HMS])*$", cluster.scale_down_after_idle))
])
error_message = "aml_compute_clusters scale_down_after_idle must be an ISO 8601 duration (e.g., PT5M, PT30S, PT1H)."
}

Contributor Author


Added validation for scale_down_after_idle so malformed values fail during Terraform validation instead of later at Azure apply time. It now enforces ISO 8601 duration values and includes examples in the validation error message.

… validation

- add migration instructions for transitioning to new compute cluster settings
- introduce validation for scale_down_after_idle parameter in AML compute clusters

🔧 - Generated by Copilot
@fbeltrao fbeltrao changed the title feat(infrastructure): support configurable Azure ML compute clusters feat!(infrastructure): support configurable Azure ML compute clusters May 13, 2026
}))
description = "AzureML managed compute clusters keyed by Azure ML compute cluster name. Empty map deploys no clusters."
default = {}

Contributor


The name validation regex allows 2-character cluster names because {0,22} permits an empty middle segment (1 + 0 + 1 = 2 character minimum). Azure ML compute cluster names require a minimum of 3 characters, so a name like ab passes this validation but fails at apply time with an Azure API error.

Change {0,22} to {1,22} to match the Azure minimum (3 chars), and update the error message to reflect the correct range. The same fix is needed in the root infrastructure/terraform/variables.tf validation block.

Suggested change
for cluster_name, _ in var.aml_compute_clusters : can(regex("^[A-Za-z0-9][A-Za-z0-9-]{1,22}[A-Za-z0-9]$", cluster_name))

cluster_name = "gpu-cluster"
subnet_id = null
validation {
condition = alltrue([
Contributor


The error message claims "2-24 characters" but with the regex fix above the actual enforced range becomes 3-24. The message should be updated to match.

Suggested change
condition = alltrue([
error_message = "aml_compute_clusters keys must be 3-24 characters, start and end with an alphanumeric character, and contain only letters, numbers, and hyphens."

# AzureML compute cluster (when enabled)
terraform output aml_compute_cluster
# AzureML compute clusters keyed by cluster name
terraform output -json aml_compute_clusters | jq -r '."gpu-training".name'
Contributor


The migration guide earlier in this file uses gpu-cluster as the example cluster name throughout, but this example references gpu-training. Readers following the migration guide end-to-end will encounter a mismatch. Using a consistent name across all examples in the file would avoid the confusion.

Suggested change
terraform output -json aml_compute_clusters | jq -r '."gpu-training".name'
terraform output -json aml_compute_clusters | jq -r '."gpu-cluster".name'
