feat!(infrastructure): support configurable Azure ML compute clusters #687

Open
fbeltrao wants to merge 3 commits into microsoft:main from fbeltrao:feat/384-customize-aml-compute-cluster

Conversation

Contributor

@fbeltrao fbeltrao commented May 12, 2026

Description

Partially addresses #384 (flexible AML compute clusters)

Replace the single Azure ML compute-cluster interface with a flexible aml_compute_clusters map. The Terraform root module now passes a map of named compute cluster definitions into module.platform, and the platform module creates zero, one, or many Azure ML compute clusters with for_each instead of a single optional resource.
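As a rough sketch of that pattern (resource and attribute names here are illustrative assumptions, not the module's actual code), the platform module iterates the map with for_each:

```hcl
# Hypothetical sketch of the for_each pattern; names are assumptions.
resource "azurerm_machine_learning_compute_cluster" "gpu" {
  for_each = var.aml_compute_clusters

  name                          = each.key
  machine_learning_workspace_id = azurerm_machine_learning_workspace.main.id
  vm_size                       = each.value.vm_size
  vm_priority                   = each.value.vm_priority

  scale_settings {
    min_node_count                       = each.value.min_node_count
    max_node_count                       = each.value.max_node_count
    scale_down_nodes_after_idle_duration = each.value.scale_down_after_idle
  }
}
```

An empty map produces zero resources, which replaces the old should_deploy_aml_compute boolean.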

Each compute cluster now supports independent settings for VM size, priority, autoscaling bounds, idle scale-down, subnet attachment, public IP exposure, SSH public access, identity type, and optional location. The module continues to block subnet attachment when AML managed network isolation is enabled, but now applies that validation to every configured cluster instead of one shared object.
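The per-cluster check could take the shape of a lifecycle precondition evaluated for every map entry; this is a sketch under assumed variable names, and the module's actual check may differ:

```hcl
# Illustrative only: inside the per-cluster resource body, a precondition
# can block subnet attachment when managed network isolation is enabled.
# Variable names here are assumptions.
lifecycle {
  precondition {
    condition     = !(var.aml_managed_network_isolation_enabled && each.value.subnet_id != null)
    error_message = "Cluster ${each.key}: subnet attachment is not supported when AML managed network isolation is enabled."
  }
}
```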

This PR also migrates outputs from a singular aml_compute_cluster value to an aml_compute_clusters map, updates example tfvars and Terraform reference docs to the new configuration shape, and expands Terraform tests to cover defaults, conditionals, validation, and outputs for the multi-cluster deployment path.

This change is breaking for Terraform consumers because the previous should_deploy_aml_compute and aml_compute_config inputs are replaced by aml_compute_clusters, and callers must update any code that reads the old singular output.

Migration Guide

Existing deployments that used should_deploy_aml_compute = true and aml_compute_config must migrate to aml_compute_clusters before applying this change.

# Before
should_deploy_aml_compute = true
aml_compute_config = {
  vm_size               = "Standard_NC4as_T4_v3"
  vm_priority           = "LowPriority"
  min_node_count        = 0
  max_node_count        = 1
  scale_down_after_idle = "PT5M"
  cluster_name          = "gpu-cluster"
}

# After
aml_compute_clusters = {
  "gpu-cluster" = {
    vm_size               = "Standard_NC4as_T4_v3"
    vm_priority           = "LowPriority"
    min_node_count        = 0
    max_node_count        = 1
    scale_down_after_idle = "PT5M"
    identity_type         = "SystemAssigned"
  }
}

Set identity_type = "SystemAssigned" to preserve the previous default compute-cluster identity behavior. Omit identity_type to use the new default, UserAssigned, with the platform managed identity.
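Because the variable is a map, several clusters with independent settings can be declared side by side; for example (values are illustrative):

```hcl
aml_compute_clusters = {
  "gpu-cluster" = {
    vm_size               = "Standard_NC4as_T4_v3"
    vm_priority           = "LowPriority"
    min_node_count        = 0
    max_node_count        = 1
    scale_down_after_idle = "PT5M"
    identity_type         = "SystemAssigned" # preserve the previous default
  }
  "cpu-cluster" = {
    vm_size               = "Standard_DS3_v2"
    vm_priority           = "Dedicated"
    min_node_count        = 0
    max_node_count        = 4
    scale_down_after_idle = "PT15M"
    # identity_type omitted: uses the new UserAssigned default
  }
}
```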

Move existing Terraform state before applying to avoid destroying and recreating the cluster:

terraform state mv \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu[0]' \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]'
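On Terraform 1.1+, a moved block is a declarative alternative to running terraform state mv by hand; note that it must live inside the platform module (moved-block addresses are module-relative) and only works when the map key is known in advance (here the key gpu-cluster is assumed):

```hcl
# Declarative alternative to `terraform state mv`, declared inside the
# platform module. Only viable if the cluster's map key is fixed.
moved {
  from = azurerm_machine_learning_compute_cluster.gpu[0]
  to   = azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]
}
```

The block can be removed once all deployments have applied the migration.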

Update output consumers from terraform output aml_compute_cluster to the new keyed map output:

terraform output -json aml_compute_clusters | jq '."gpu-cluster"'

Type of Change

  • 🐛 Bug fix (non-breaking change fixing an issue)
  • ✨ New feature (non-breaking change adding functionality)
  • 💥 Breaking change (fix or feature causing existing functionality to change)
  • 📚 Documentation update
  • 🏗️ Infrastructure change (Terraform/IaC)
  • ♻️ Refactoring (no functional changes)

Component(s) Affected

  • infrastructure/terraform/prerequisites/ - Azure subscription setup
  • infrastructure/terraform/ - Terraform infrastructure
  • infrastructure/setup/ - OSMO control plane / Helm
  • workflows/ - Training and evaluation workflows
  • training/ - Training pipelines and scripts
  • docs/ - Documentation

Testing Performed

  • Terraform plan reviewed (no unexpected changes)
  • Terraform apply tested in dev environment
  • Training scripts tested locally with Isaac Sim
  • OSMO workflow submitted successfully
  • Smoke tests passed (smoke_test_azure.py)

Notes:

  • The branch includes Terraform test updates for multi-cluster defaults, conditionals, validation, and outputs.
  • The terraform plan and terraform apply items are left unchecked because those results had not been verified when this description was prepared.

Documentation Impact

  • No documentation changes needed
  • Documentation updated in this PR
  • Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

  • Linked to issue being fixed
  • Regression test included, OR
  • Justification for no regression test:

Checklist

BREAKING CHANGE: Replaces should_deploy_aml_compute, aml_compute_config, and the singular aml_compute_cluster output with the aml_compute_clusters map. Existing deployments must update variables, move Terraform state from gpu[0] to the keyed cluster address, and set identity_type = "SystemAssigned" to preserve the previous compute-cluster identity behavior.

@fbeltrao fbeltrao requested a review from a team as a code owner May 12, 2026 08:33

codecov-commenter commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.65%. Comparing base (c7c3aca) to head (7dabd84).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

❌ Your project status has failed because the head coverage (64.51%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #687   +/-   ##
=======================================
  Coverage   88.65%   88.65%           
=======================================
  Files         252      252           
  Lines       18018    18018           
  Branches     2451     2451           
=======================================
  Hits        15974    15974           
  Misses       1577     1577           
  Partials      467      467           
| Flag | Coverage Δ | *Carryforward flag |
| --- | --- | --- |
| pester | 83.16% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-data-pipeline | 100.00% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-dataviewer | 93.60% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-dm-tools | 100.00% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-evaluation | 99.51% <ø> (ø) | |
| pytest-fuzz | 4.89% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-inference | 100.00% <ø> (ø) | Carried forward from 7e73d3e |
| pytest-training | 93.32% <ø> (ø) | Carried forward from 7e73d3e |
| vitest | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-app | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-components | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-features | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-lib | 86.30% <ø> (ø) | Carried forward from 7e73d3e |
| vitest-state | 86.30% <ø> (ø) | Carried forward from 7e73d3e |

*This pull request uses carry forward flags.

fbeltrao added 2 commits May 12, 2026 09:13
- replace single compute configuration with a map for multiple clusters
- update validation rules for cluster names and properties
- modify outputs to reflect new cluster structure
- enhance tests to cover new cluster configurations

🔧 - Generated by Copilot
…ple cluster

- Remove override_during = plan from conditionals.tftest.hcl mock providers
  to match bare mock_provider style used in defaults, outputs, validation tests
- Change aml_compute_clusters example to empty map with cluster entry commented
  out for safe copy-paste behavior matching prior should_deploy_aml_compute style

🤖 - Generated by Copilot
@fbeltrao fbeltrao force-pushed the feat/384-customize-aml-compute-cluster branch from 4b68d39 to 7e73d3e Compare May 12, 2026 09:13
Collaborator

@katriendg katriendg left a comment


Thanks for the thorough work here — the for_each refactoring, validation rules, and test coverage are all well-structured. A few items to address before merge, mostly around the breaking change lifecycle and migration safety.


1. Breaking change commit convention

The PR correctly marks 💥 Breaking change in the description, but release-please needs either a BREAKING CHANGE: footer in the squash-merge commit or a feat! type suffix to generate the ⚠ BREAKING CHANGES section in CHANGELOG.md and trigger the correct version bump.

Before merge, please add a BREAKING CHANGE: block to the PR body so it flows into the squash-merge commit:

BREAKING CHANGE: The `should_deploy_aml_compute` boolean and `aml_compute_config` object variables
are replaced by `aml_compute_clusters`, a map of named cluster definitions. The `aml_compute_cluster`
output (singular) is replaced by `aml_compute_clusters` (map keyed by cluster name). Existing
deployments require variable migration and `terraform state mv` to avoid cluster recreation.

When merging, use feat!(infrastructure): support configurable Azure ML compute clusters as the squash-merge commit subject, or ensure the BREAKING CHANGE: footer from the PR body is preserved.


2. Migration guide for existing deployments

Users with existing should_deploy_aml_compute = true and aml_compute_config in their terraform.tfvars will hit immediate Terraform errors after pulling this change. The PR should include migration guidance covering:

Variable replacement in terraform.tfvars:

# Before
should_deploy_aml_compute = true
aml_compute_config = {
  vm_size               = "Standard_NC4as_T4_v3"
  vm_priority           = "LowPriority"
  min_node_count        = 0
  max_node_count        = 1
  scale_down_after_idle = "PT5M"
  cluster_name          = "gpu-cluster"
}

# After
aml_compute_clusters = {
  "gpu-cluster" = {
    vm_size               = "Standard_NC4as_T4_v3"
    vm_priority           = "LowPriority"
    min_node_count        = 0
    max_node_count        = 1
    scale_down_after_idle = "PT5M"
  }
}

Terraform state migration (critical — prevents cluster destruction):

terraform state mv \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu[0]' \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]'

Without this command, terraform plan will show a destroy + create, causing downtime and potential training job failures.

Output consumption update:

# Before
terraform output aml_compute_cluster

# After
terraform output -json aml_compute_clusters | jq '."gpu-cluster"'

Consider adding this guidance to the PR description, the infrastructure docs, or both.


3. State migration documentation convention

The count → for_each change means gpu[0] and gpu["key"] are different Terraform state addresses. Consider adding a > [!WARNING] callout to the infrastructure reference docs noting the one-time state migration requirement. This establishes a documentation convention for future count-to-for_each migrations.

node_public_ip_enabled = coalesce(cluster.node_public_ip_enabled, false)
ssh_public_access_enabled = coalesce(cluster.ssh_public_access_enabled, false)
identity_type = coalesce(cluster.identity_type, "UserAssigned")
identity_ids = coalesce(cluster.identity_type, "UserAssigned") == "UserAssigned" ? [local.aml_user_assigned_identity_id] : null
Collaborator


Identity type default changed from SystemAssigned to UserAssigned

The old resource hardcoded identity { type = "SystemAssigned" }. This now defaults to "UserAssigned" with the platform managed identity when identity_type is omitted. This is an additional semantic breaking change worth calling out in the PR description and migration guide.

Existing clusters with SystemAssigned identity will see their identity type changed on next apply, which may trigger resource recreation depending on Azure provider behavior.

Suggestion for migration guide:

If your existing cluster uses SystemAssigned identity (the previous default), add
`identity_type = "SystemAssigned"` to your cluster definition to preserve current behavior.
The new default is UserAssigned with the platform managed identity.

Contributor Author


I called out the identity default change in the PR migration guidance and documented the preservation path for existing deployments. The migration notes now explicitly say to set identity_type to SystemAssigned to keep prior behavior, and clarify that omitting it uses the new UserAssigned default with the platform managed identity.

])
error_message = "aml_compute_clusters identity_type values must be either SystemAssigned or UserAssigned."
}
}
Collaborator


Consider adding ISO 8601 validation for scale_down_after_idle

The other fields in this variable have thorough validation — scale_down_after_idle accepts any string without checks. A malformed value (e.g., "5 minutes" instead of "PT5M") would only fail at Azure API apply time with a cryptic error.

Suggested change
}
error_message = "aml_compute_clusters identity_type values must be either SystemAssigned or UserAssigned."
}
validation {
condition = alltrue([
for _, cluster in var.aml_compute_clusters : can(regex("^PT\\d+[HMS](\\d+[HMS])*$", cluster.scale_down_after_idle))
])
error_message = "aml_compute_clusters scale_down_after_idle must be an ISO 8601 duration (e.g., PT5M, PT30S, PT1H)."
}

Contributor Author


Added validation for scale_down_after_idle so malformed values fail during Terraform validation instead of later at Azure apply time. It now enforces ISO 8601 duration values and includes examples in the validation error message.

… validation

- add migration instructions for transitioning to new compute cluster settings
- introduce validation for scale_down_after_idle parameter in AML compute clusters

🔧 - Generated by Copilot
@fbeltrao fbeltrao changed the title feat(infrastructure): support configurable Azure ML compute clusters feat!(infrastructure): support configurable Azure ML compute clusters May 13, 2026
}))
description = "AzureML managed compute clusters keyed by Azure ML compute cluster name. Empty map deploys no clusters."
default = {}

Contributor


The name validation regex allows 2-character cluster names because {0,22} permits an empty middle segment (1 + 0 + 1 = 2 character minimum). Azure ML compute cluster names require a minimum of 3 characters, so a name like ab passes this validation but fails at apply time with an Azure API error.

Change {0,22} to {1,22} to match the Azure minimum (3 chars), and update the error message to reflect the correct range. The same fix is needed in the root infrastructure/terraform/variables.tf validation block.

Suggested change
for cluster_name, _ in var.aml_compute_clusters : can(regex("^[A-Za-z0-9][A-Za-z0-9-]{1,22}[A-Za-z0-9]$", cluster_name))

cluster_name = "gpu-cluster"
subnet_id = null
validation {
condition = alltrue([
Contributor


The error message claims "2-24 characters" but with the regex fix above the actual enforced range becomes 3-24. The message should be updated to match.

Suggested change
condition = alltrue([
error_message = "aml_compute_clusters keys must be 3-24 characters, start and end with an alphanumeric character, and contain only letters, numbers, and hyphens."

# AzureML compute cluster (when enabled)
terraform output aml_compute_cluster
# AzureML compute clusters keyed by cluster name
terraform output -json aml_compute_clusters | jq -r '."gpu-training".name'
Contributor


The migration guide earlier in this file uses gpu-cluster as the example cluster name throughout, but this example references gpu-training. Readers following the migration guide end-to-end will encounter a mismatch. Using a consistent name across all examples in the file would avoid the confusion.

Suggested change
terraform output -json aml_compute_clusters | jq -r '."gpu-training".name'
terraform output -json aml_compute_clusters | jq -r '."gpu-cluster".name'
