feat!(infrastructure): support configurable Azure ML compute clusters #687
fbeltrao wants to merge 3 commits into
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (64.51%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #687 +/- ##
=======================================
Coverage 88.65% 88.65%
=======================================
Files 252 252
Lines 18018 18018
Branches 2451 2451
=======================================
Hits 15974 15974
Misses 1577 1577
Partials 467 467
*This pull request uses carry forward flags.
- replace single compute configuration with a map for multiple clusters
- update validation rules for cluster names and properties
- modify outputs to reflect new cluster structure
- enhance tests to cover new cluster configurations

🔧 - Generated by Copilot
…ple cluster
- Remove override_during = plan from conditionals.tftest.hcl mock providers to match bare mock_provider style used in defaults, outputs, validation tests
- Change aml_compute_clusters example to empty map with cluster entry commented out for safe copy-paste behavior matching prior should_deploy_aml_compute style

🤖 - Generated by Copilot
Force-pushed from 4b68d39 to 7e73d3e (Compare)
katriendg left a comment
Thanks for the thorough work here — the for_each refactoring, validation rules, and test coverage are all well-structured. A few items to address before merge, mostly around the breaking change lifecycle and migration safety.
1. Breaking change commit convention
The PR correctly marks 💥 Breaking change in the description, but release-please needs either a BREAKING CHANGE: footer in the squash-merge commit or a feat! type suffix to generate the ⚠ BREAKING CHANGES section in CHANGELOG.md and trigger the correct version bump.
Before merge, please add a BREAKING CHANGE: block to the PR body so it flows into the squash-merge commit:
BREAKING CHANGE: The `should_deploy_aml_compute` boolean and `aml_compute_config` object variables
are replaced by `aml_compute_clusters`, a map of named cluster definitions. The `aml_compute_cluster`
output (singular) is replaced by `aml_compute_clusters` (map keyed by cluster name). Existing
deployments require variable migration and `terraform state mv` to avoid cluster recreation.
When merging, use `feat!(infrastructure): support configurable Azure ML compute clusters` as the squash-merge commit subject, or ensure the `BREAKING CHANGE:` footer from the PR body is preserved.
2. Migration guide for existing deployments
Users with existing should_deploy_aml_compute = true and aml_compute_config in their terraform.tfvars will hit immediate Terraform errors after pulling this change. The PR should include migration guidance covering:
Variable replacement in terraform.tfvars:
# Before
should_deploy_aml_compute = true
aml_compute_config = {
vm_size = "Standard_NC4as_T4_v3"
vm_priority = "LowPriority"
min_node_count = 0
max_node_count = 1
scale_down_after_idle = "PT5M"
cluster_name = "gpu-cluster"
}
# After
aml_compute_clusters = {
"gpu-cluster" = {
vm_size = "Standard_NC4as_T4_v3"
vm_priority = "LowPriority"
min_node_count = 0
max_node_count = 1
scale_down_after_idle = "PT5M"
}
}

Terraform state migration (critical: prevents cluster destruction):
terraform state mv \
'module.platform.azurerm_machine_learning_compute_cluster.gpu[0]' \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]'

Without this command, terraform plan will show a destroy + create, causing downtime and potential training job failures.
Output consumption update:
# Before
terraform output aml_compute_cluster
# After
terraform output -json aml_compute_clusters | jq '."gpu-cluster"'

Consider adding this guidance to the PR description, the infrastructure docs, or both.
3. State migration documentation convention
The count → for_each change means gpu[0] and gpu["key"] are different Terraform state addresses. Consider adding a > [!WARNING] callout to the infrastructure reference docs noting the one-time state migration requirement. This establishes a documentation convention for future count-to-for_each migrations.
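As a sketch of why the two addresses differ, compare simplified resource blocks for the two shapes (resource bodies elided; the resource label `gpu` is taken from the state addresses in the migration guide above):

```hcl
# Before: optional single cluster via count; state address ends in gpu[0]
resource "azurerm_machine_learning_compute_cluster" "gpu" {
  count = var.should_deploy_aml_compute ? 1 : 0
  # ... cluster arguments ...
}

# After: map of named clusters via for_each; state address ends in gpu["gpu-cluster"]
resource "azurerm_machine_learning_compute_cluster" "gpu" {
  for_each = var.aml_compute_clusters
  name     = each.key
  # ... per-cluster arguments read from each.value ...
}
```

Because the index changes from a number to a map key, Terraform treats the old and new addresses as different resources unless the state is moved.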
node_public_ip_enabled    = coalesce(cluster.node_public_ip_enabled, false)
ssh_public_access_enabled = coalesce(cluster.ssh_public_access_enabled, false)
identity_type             = coalesce(cluster.identity_type, "UserAssigned")
identity_ids              = coalesce(cluster.identity_type, "UserAssigned") == "UserAssigned" ? [local.aml_user_assigned_identity_id] : null
Identity type default changed from SystemAssigned to UserAssigned
The old resource hardcoded identity { type = "SystemAssigned" }. This now defaults to "UserAssigned" with the platform managed identity when identity_type is omitted. This is an additional semantic breaking change worth calling out in the PR description and migration guide.
Existing clusters with SystemAssigned identity will see their identity type changed on next apply, which may trigger resource recreation depending on Azure provider behavior.
Suggestion for migration guide:
If your existing cluster uses SystemAssigned identity (the previous default), add
`identity_type = "SystemAssigned"` to your cluster definition to preserve current behavior.
The new default is UserAssigned with the platform managed identity.
I called out the identity default change in the PR migration guidance and documented the preservation path for existing deployments. The migration notes now explicitly say to set identity_type to SystemAssigned to keep prior behavior, and clarify that omitting it uses the new UserAssigned default with the platform managed identity.
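The preservation path can be sketched as a tfvars entry (attribute values borrowed from the migration example earlier in this review; only the identity line is new):

```hcl
aml_compute_clusters = {
  "gpu-cluster" = {
    vm_size               = "Standard_NC4as_T4_v3"
    vm_priority           = "LowPriority"
    min_node_count        = 0
    max_node_count        = 1
    scale_down_after_idle = "PT5M"
    # Keep the pre-migration identity; omit this line to adopt the new
    # UserAssigned default with the platform managed identity.
    identity_type         = "SystemAssigned"
  }
}
```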
])
error_message = "aml_compute_clusters identity_type values must be either SystemAssigned or UserAssigned."
}
}
Consider adding ISO 8601 validation for scale_down_after_idle
The other fields in this variable have thorough validation, but scale_down_after_idle accepts any string without checks. A malformed value (e.g., "5 minutes" instead of "PT5M") would only fail at Azure API apply time with a cryptic error.
}
error_message = "aml_compute_clusters identity_type values must be either SystemAssigned or UserAssigned."
}
validation {
  condition = alltrue([
    for _, cluster in var.aml_compute_clusters : can(regex("^P(T\\d+[HMS])+$", cluster.scale_down_after_idle))
  ])
  error_message = "aml_compute_clusters scale_down_after_idle must be an ISO 8601 duration (e.g., PT5M, PT30S, PT1H)."
}
Added validation for scale_down_after_idle so malformed values fail during Terraform validation instead of later at Azure apply time. It now enforces ISO 8601 duration values and includes examples in the validation error message.
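One nuance worth noting: in the suggested pattern the T sits inside the repeating group, so a multi-component duration such as PT1H30M would be rejected even though it is a valid ISO 8601 duration. The pattern can be spot-checked outside Terraform with grep -E; this sketch uses a variant with the T moved outside the group (sample values are illustrative):

```shell
# Accepts ISO 8601 time durations such as PT5M, PT30S, and PT1H30M.
duration_ok() {
  echo "$1" | grep -Eq '^PT([0-9]+[HMS])+$'
}

# PT5M and PT1H30M are valid; "5 minutes" is not.
for v in PT5M PT1H30M "5 minutes"; do
  if duration_ok "$v"; then echo "$v: valid"; else echo "$v: invalid"; fi
done
```

If multi-component durations should be supported, the Terraform regex can adopt the same change: "^PT(\\d+[HMS])+$".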
… validation
- add migration instructions for transitioning to new compute cluster settings
- introduce validation for scale_down_after_idle parameter in AML compute clusters

🔧 - Generated by Copilot
}))
description = "AzureML managed compute clusters keyed by Azure ML compute cluster name. Empty map deploys no clusters."
default     = {}
The name validation regex allows 2-character cluster names because `{0,22}` permits an empty middle segment (1 + 0 + 1 = 2-character minimum). Azure ML compute cluster names require a minimum of 3 characters, so a name like `ab` passes this validation but fails at apply time with an Azure API error.
Change `{0,22}` to `{1,22}` to match the Azure minimum (3 chars), and update the error message to reflect the correct range. The same fix is needed in the root infrastructure/terraform/variables.tf validation block.
for cluster_name, _ in var.aml_compute_clusters : can(regex("^[A-Za-z0-9][A-Za-z0-9-]{1,22}[A-Za-z0-9]$", cluster_name))
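The corrected pattern can be spot-checked with grep -E; a small sketch with illustrative sample names:

```shell
# 3-24 chars, alphanumeric at both ends, hyphens allowed in the middle.
name_ok() {
  echo "$1" | grep -Eq '^[A-Za-z0-9][A-Za-z0-9-]{1,22}[A-Za-z0-9]$'
}

# abc and gpu-cluster are accepted; ab (too short), -bad- (hyphen at the
# ends), and the 25-character name are rejected.
for n in ab abc gpu-cluster -bad- abcdefghijklmnopqrstuvwxy; do
  if name_ok "$n"; then echo "$n: accepted"; else echo "$n: rejected"; fi
done
```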
cluster_name = "gpu-cluster"
subnet_id    = null
validation {
  condition = alltrue([
The error message claims "2-24 characters" but with the regex fix above the actual enforced range becomes 3-24. The message should be updated to match.
condition = alltrue([
error_message = "aml_compute_clusters keys must be 3-24 characters, start and end with an alphanumeric character, and contain only letters, numbers, and hyphens."
# AzureML compute cluster (when enabled)
terraform output aml_compute_cluster
# AzureML compute clusters keyed by cluster name
terraform output -json aml_compute_clusters | jq -r '."gpu-training".name'
The migration guide earlier in this file uses gpu-cluster as the example cluster name throughout, but this example references gpu-training. Readers following the migration guide end-to-end will encounter a mismatch. Using a consistent name across all examples in the file would avoid the confusion.
terraform output -json aml_compute_clusters | jq -r '."gpu-training".name'
terraform output -json aml_compute_clusters | jq -r '."gpu-cluster".name'
Description
Partially addresses #384 (flexible AML compute clusters)
Replace the single Azure ML compute-cluster interface with a flexible `aml_compute_clusters` map. The Terraform root module now passes a map of named compute cluster definitions into `module.platform`, and the platform module creates zero, one, or many Azure ML compute clusters with `for_each` instead of a single optional resource.

Each compute cluster now supports independent settings for VM size, priority, autoscaling bounds, idle scale-down, subnet attachment, public IP exposure, SSH public access, identity type, and optional location. The module continues to block subnet attachment when AML managed network isolation is enabled, but now applies that validation to every configured cluster instead of one shared object.

This PR also migrates outputs from a singular `aml_compute_cluster` value to an `aml_compute_clusters` map, updates example tfvars and Terraform reference docs to the new configuration shape, and expands Terraform tests to cover defaults, conditionals, validation, and outputs for the multi-cluster deployment path.

This change is breaking for Terraform consumers because the previous `should_deploy_aml_compute` and `aml_compute_config` inputs are replaced by `aml_compute_clusters`, and callers must update any code that reads the old singular output.

Migration Guide
Existing deployments that used `should_deploy_aml_compute = true` and `aml_compute_config` must migrate to `aml_compute_clusters` before applying this change.

Set `identity_type = "SystemAssigned"` to preserve the previous default compute-cluster identity behavior. Omit `identity_type` to use the new default, `UserAssigned`, with the platform managed identity.

Move existing Terraform state before applying to avoid destroying and recreating the cluster:

terraform state mv \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu[0]' \
  'module.platform.azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]'

Update output consumers from `terraform output aml_compute_cluster` to the new keyed map output:

terraform output -json aml_compute_clusters | jq '."gpu-cluster"'

Type of Change
Component(s) Affected
- infrastructure/terraform/prerequisites/ - Azure subscription setup
- infrastructure/terraform/ - Terraform infrastructure
- infrastructure/setup/ - OSMO control plane / Helm
- workflows/ - Training and evaluation workflows
- training/ - Training pipelines and scripts
- docs/ - Documentation
- `terraform plan` reviewed (no unexpected changes)
- `terraform apply` tested in dev environment
- Smoke test (smoke_test_azure.py)

Notes:
- `terraform plan` and `terraform apply` execution are not claimed here because those results were not verified while preparing the tracking artifact.
Bug Fix Checklist
Complete this section for bug fix PRs. Skip for other contribution types.
Checklist
BREAKING CHANGE: Replaces `should_deploy_aml_compute`, `aml_compute_config`, and the singular `aml_compute_cluster` output with the `aml_compute_clusters` map. Existing deployments must update variables, move Terraform state from `gpu[0]` to the keyed cluster address, and set `identity_type = "SystemAssigned"` to preserve the previous compute-cluster identity behavior.