feat(infrastructure): add manage-node-pools script and documentation #548
bindsi wants to merge 12 commits into
- implement script for managing AKS node pools
- create documentation for node pool management
- include usage examples and command options

Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>
Dependency Review
✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Snapshot Warnings: Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings.
Scanned Files: None
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (64.51%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Coverage diff, main vs. #548:

| Metric | main | #548 | +/- |
| --- | --- | --- | --- |
| Coverage | 88.65% | 88.65% | |
| Files | 252 | 252 | |
| Lines | 18019 | 18015 | -4 |
| Branches | 2451 | 2451 | |
| Hits | 15974 | 15971 | -3 |
| Misses | 1578 | 1577 | -1 |
| Partials | 467 | 467 | |

This pull request uses carry forward flags.
bindsi left a comment
Addressed both review comments:
- VM SKU clarification (line 24): AKS node pools are single-SKU, so a different SKU means a new pool. Updated the "When to Use" section to call this out explicitly and reframed the examples accordingly.
- --config-preview consistency (line 55): --skip-apply and --config-preview have different semantics. --skip-apply still writes the overlay, while --config-preview in the 01–04 scripts is a true no-mutation preview. Added a real --config-preview flag that prints the resolved config and exits before any overlay write, terraform apply, or OSMO sync. Kept --skip-apply for the mutate-overlay-only workflow and clarified the distinction in both the help text and the docs table.
shellcheck and markdownlint-cli2 both clean.
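The distinction between the two flags can be sketched as control flow. This is a minimal illustration, not the script from this PR: the function name `run_pool_change`, its arguments, and the messages are hypothetical; only the flag names `--config-preview` and `--skip-apply` come from the comment above.

```shell
# Hypothetical sketch: only the two flag names come from the discussion above.
# run_pool_change OVERLAY_PATH RESOLVED_JSON [--config-preview|--skip-apply]
run_pool_change() {
  local overlay="$1" resolved="$2" mode="${3:-}"

  if [ "$mode" = "--config-preview" ]; then
    # True preview: print the resolved config and return before any write.
    printf '%s\n' "$resolved"
    return 0
  fi

  # Every other path mutates the overlay file.
  printf '%s\n' "$resolved" > "$overlay"

  if [ "$mode" = "--skip-apply" ]; then
    echo "overlay written; skipping terraform apply and OSMO sync" >&2
    return 0
  fi

  # Placeholder for the apply and sync steps, which this sketch does not run.
  echo "would run terraform apply, then the OSMO sync" >&2
}
```

The preview branch returns before the write, so it leaves no file behind, whereas `--skip-apply` stops only after the overlay already exists on disk.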
…g-preview - Note pools are single-SKU and add/remove are independent operations - Add --config-preview flag matching 01-04 deploy scripts - Distinguish --skip-apply (writes overlay) from --config-preview (no mutation) 🤖 Generated by Copilot
katriendg left a comment
Thanks for putting this together — the documentation is excellent and the script follows repo conventions well. A few observations, one architectural and a few smaller items:
⚠️ Overlay introduces drift risk (High)
The managed overlay (node-pools.managed.auto.tfvars.json) creates a dual source-of-truth for node_pools. Because Terraform loads *.auto.tfvars.json after terraform.tfvars, any edits an operator makes to node_pools in terraform.tfvars are silently ignored once the overlay exists.
Concerns:
- Terraform already supports this workflow directly. Edit node_pools in terraform.tfvars → terraform plan → terraform apply → 04-deploy-osmo-backend.sh. This is the established pattern for every other Terraform-managed resource and requires no new abstraction.
- seed_managed_tfvars() forks config permanently. Once it snapshots var.node_pools into the overlay, the overlay becomes sole source of truth. This fork is invisible to anyone looking at terraform.tfvars.
- Ambiguous ownership. The overlay is neither .gitignored nor required to be committed; teams will split on whether it's tracked, leading to divergent cluster states across environments.
- Future risk. Any script or CI pipeline that touches terraform.tfvars node_pools will silently lose to the overlay without warning.
Suggestion: Consider reducing the script to a thin wrapper that:
- Validates inputs (the flag-to-JSON translation is genuinely valuable)
- Prints a ready-to-paste node_pools map entry for terraform.tfvars
- Optionally chains terraform apply (without -auto-approve) + 04-deploy-osmo-backend.sh
This preserves the UX convenience without introducing a parallel config mechanism. If the overlay approach is kept, at minimum: add it to .gitignore, print a prominent warning when seeded, and remove -auto-approve.
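The "prominent warning" suggested above could look roughly like the following sketch. The overlay file name is taken from this review; the helper name `warn_if_overlay_present`, the variable `OVERLAY_NAME`, and the message wording are assumptions for illustration.

```shell
# Hypothetical helper: warn when the managed overlay would shadow terraform.tfvars.
# The file name comes from the review; everything else is illustrative.
OVERLAY_NAME="node-pools.managed.auto.tfvars.json"

# warn_if_overlay_present TERRAFORM_DIR
# Returns 1 (after printing a warning to stderr) if the overlay exists, else 0.
warn_if_overlay_present() {
  local tf_dir="$1"
  if [ -f "$tf_dir/$OVERLAY_NAME" ]; then
    echo "WARNING: $tf_dir/$OVERLAY_NAME exists; its node_pools value overrides terraform.tfvars." >&2
    return 1
  fi
  return 0
}
```

A caller might run this before any plan or apply step so that an operator editing terraform.tfvars learns about the overlay instead of having the edit silently ignored.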
⚠️ terraform apply -auto-approve (Medium)
The deploy scripts (01–04) never run terraform apply themselves — they read from outputs and assume the user applied separately. This script breaks that pattern with -auto-approve, which can apply unrelated drift if state is out of sync. Recommend removing -auto-approve so operators see the full plan before confirming.
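One way to let operators see the full plan before confirming is to save the plan and apply only that saved file. The sketch below is an assumption about how a wrapper might do this, not code from the PR: `apply_with_review`, the `TERRAFORM` variable, and the default plan file name are hypothetical. `terraform plan -out=FILE` and `terraform apply FILE` are standard Terraform CLI usage.

```shell
# Sketch: save a plan, show it, and apply only that saved plan after confirmation.
# TERRAFORM defaults to the real CLI; the variable and function name are
# assumptions for illustration.
TERRAFORM="${TERRAFORM:-terraform}"

# apply_with_review [PLAN_FILE]
apply_with_review() {
  local plan_file="${1:-node-pools.tfplan}" answer
  # plan prints the proposed changes and stores them in plan_file.
  "$TERRAFORM" plan -out="$plan_file" || return 1
  printf 'Apply this plan? Type "yes" to continue: ' >&2
  read -r answer || answer=""
  if [ "$answer" != "yes" ]; then
    echo "aborted; nothing applied" >&2
    return 1
  fi
  # Applying a saved plan file executes the reviewed changes, not a fresh plan.
  "$TERRAFORM" apply "$plan_file"
}
```

Running plain `terraform apply` without `-auto-approve` gives a similar interactive prompt; the saved-plan variant additionally ensures that what is applied is what was shown.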
💡 terraform console vs read_terraform_outputs (Medium)
All other setup scripts use read_terraform_outputs + tf_get/tf_require. The node_pools value is already exposed as a Terraform output. Using terraform console to evaluate var.node_pools is non-standard and reads the variable definition (pre-apply) rather than applied state. For cmd_list, consider using the output for consistency.
💡 Minor items (Low, non-blocking)
- --skip-apply naming: The overlay is already written when this flag takes effect; consider --write-overlay-only or printing a warning.
- OSMO sync scope: 04-deploy-osmo-backend.sh does a full backend cycle (Helm upgrade, tokens, storage), not just pool reconciliation. Worth documenting that sync is heavier than it sounds.
- --config-preview on list: Silently ignored. Either handle it or note it's not applicable.
- cspell cleanup: The ~304 entry deduplication is beneficial but undocumented in the PR body; worth a brief mention so reviewers know it's intentional.
zscaler

# Academic & Technical Terms
anonymization
BTW, large parts of this file's changes delete entire categories. Is that intentional?
nope, I think it was a merge problem...
…ools
# Conflicts:
#   .cspell.json
#   .cspell/general-technical.txt
… MD013 🤖 Generated by Copilot
…only workflow
- Delete optional/manage-node-pools.sh and its overlay file mechanism
- Rewrite manage-node-pools.md as a guide: edit tfvars, terraform apply, rerun 04
- Document ForceNew vs in-place node_pools fields
- Cover resize, add, remove, two-step SKU upgrade, and in-place SKU replace
- Drop script entry from cluster-setup-advanced.md optional scripts table

🤖 Generated by Copilot
rezatnoMsirhC left a comment
Thanks for this contribution. The documentation is thorough and the ForceNew field table is genuinely useful.
The PR title and description still describe the shell script (list/add/remove/sync subcommands, the managed overlay file, seed_managed_tfvars(), --skip-apply) that was removed in e9546cb0. The actual diff is documentation-only. Worth updating the title and body to reflect the current scope so git history stays accurate for future contributors.
feat(infrastructure): add manage-node-pools script for post-deployment pool edits
Description
Adds a script and documentation for adding, removing, or resizing AKS node pools on a running cluster without redeploying infrastructure or the OSMO control plane. The original cluster was sized with 4 vCPU nodes, but an SDG workflow needed more than 6.5 vCPU, and the previous path to fix that was a full reinstall. This PR narrows the blast radius to a single node pool and its subnet by driving changes through Terraform's existing for_each over node_pools, then reconciles OSMO's POD_TEMPLATE, POOL, and BACKEND configs automatically.
Type of Change
Component(s) Affected
- infrastructure/terraform/prerequisites/ - Azure subscription setup
- infrastructure/terraform/ - Terraform infrastructure
- infrastructure/setup/ - OSMO control plane / Helm
- workflows/ - Training and evaluation workflows
- training/ - Training pipelines and scripts
- docs/ - Documentation
Testing Performed
- … plan reviewed (no unexpected changes)
- … apply tested in dev environment
- … smoke_test_azure.py)

Local verification performed:
- shellcheck passes on infrastructure/setup/optional/manage-node-pools.sh.
- markdownlint-cli2 passes on both docs/infrastructure/manage-node-pools.md and docs/infrastructure/cluster-setup-advanced.md.
- bash manage-node-pools.sh list against the current Terraform state returns the existing gpu pool row as expected.
- bash manage-node-pools.sh --help renders the full usage block.

End-to-end add/remove runs against a live cluster have not been executed in this branch; the boxes above are intentionally left unchecked.
Documentation Impact
Bug Fix Checklist
Complete this section for bug fix PRs. Skip for other contribution types.
Checklist
Changes
Script
- list prints the current node_pools table (name, VM size, priority, autoscale range, taints).
- add creates a new pool from CLI flags covering vm-size, subnet, priority, node-count or auto-scale with min-count/max-count, repeatable taint/label/zone, eviction-policy (Spot only), and gpu-driver. Rejects duplicate pool names and validates flag combinations.
- remove deletes a pool from the overlay and warns when removal empties the map or when DEFAULT_POOL from .env.local matches the pool being removed.
- sync re-renders OSMO configs without a Terraform apply (useful after manual terraform.tfvars edits).
- … var.node_pools through terraform console, so Terraform's existing for_each on azurerm_kubernetes_cluster_node_pool, subnets, NSG associations, and NAT gateway associations only touches the added or removed pool.
- … terraform apply -auto-approve (skippable with --skip-apply) and then invokes infrastructure/setup/04-deploy-osmo-backend.sh to regenerate OSMO POD_TEMPLATE, POOL, and BACKEND configs. Operator-supplied flags pass through via --osmo-args so the same auth and ACR settings from the original deploy are preserved.
- … set -o errexit -o nounset, sources scripts/lib/common.sh and defaults.conf, and uses the info/warn/fatal/section/print_kv helpers.
Documentation
- … for_each semantics, prerequisites, full flag tables, four worked examples (list, CPU pool for SDG, Spot H100 with autoscaling, remove, sync), verification commands (kubectl get nodes, az aks nodepool list, osmo config show POOL), and operational notes on subnet planning, DEFAULT_POOL drift, overlay-as-source-of-truth, Spot constraints, and autoscaling.
Related Issues
None.
Notes
- … .auto.tfvars.json overlay is not added to .gitignore; operators can either commit it to share pool composition with the team or keep it local alongside terraform.tfvars.
- … *.auto.tfvars* after terraform.tfvars). The new documentation flags this explicitly.
Follow-up Tasks
- … add and remove end-to-end on a dev cluster and update the Testing Performed checkboxes above.