Skip to content

feat(infrastructure): add manage-node-pools script and documentation#548

Open
bindsi wants to merge 12 commits into
mainfrom
feature/manage-node-pools
Open

feat(infrastructure): add manage-node-pools script and documentation#548
bindsi wants to merge 12 commits into
mainfrom
feature/manage-node-pools

Conversation

@bindsi
Copy link
Copy Markdown
Member

@bindsi bindsi commented Apr 24, 2026

feat(infrastructure): add manage-node-pools script for post-deployment pool edits

Description

Adds a script and documentation for adding, removing, or resizing AKS node pools on a running cluster without redeploying infrastructure or the OSMO control plane. The original cluster was sized with 4 vCPU nodes, but an SDG workflow needed more than 6.5 vCPU, and the previous path to fix that was a full reinstall. This PR narrows the blast radius to a single node pool and its subnet by driving changes through Terraform's existing for_each over node_pools, then reconciles OSMO's POD_TEMPLATE, POOL, and BACKEND configs automatically.

Type of Change

  • 🐛 Bug fix (non-breaking change fixing an issue)
  • ✨ New feature (non-breaking change adding functionality)
  • 💥 Breaking change (fix or feature causing existing functionality to change)
  • 📚 Documentation update
  • 🏗️ Infrastructure change (Terraform/IaC)
  • ♻️ Refactoring (no functional changes)

Component(s) Affected

  • infrastructure/terraform/prerequisites/ - Azure subscription setup
  • infrastructure/terraform/ - Terraform infrastructure
  • infrastructure/setup/ - OSMO control plane / Helm
  • workflows/ - Training and evaluation workflows
  • training/ - Training pipelines and scripts
  • docs/ - Documentation

Testing Performed

  • Terraform plan reviewed (no unexpected changes)
  • Terraform apply tested in dev environment
  • Training scripts tested locally with Isaac Sim
  • OSMO workflow submitted successfully
  • Smoke tests passed (smoke_test_azure.py)

Local verification performed:

  • shellcheck passes on infrastructure/setup/optional/manage-node-pools.sh.
  • markdownlint-cli2 passes on both docs/infrastructure/manage-node-pools.md and docs/infrastructure/cluster-setup-advanced.md.
  • bash manage-node-pools.sh list against the current Terraform state returns the existing gpu pool row as expected.
  • bash manage-node-pools.sh --help renders the full usage block.

End-to-end add/remove runs against a live cluster have not been executed in this branch; the boxes above are intentionally left unchecked.

Documentation Impact

  • No documentation changes needed
  • Documentation updated in this PR
  • Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

  • Linked to issue being fixed
  • Regression test included, OR
  • Justification for no regression test:

Checklist

Changes

Script

  • Added infrastructure/setup/optional/manage-node-pools.sh with four subcommands:
    • list prints the current node_pools table (name, VM size, priority, autoscale range, taints).
    • add creates a new pool from CLI flags covering vm-size, subnet, priority, node-count or auto-scale with min-count/max-count, repeatable taint/label/zone, eviction-policy (Spot only), and gpu-driver. Rejects duplicate pool names and validates flag combinations.
    • remove deletes a pool from the overlay and warns when removal empties the map or when DEFAULT_POOL from .env.local matches the pool being removed.
    • sync re-renders OSMO configs without a Terraform apply (useful after manual terraform.tfvars edits).
  • The script maintains a managed overlay at infrastructure/terraform/node-pools.managed.auto.tfvars.json. On first mutation it seeds the overlay by evaluating var.node_pools through terraform console, so Terraform's existing for_each on azurerm_kubernetes_cluster_node_pool, subnets, NSG associations, and NAT gateway associations only touches the added or removed pool.
  • After writing the overlay, the script runs terraform apply -auto-approve (skippable with --skip-apply) and then invokes infrastructure/setup/04-deploy-osmo-backend.sh to regenerate OSMO POD_TEMPLATE, POOL, and BACKEND configs. Operator-supplied flags pass through via --osmo-args so the same auth and ACR settings from the original deploy are preserved.
  • Follows the repo shell-script template: set -o errexit -o nounset, sources scripts/lib/common.sh and defaults.conf, and uses the info/warn/fatal/section/print_kv helpers.

Documentation

  • Added docs/infrastructure/manage-node-pools.md with when-to-use rationale (including the SDG workflow that surfaced this gap), a how-it-works explanation of the overlay and for_each semantics, prerequisites, full flag tables, four worked examples (list, CPU pool for SDG, Spot H100 with autoscaling, remove, sync), verification commands (kubectl get nodes, az aks nodepool list, osmo config show POOL), and operational notes on subnet planning, DEFAULT_POOL drift, overlay-as-source-of-truth, Spot constraints, and autoscaling.
  • Updated docs/infrastructure/cluster-setup-advanced.md to list the new script in the Optional Scripts table with a link to the new page.

Related Issues

None.

Notes

  • The .auto.tfvars.json overlay is not added to .gitignore; operators can either commit it to share pool composition with the team or keep it local alongside terraform.tfvars.
  • Mixing edits between terraform.tfvars and the overlay leads to the overlay winning (Terraform loads *.auto.tfvars* after terraform.tfvars). The new documentation flags this explicitly.

Follow-up Tasks

  • Exercise add and remove end-to-end on a dev cluster and update the Testing Performed checkboxes above.

- implement script for managing AKS node pools
- create documentation for node pool management
- include usage examples and command options

Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 24, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA cb330e0.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.65%. Comparing base (cb6cabf) to head (cb330e0).

❌ Your project status has failed because the head coverage (64.51%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #548   +/-   ##
=======================================
  Coverage   88.65%   88.65%           
=======================================
  Files         252      252           
  Lines       18019    18015    -4     
  Branches     2451     2451           
=======================================
- Hits        15974    15971    -3     
+ Misses       1578     1577    -1     
  Partials      467      467           
Flag Coverage Δ *Carryforward flag
pester 83.15% <ø> (+<0.01%) ⬆️ Carriedforward from e9546cb
pytest-data-pipeline 100.00% <ø> (ø) Carriedforward from e9546cb
pytest-dataviewer 93.60% <ø> (ø) Carriedforward from e9546cb
pytest-dm-tools 100.00% <ø> (ø) Carriedforward from e9546cb
pytest-evaluation 99.51% <ø> (ø)
pytest-fuzz 4.89% <ø> (ø) Carriedforward from e9546cb
pytest-inference 100.00% <ø> (ø) Carriedforward from e9546cb
pytest-training 93.32% <ø> (ø) Carriedforward from e9546cb
vitest 86.30% <ø> (ø) Carriedforward from e9546cb
vitest-app 86.30% <ø> (ø) Carriedforward from e9546cb
vitest-components 86.30% <ø> (ø) Carriedforward from e9546cb
vitest-features 86.30% <ø> (ø) Carriedforward from e9546cb
vitest-lib 86.30% <ø> (ø) Carriedforward from e9546cb
vitest-state 86.30% <ø> (ø) Carriedforward from e9546cb

*This pull request uses carry forward flags. Click here to find out more.
see 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread docs/infrastructure/manage-node-pools.md Outdated
Comment thread docs/infrastructure/manage-node-pools.md Outdated
@bindsi bindsi marked this pull request as ready for review May 9, 2026 12:00
@bindsi bindsi requested a review from a team as a code owner May 9, 2026 12:00
Copy link
Copy Markdown
Member Author

@bindsi bindsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed both review comments:

  • VM SKU clarification (line 24): AKS node pools are single-SKU, so a different SKU means a new pool. Updated the "When to Use" section to call this out explicitly and reframed the examples accordingly.
  • --config-preview consistency (line 55): --skip-apply and --config-preview have different semantics — --skip-apply still writes the overlay, while --config-preview in the 01–04 scripts is a true no-mutation preview. Added a real --config-preview flag that prints the resolved config and exits before any overlay write, terraform apply, or OSMO sync. Kept --skip-apply for the mutate-overlay-only workflow and clarified the distinction in both the help text and the docs table.

shellcheck and markdownlint-cli2 both clean.

…g-preview

- Note pools are single-SKU and add/remove are independent operations
- Add --config-preview flag matching 01-04 deploy scripts
- Distinguish --skip-apply (writes overlay) from --config-preview (no mutation)

🤖 Generated by Copilot
Copy link
Copy Markdown
Collaborator

@katriendg katriendg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together — the documentation is excellent and the script follows repo conventions well. A few observations, one architectural and a few smaller items:


⚠️ Overlay introduces drift risk (High)

The managed overlay (node-pools.managed.auto.tfvars.json) creates a dual source-of-truth for node_pools. Because Terraform loads *.auto.tfvars.json after terraform.tfvars, any edits an operator makes to node_pools in terraform.tfvars are silently ignored once the overlay exists.

Concerns:

  1. Terraform already supports this workflow directly. Edit node_pools in terraform.tfvarsterraform planterraform apply04-deploy-osmo-backend.sh. This is the established pattern for every other Terraform-managed resource and requires no new abstraction.
  2. seed_managed_tfvars() forks config permanently. Once it snapshots var.node_pools into the overlay, the overlay becomes sole source of truth. This fork is invisible to anyone looking at terraform.tfvars.
  3. Ambiguous ownership. The overlay is neither .gitignored nor required to be committed — teams will split on whether it's tracked, leading to divergent cluster states across environments.
  4. Future risk. Any script or CI pipeline that touches terraform.tfvars node_pools will silently lose to the overlay without warning.

Suggestion: Consider reducing the script to a thin wrapper that:

  • Validates inputs (the flag-to-JSON translation is genuinely valuable)
  • Prints a ready-to-paste node_pools map entry for terraform.tfvars
  • Optionally chains terraform apply (without -auto-approve) + 04-deploy-osmo-backend.sh

This preserves the UX convenience without introducing a parallel config mechanism. If the overlay approach is kept, at minimum: add it to .gitignore, print a prominent warning when seeded, and remove -auto-approve.


⚠️ terraform apply -auto-approve (Medium)

The deploy scripts (01–04) never run terraform apply themselves — they read from outputs and assume the user applied separately. This script breaks that pattern with -auto-approve, which can apply unrelated drift if state is out of sync. Recommend removing -auto-approve so operators see the full plan before confirming.


💡 terraform console vs read_terraform_outputs (Medium)

All other setup scripts use read_terraform_outputs + tf_get/tf_require. The node_pools value is already exposed as a Terraform output. Using terraform console to evaluate var.node_pools is non-standard and reads the variable definition (pre-apply) rather than applied state. For cmd_list, consider using the output for consistency.


💡 Minor items (Low, non-blocking)

  • --skip-apply naming: The overlay is already written when this flag takes effect — consider --write-overlay-only or printing a warning.
  • OSMO sync scope: 04-deploy-osmo-backend.sh does a full backend cycle (Helm upgrade, tokens, storage), not just pool reconciliation. Worth documenting that sync is heavier than it sounds.
  • --config-preview on list: Silently ignored. Either handle it or note it's not applicable.
  • cspell cleanup: The ~304 entry deduplication is beneficial but undocumented in the PR body — worth a brief mention so reviewers know it's intentional.

zscaler

# Academic & Technical Terms
anonymization
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, big parts of this file are deleting entire categories, intentional?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, I think it was a merge problem...

bindsi and others added 3 commits May 12, 2026 14:32
…only workflow

- Delete optional/manage-node-pools.sh and its overlay file mechanism
- Rewrite manage-node-pools.md as a guide: edit tfvars, terraform apply, rerun 04
- Document ForceNew vs in-place node_pools fields
- Cover resize, add, remove, two-step SKU upgrade, and in-place SKU replace
- Drop script entry from cluster-setup-advanced.md optional scripts table

🤖 Generated by Copilot
Copy link
Copy Markdown
Contributor

@rezatnoMsirhC rezatnoMsirhC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution. The documentation is thorough and the ForceNew field table is genuinely useful.

The PR title and description still describe the shell script (list/add/remove/sync subcommands, the managed overlay file, seed_managed_tfvars(), --skip-apply) that was removed in e9546cb0. The actual diff is documentation-only. Worth updating the title and body to reflect the current scope so git history stays accurate for future contributors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants