diff --git a/0-Energy_Hardware_Optimization.md b/0-Energy_Hardware_Optimization.md
index 70a490a..f2c646d 100644
--- a/0-Energy_Hardware_Optimization.md
+++ b/0-Energy_Hardware_Optimization.md
@@ -270,7 +270,7 @@ az aks update -g ResourceGroup -n EfficientAKS --enable-cluster-autoscaler --min
-

+
Refresh Date: 2025-08-20
diff --git a/1-Cloud_k8s_Scaling.md b/1-Cloud_k8s_Scaling.md
index 846721e..4ad261e 100644
--- a/1-Cloud_k8s_Scaling.md
+++ b/1-Cloud_k8s_Scaling.md
@@ -1285,7 +1285,7 @@ foreach ($vm in $vmList) {
-

+
Refresh Date: 2025-08-20
diff --git a/2-Azure_Stack_Architecture.md b/2-Azure_Stack_Architecture.md
index 89a12ac..37fd44c 100644
--- a/2-Azure_Stack_Architecture.md
+++ b/2-Azure_Stack_Architecture.md
@@ -1065,7 +1065,7 @@ Layer Interactions:
-

+
Refresh Date: 2025-08-20
diff --git a/3-Azure_Capacity_Challenges.md b/3-Azure_Capacity_Challenges.md
index 8b13789..79677ce 100644
--- a/3-Azure_Capacity_Challenges.md
+++ b/3-Azure_Capacity_Challenges.md
@@ -1 +1,629 @@
+# Azure Capacity Challenges — Causes, Signals, and Proactive Strategies
+[](https://github.com)
+[](https://github.com/)
+[brown9804](https://github.com/brown9804)
+
+Last updated: 2025-08-20
+
+-----------------------------
+
+> This community demo is for learning only and uses public documentation. It blends theory and practical examples (no cloud sign-in required). For production guidance, cost/security/compliance, and Azure-specific deployment patterns, contact Microsoft directly: [Microsoft Sales and Support](https://support.microsoft.com/contactus?ContactUsExperienceEntryPointAssetId=S.HP.SMC-HOME)
+
+
+List of References (Click to expand)
+
+- Azure status and service health
+ - https://status.azure.com
+ - https://learn.microsoft.com/azure/service-health/overview
+- Azure regional services and availability
+ - https://azure.microsoft.com/global-infrastructure/services/
+ - https://learn.microsoft.com/azure/availability-zones/az-overview
+- VM sizes, SKUs, and quotas
+ - https://learn.microsoft.com/azure/virtual-machines/sizes
+ - https://learn.microsoft.com/azure/quotas/quotas-overview
+ - https://learn.microsoft.com/azure/quotas/per-vm-family-quota-requests
+- Capacity error patterns and mitigations
+ - https://learn.microsoft.com/azure/azure-resource-manager/troubleshooting/error-codes
+ - https://learn.microsoft.com/azure/virtual-machines/troubleshooting/allocation-failure
+- Reservations, savings plans, and scale sets
+ - https://learn.microsoft.com/azure/cost-management-billing/reservations/save-compute-costs-reservations
+ - https://learn.microsoft.com/azure/virtual-machine-scale-sets/overview
+- AKS scaling and schedulability
+ - https://learn.microsoft.com/azure/aks/cluster-autoscaler
+ - https://learn.microsoft.com/azure/aks/start-stop-cluster
+- Storage and networking capacity
+ - https://learn.microsoft.com/azure/storage/common/scalability-targets-standard-account
+ - https://learn.microsoft.com/azure/azure-resource-manager/management/azure-subscription-service-limits
+- Azure Advisor and capacity planning
+ - https://learn.microsoft.com/azure/advisor/advisor-overview
+- Workload identity and regional expansion
+ - https://learn.microsoft.com/azure/reliability/cross-region-replication-azure
+
+
+
+
+Table of Contents (Click to expand)
+
+- [What are Azure Capacity Challenges?](#what-are-azure-capacity-challenges)
+- [Why capacity constraints happen](#why-capacity-constraints-happen)
+- [Common signals and error codes](#common-signals-and-error-codes)
+- [Proactive planning and design](#proactive-planning-and-design)
+- [Operational playbooks (runbooks)](#operational-playbooks-runbooks)
+- [Automation examples (CLI/PowerShell/Bicep/KQL)](#automation-examples-clipowershellbicepkql)
+- [Error-to-Action mapping](#error-to-action-mapping)
+- [Alerting and auto-remediation](#alerting-and-auto-remediation)
+- [CI/CD gates and policy guardrails](#cicd-gates-and-policy-guardrails)
+- [Quota-as-Code automation](#quota-as-code-automation)
+- [Policy config schema (region/SKU/quotas)](#policy-config-schema-regionskuquotas)
+- [IaC quickstart: Action Group + Alerts + Logic App](#iac-quickstart-action-group--alerts--logic-app)
+- [SKU/Region fallback playbook](#skuregion-fallback-playbook)
+- [AKS- and PaaS-specific guidance](#aks--and-paas-specific-guidance)
+- [Testing, drill, and validation](#testing-drill-and-validation)
+- [Cost, reservations, and risk trade-offs](#cost-reservations-and-risk-trade-offs)
+- [Checklist](#checklist)
+
+
+
+> Capacity issues in Azure surface in two broad buckets: quota (soft) limits and physical capacity (hard) constraints. Effective designs anticipate both, offer SKU/region flexibility, and automate detection, fallback, and escalation.
+
+## What are Azure Capacity Challenges?
+
+- Soft constraints: subscription/resource quotas (per-VM family cores, public IPs, NICs, vCPU per region, AKS node pools, etc.)
+- Hard constraints: regional/AZ scarcity of specific SKUs, ephemeral capacity during incidents, or burst demand (e.g., GPUs)
+- Scope: region-level, zone-level, cluster/rack-level, or specific hardware features (e.g., Ultra Disk, GPUs, NVMe)
+
+
+Capacity risk scenarios
+
+- New region or AZ not yet enabled for a service/SKU
+- Hot SKU (e.g., GPUs, Premium SSD v2, Ultra Disk) in short supply
+- Highly constrained shapes (large RAM/CPU, confidential computing)
+- Scale-out during an incident or global event
+- Zonal pinning creating skew (all demand in a single AZ)
+- Strict placement policies (PPG/availability sets) limiting allocatable hosts
+
+
+
+## Why capacity constraints happen
+
+- Demand spikes: seasonal events, marketing launches, or incident-induced migrations
+- Hardware specialization: GPUs/NPUs or Ultra Disk clusters are finite per region/AZ
+- Zonal affinity: all workloads targeting one zone
+- Fixed regional envelopes: datacenter lead times vs. sudden growth
+- SKU features mismatch: requiring features not present in selected region/zone
+- Quota not aligned: per-VM family vCPU not raised ahead of scale
+
+
+Preventable causes and anti-patterns
+
+- Single-region dependency without failover
+- Tightly constrained SKU choices (one exact size) with no fallbacks
+- Overuse of proximity placement groups beyond strict latency needs
+- Ignoring per-family quotas during IaC rollouts
+- Fixed zonal mappings without elasticity
+- Manual-only escalation for quota increases
+
+
+
+## Common signals and error codes
+
+- AllocationFailure: The requested VM size/zone/region currently cannot be allocated
+- OverconstrainedAllocationRequest / ZonalAllocationFailed: constraints prevent placement
+- QuotaExceeded: Subscription or per-VM-family quota insufficient
+- OperationNotAllowed: Service limit reached (e.g., IPs, NICs, disks)
+- SKUNotAvailable: Size not available in selected region/zone
+- InsufficientMemory/InsufficientCores (service-specific messages)
+
+
+How to confirm and triage
+
+- Check Service Health and Resource Health for regional advisories
+- Query Activity Logs for failed deployments and error codes
+- Use What-If before large template rollouts to detect quota gaps
+- Attempt allocation in alternate zone or region to isolate scope
+- Validate SKU availability programmatically
+
+
+
+## Proactive planning and design
+
+- Multi-AZ and multi-region ready: design for N+1 regions with active/active or active/passive
+- SKU flexibility: define a prioritized list of sizes per workload class
+- Region flexibility: primary/secondary/tertiary region matrix, aligned to data residency
+- Zonal elasticity: allow any-of AZs unless strict locality is required
+- Quota-as-code: pre-raise quotas in pipelines; track as configuration
+- Use scale sets with mixed or flexible orchestration modes
+- Reservations/Savings Plans for steady base; burst on-demand
+- For GPUs or Ultra Disk: pre-provision warm capacity with health checks
+
+
+Architecture patterns
+
+- Active/Active with Front Door or Traffic Manager across 2+ paired regions
+- VMSS Flexible Orchestration with multiple SKUs in priority order
+- AKS multiple node pools with alternative VM sizes and zones
+- Stateless app tier and stateful data layer with geo-replication (ZRS/GRS, AG listener, Cosmos DB multi-region)
+- Deployment rings: canary → regional → multi-region
+- Feature flags to toggle region or SKU at runtime
+
+
+
+## Operational playbooks (runbooks)
+
+- Detect: Monitor allocation failures and quota nearing thresholds
+- Decide: Auto-select alternative AZ/SKU/region according to policy
+- Do: Retry with relaxed constraints; escalate quota requests automatically
+- Document: Log incidents, annotate cost/latency impacts, and update the allowlists
+
+
+Runbook examples
+
+- VMSS scale-out fails with AllocationFailure → retry with next SKU; if repeat, pick next AZ; if repeat, shift to secondary region
+- AKS pending pods due to unschedulable nodes → enable/verify Cluster Autoscaler; add alt-size node pool; temporarily taint/cordon and drain
+- QuotaExceeded detected in What-If → programmatically raise per-family quota and block merge until approved
+- GPU scarcity → shift batch/training to Batch with low-priority or spot VMs in alternate region; queue jobs
+
+
+
+## Automation examples (CLI/PowerShell/Bicep/KQL)
+
+- List available VM sizes/SKUs by region
+
+```powershell
+# PowerShell
+Get-AzVMSize -Location eastus | Sort-Object Name | Select-Object -First 10
+```
+
+```json
+// Bicep (snippet) - VMSS Flexible with multiple SKUs
+// Note: illustrative snippet; adapt to your module style
+```
+
+```bicep
+param location string = resourceGroup().location
+param skuPrimary string = 'Standard_D4s_v5'
+param skuAlt string = 'Standard_D2s_v5'
+
+resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2024-03-01' = {
+ name: 'web-flex'
+ location: location
+ sku: {
+ name: skuPrimary
+ capacity: 2
+ }
+ properties: {
+ orchestrationMode: 'Flexible'
+ upgradePolicy: { mode: 'Rolling' }
+ virtualMachineProfile: {
+ priorityMixPolicy: {
+ baseRegularPriorityCount: 2
+ }
+ osProfile: {
+ computerNamePrefix: 'web'
+ adminUsername: 'azureuser'
+ }
+ storageProfile: {
+ imageReference: {
+ publisher: 'Canonical'
+ offer: '0001-com-ubuntu-server-jammy'
+ sku: '22_04-lts'
+ version: 'latest'
+ }
+ }
+ networkProfile: {
+ networkInterfaceConfigurations: [
+ {
+ name: 'nic'
+ properties: {
+ primary: true
+ ipConfigurations: [{ name: 'ipconfig' }]
+ }
+ }
+ ]
+ }
+ }
+ }
+}
+
+// Alternate SKU VM resource to join VMSS Flex as instance
+resource vmAlt 'Microsoft.Compute/virtualMachines@2024-03-01' = {
+ name: 'web-alt-001'
+ location: location
+ properties: {
+ virtualMachineScaleSet: {
+ id: vmss.id
+ }
+ hardwareProfile: {
+ vmSize: skuAlt
+ }
+ storageProfile: {
+ imageReference: {
+ publisher: 'Canonical'
+ offer: '0001-com-ubuntu-server-jammy'
+ sku: '22_04-lts'
+ version: 'latest'
+ }
+ }
+ osProfile: {
+ computerName: 'web-alt-001'
+ adminUsername: 'azureuser'
+ linuxConfiguration: { disablePasswordAuthentication: true }
+ }
+ networkProfile: {
+ networkInterfaces: [
+ {
+ id: resourceId('Microsoft.Network/networkInterfaces', 'nic-web-alt-001')
+ properties: { primary: true }
+ }
+ ]
+ }
+ }
+}
+```
+
+- Query allocation failures and quotas in Activity Logs and Azure Monitor
+
+```kql
+// Activity Logs: VM allocation failures last 24h
+AzureActivity
+| where TimeGenerated > ago(24h)
+| where OperationNameValue has 'write' and ActivityStatusValue == 'Failed'
+| where Properties has_any ('AllocationFailure','Overconstrained','SKUNotAvailable','QuotaExceeded')
+| project TimeGenerated, ResourceGroup, Resource, OperationNameValue, ActivityStatusValue, Properties
+```
+
+- Programmatically request quota increases
+
+```powershell
+# Example: Increase vCPU per-VM family quota
+# Note: Use Az.Quota cmdlets when available in your environment
+# Fallback to Azure Portal or REST API for specific providers if needed
+```
+
+- Validate SKU availability via CLI
+
+```powershell
+# Azure CLI in PowerShell shell
+az vm list-skus --location eastus --output table | Select-String D4s_v5
+```
+
+## Error-to-Action mapping
+
+- AllocationFailure
+ - Immediate: Retry with next allowed SKU (same region, any AZ) via VMSS Flex or parameterized IaC
+ - Next: Try alternative AZ; if still failing, try paired/secondary region
+ - Follow-up: Open capacity ticket only if pattern persists across AZs/regions; enrich with Activity Log evidence
+
+- SKUNotAvailable
+ - Immediate: Switch to nearest-performance SKU or adjacent family (e.g., Dv5 ↔ Ev5) from an approved allowlist
+ - Next: Check region availability list; move only burst capacity when possible
+ - Follow-up: Update allowlist; revisit reservations/savings plans to align with observed availability
+
+- QuotaExceeded
+ - Immediate: Re-balance to other families/regions or temporarily cap scale-out
+ - Next: Auto-raise per-VM-family vCPU quota with approval workflow; block rollout until raised
+ - Follow-up: Increase proactive thresholds; embed What-If gates in CI
+
+- OverconstrainedAllocationRequest / ZonalAllocationFailed
+ - Immediate: Relax constraints (allow any-of zones; remove non-critical PPG)
+ - Next: Add alternate SKU or region; retry with wider placement
+ - Follow-up: Document minimal viable constraints in design
+
+## Alerting and auto-remediation
+
+- Alerts to create
+ - Activity log alert: AllocationFailure / SKUNotAvailable / QuotaExceeded events
+ - Metric alerts: VMSS pending instances, AKS Pending pods > N for M minutes
+ - Service Health: Regional capacity advisories for target regions
+
+- KQL alert (activity failures)
+```kql
+AzureActivity
+| where TimeGenerated > ago(15m)
+| where ActivityStatusValue == 'Failed'
+| where Properties has_any ('AllocationFailure','SKUNotAvailable','QuotaExceeded','Overconstrained')
+| summarize failures = count() by bin(TimeGenerated, 5m)
+| where failures > 0
+```
+
+- Auto-remediation patterns
+ - Logic App/Function: on alert, re-deploy with next SKU/AZ, or create quota request; attach incident context
+ - Pipeline gate: block infra rollout if capacity alerts fired in last 30 minutes
+ - Ticketing integration: create/route incident with runbook decision tree
+
+## CI/CD gates and policy guardrails
+
+- Pipeline gates
+ - Deployment What-If on all infra PRs; fail when QuotaExceeded is predicted
+ - SKU availability probe per target region before rollout
+ - Require populated fallback parameters (alt SKUs, secondary region)
+
+- Policy guardrails (examples)
+ - Deny disallowed SKUs; Audit PPG usage unless tag reason is present
+ - Require minRegions >= 2 for tier-X services
+ - Enforce tags: region-priority, sku-allowlist-version
+
+## Quota-as-Code automation
+
+- Desired state approach
+ - Track per-VM-family vCPU quotas by region in config (YAML/JSON)
+ - Pipeline reconciles desired vs actual and raises requests ahead of scale events
+
+- Example outline (PowerShell pseudocode)
+```powershell
+$desired = @(
+ @{ region='eastus'; family='Dsv5'; vcpus=200 },
+ @{ region='eastus2'; family='Dsv5'; vcpus=200 }
+)
+foreach ($q in $desired) {
+ # Get current quota for $q.family in $q.region
+ # If current < $q.vcpus → submit quota increase request and notify approvers
+}
+```
+
+## Policy config schema (region/SKU/quotas)
+
+- Minimal, repo-friendly schema to drive fallback, placement, and quota reconciliation.
+
+```yaml
+# policy.yaml
+version: 1
+policy:
+ regionPriority:
+ - eastus
+ - eastus2
+ - centralus
+ skuAllowlist:
+ - Standard_D4s_v5
+ - Standard_D2s_v5
+ - Standard_E4s_v5
+ constraints:
+ zones: any
+ requireAcceleratedNetworking: true
+ quotas:
+ compute:
+ Dsv5:
+ eastus: 200
+ eastus2: 200
+ publicIps:
+ eastus: 100
+alerts:
+ allocationFailure:
+ window: PT15M
+ threshold: 1
+```
+
+```json
+{
+ "version": 1,
+ "policy": {
+ "regionPriority": ["eastus", "eastus2", "centralus"],
+ "skuAllowlist": ["Standard_D4s_v5", "Standard_D2s_v5", "Standard_E4s_v5"],
+ "constraints": {
+ "zones": "any",
+ "requireAcceleratedNetworking": true
+ },
+ "quotas": {
+ "compute": { "Dsv5": { "eastus": 200, "eastus2": 200 } },
+ "publicIps": { "eastus": 100 }
+ }
+ },
+ "alerts": { "allocationFailure": { "window": "PT15M", "threshold": 1 } }
+}
+```
+
+Guidance
+- Source-control the schema; bump version when policy changes. Validate in CI before deploys.
+- Feed this into your quota reconciler and your fallback selector to keep behaviors consistent.
+
+## IaC quickstart: Action Group + Alerts + Logic App
+
+- Deploys: Action Group (common schema enabled), Activity Log Alert for capacity errors, KQL Log Alert, and a Logic App (Consumption) that receives the alert via webhook to trigger an automated fallback/runbook.
+- API versions aligned with current schemas: actionGroups@2023-01-01, activityLogAlerts@2020-10-01, scheduledQueryRules@2023-12-01, logic/workflows@2019-05-01.
+
+```bicep
+// params
+param location string = resourceGroup().location
+param actionGroupName string = 'cap-alerts-ag'
+param actionGroupShort string = 'capag'
+param lawResourceId string // Log Analytics workspace resourceId for KQL alert scopes
+
+// Logic App (Consumption) with an HTTP trigger named 'manual'
+resource wf 'Microsoft.Logic/workflows@2019-05-01' = {
+ name: 'cap-fallback-la'
+ location: location
+ properties: {
+ state: 'Enabled'
+ definition: {
+ '$schema': 'https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#'
+ 'contentVersion': '1.0.0.0'
+ 'parameters': {}
+ 'triggers': {
+ 'manual': {
+ 'type': 'Request',
+ 'kind': 'Http',
+ 'inputs': {
+ 'schema': {}
+ }
+ }
+ }
+ 'actions': {
+ 'DecideAndInvoke': {
+ 'type': 'Http',
+ 'inputs': {
+ 'method': 'POST',
+ // TODO: replace with your pipeline/runbook endpoint
+ 'uri': 'https://example.com/fallback-run',
+ 'headers': {
+ 'Content-Type': 'application/json'
+ },
+ 'body': {
+ 'alert': "@{triggerBody()}"
+ }
+ }
+ }
+ },
+ 'outputs': {}
+ }
+ }
+}
+
+// Build Logic App trigger callback URL for Action Group receiver
+var wfTriggerCallback = listCallbackUrl(resourceId('Microsoft.Logic/workflows/triggers', wf.name, 'manual'), '2019-05-01').value
+
+// Action Group with Logic App receiver (Common Alert Schema recommended)
+resource ag 'Microsoft.Insights/actionGroups@2023-01-01' = {
+ name: actionGroupName
+ location: 'global'
+ properties: {
+ enabled: true
+ groupShortName: actionGroupShort
+ logicAppReceivers: [
+ {
+ name: 'cap-fallback-la'
+ resourceId: wf.id
+ callbackUrl: wfTriggerCallback
+ useCommonAlertSchema: true
+ }
+ ]
+ }
+}
+
+// Activity Log Alert for capacity-related failures
+resource ala 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
+ name: 'cap-activity-alert'
+ location: 'global'
+ properties: {
+ enabled: true
+ scopes: [ subscription().id ]
+ condition: {
+ allOf: [
+ { field: 'status', equals: 'Failed' }
+ { field: 'category', equals: 'Administrative' }
+ // Match common capacity errors embedded in properties
+ { field: 'properties', containsAny: [ 'AllocationFailure', 'SKUNotAvailable', 'QuotaExceeded', 'Overconstrained' ] }
+ ]
+ }
+ actions: {
+ actionGroups: [ { actionGroupId: ag.id } ]
+ }
+ description: 'Route capacity allocation/quota failures to Logic App for auto-remediation.'
+ }
+}
+
+// Scheduled Query (KQL) Alert over Activity Logs (or over AzureActivity in LAW)
+resource kql 'Microsoft.Insights/scheduledQueryRules@2023-12-01' = {
+ name: 'cap-kql-alert'
+ location: location
+ properties: {
+ enabled: true
+ displayName: 'Capacity allocation failures (KQL)'
+ description: 'Detect allocation/quota failures via KQL and invoke action group.'
+ severity: 2
+ evaluationFrequency: 'PT5M'
+ windowSize: 'PT15M'
+ criteria: {
+ allOf: [
+ {
+ query: '''
+AzureActivity
+| where TimeGenerated > ago(15m)
+| where ActivityStatusValue == "Failed"
+| where Properties has_any ("AllocationFailure","SKUNotAvailable","QuotaExceeded","Overconstrained")
+| summarize failures = count()
+'''
+ timeAggregation: 'Count'
+ operator: 'GreaterThan'
+ threshold: 0
+ }
+ ]
+ }
+ scopes: [ lawResourceId ]
+ actions: {
+ actionGroups: [ ag.id ]
+ customProperties: {
+ scenario: 'capacity-fallback'
+ }
+ }
+ autoMitigate: true
+ }
+}
+```
+
+Notes
+- If you prefer, use Azure Verified Modules instead of raw resources: action group (avm/res/insights/action-group), activity log alert (avm/res/insights/activity-log-alert), scheduled query rule (avm/res/insights/scheduled-query-rule), logic app (avm/res/logic/workflow).
+- For private Logic App ingress, swap to a function receiver in the Action Group and authorize with an AAD app or MSI.
+
+## SKU/Region fallback playbook
+
+- Decision tree
+```
+Start → Try Primary SKU in Primary Region (any AZ)
+ ├─ Success → Done
+ └─ Fail (AllocationFailure/SKUNotAvailable)
+ → Try Alt SKU in Primary Region (any AZ)
+ ├─ Success → Done
+ └─ Fail → Try Primary SKU in Secondary Region (any AZ)
+ ├─ Success → Done
+ └─ Fail → Queue/Defer, or escalate (quota/capacity ticket)
+```
+
+- Inputs
+ - sku_allowlist: [D4s_v5, D2s_v5, E4s_v5]
+ - region_priority: [eastus, eastus2, centralus]
+ - constraints: requireAcceleratedNetworking=true, zones=any
+
+- Outputs
+ - Selected deployment tuple: (region, zone, sku)
+ - Incident created if no viable path found
+
+## AKS- and PaaS-specific guidance
+
+- AKS
+ - Multiple node pools with different VM sizes and zones
+ - Use Cluster Autoscaler and Pod PriorityClasses for critical workloads
+ - Consider virtual nodes (ACI) for burst
+ - Pre-pull container images to reduce cold-start contention
+ - For GPUs, pre-create tainted GPU pools and schedule with tolerations
+
+- App Service / Functions
+ - Use multiple worker tiers and regional deployments with Traffic Manager/Front Door
+ - For Premium plans, pre-warm instances; use scale-out rules with headroom
+ - Consumption plans: plan for throttling and cold starts; consider Premium for predictability
+
+- Databases
+ - For SQL MI or Hyperscale, plan capacity with HA/DR replicas in paired regions
+ - Use ZRS/GRS storage where applicable; monitor IO caps
+
+## Testing, drill, and validation
+
+- Pre-deployment What-If on all IaC changes
+- Chaos/scale drills that simulate regional or AZ scarcity
+- Blue/green or canary across regions to validate fallback
+- Regularly rehearse quota raise workflows and SLAs
+- Maintain a sandbox subscription for destructive allocation tests
+
+## Cost, reservations, and risk trade-offs
+
+- Reservations for base capacity (commitment vs. flexibility)
+- Savings Plans for broader compute coverage
+- Premium SKUs vs. Standard with caching and scale-out
+- Cross-region data egress vs. availability objectives
+- Spot/low-priority for non-critical batch
+
+## Checklist
+
+- Defined primary and secondary regions with tested failover
+- SKU fallback list encoded in IaC
+- Quota thresholds monitored and auto-escalated
+- What-If and SKU availability checks in CI
+- VMSS/AKS configured for flexible placement
+- Incident runbooks documented and exercised
+
+
+
+

+
Refresh Date: 2025-08-20
+
+
diff --git a/README.md b/README.md
index 8563d65..784e854 100644
--- a/README.md
+++ b/README.md
@@ -1280,7 +1280,7 @@ From [K8s cluster components](https://kubernetes.io/docs/concepts/architecture/)
-

+
Refresh Date: 2025-08-20