diff --git a/0-Energy_Hardware_Optimization.md b/0-Energy_Hardware_Optimization.md index 70a490a..f2c646d 100644 --- a/0-Energy_Hardware_Optimization.md +++ b/0-Energy_Hardware_Optimization.md @@ -270,7 +270,7 @@ az aks update -g ResourceGroup -n EfficientAKS --enable-cluster-autoscaler --min
- Total views + Total views

Refresh Date: 2025-08-20

diff --git a/1-Cloud_k8s_Scaling.md b/1-Cloud_k8s_Scaling.md index 846721e..4ad261e 100644 --- a/1-Cloud_k8s_Scaling.md +++ b/1-Cloud_k8s_Scaling.md @@ -1285,7 +1285,7 @@ foreach ($vm in $vmList) {
- Total views + Total views

Refresh Date: 2025-08-20

diff --git a/2-Azure_Stack_Architecture.md b/2-Azure_Stack_Architecture.md index 89a12ac..37fd44c 100644 --- a/2-Azure_Stack_Architecture.md +++ b/2-Azure_Stack_Architecture.md @@ -1065,7 +1065,7 @@ Layer Interactions:
- Total views + Total views

Refresh Date: 2025-08-20

diff --git a/3-Azure_Capacity_Challenges.md b/3-Azure_Capacity_Challenges.md index 8b13789..79677ce 100644 --- a/3-Azure_Capacity_Challenges.md +++ b/3-Azure_Capacity_Challenges.md @@ -1 +1,629 @@ +# Azure Capacity Challenges — Causes, Signals, and Proactive Strategies +[![GitHub](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com) +[![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/) +[brown9804](https://github.com/brown9804) + +Last updated: 2025-08-20 + +----------------------------- + +> This community demo is for learning only and uses public documentation. It blends theory and practical examples (no cloud sign-in required). For production guidance, cost/security/compliance, and Azure-specific deployment patterns, contact Microsoft directly: [Microsoft Sales and Support](https://support.microsoft.com/contactus?ContactUsExperienceEntryPointAssetId=S.HP.SMC-HOME) + +
+List of References (Click to expand) + +- Azure status and service health + - https://status.azure.com + - https://learn.microsoft.com/azure/service-health/overview +- Azure regional services and availability + - https://azure.microsoft.com/global-infrastructure/services/ + - https://learn.microsoft.com/azure/availability-zones/az-overview +- VM sizes, SKUs, and quotas + - https://learn.microsoft.com/azure/virtual-machines/sizes + - https://learn.microsoft.com/azure/quotas/quotas-overview + - https://learn.microsoft.com/azure/quotas/per-vm-family-quota-requests +- Capacity error patterns and mitigations + - https://learn.microsoft.com/azure/azure-resource-manager/troubleshooting/error-codes + - https://learn.microsoft.com/azure/virtual-machines/troubleshooting/allocation-failure +- Reservations, savings plans, and scale sets + - https://learn.microsoft.com/azure/cost-management-billing/reservations/save-compute-costs-reservations + - https://learn.microsoft.com/azure/virtual-machine-scale-sets/overview +- AKS scaling and schedulability + - https://learn.microsoft.com/azure/aks/cluster-autoscaler + - https://learn.microsoft.com/azure/aks/start-stop-cluster +- Storage and networking capacity + - https://learn.microsoft.com/azure/storage/common/scalability-targets-standard-account + - https://learn.microsoft.com/azure/azure-resource-manager/management/azure-subscription-service-limits +- Azure Advisor and capacity planning + - https://learn.microsoft.com/azure/advisor/advisor-overview +- Workload identity and regional expansion + - https://learn.microsoft.com/azure/reliability/cross-region-replication-azure + +
+ +
+Table of Contents (Click to expand) + +- [What are Azure Capacity Challenges?](#what-are-azure-capacity-challenges) +- [Why capacity constraints happen](#why-capacity-constraints-happen) +- [Common signals and error codes](#common-signals-and-error-codes) +- [Proactive planning and design](#proactive-planning-and-design) +- [Operational playbooks (runbooks)](#operational-playbooks-runbooks) +- [Automation examples (CLI/PowerShell/Bicep/KQL)](#automation-examples-clipowershellbicepkql) +- [Error-to-Action mapping](#error-to-action-mapping) +- [Alerting and auto-remediation](#alerting-and-auto-remediation) +- [CI/CD gates and policy guardrails](#cicd-gates-and-policy-guardrails) +- [Quota-as-Code automation](#quota-as-code-automation) +- [Policy config schema (region/SKU/quotas)](#policy-config-schema-regionskuquotas) +- [IaC quickstart: Action Group + Alerts + Logic App](#iac-quickstart-action-group--alerts--logic-app) +- [SKU/Region fallback playbook](#skuregion-fallback-playbook) +- [AKS- and PaaS-specific guidance](#aks--and-paas-specific-guidance) +- [Testing, drill, and validation](#testing-drill-and-validation) +- [Cost, reservations, and risk trade-offs](#cost-reservations-and-risk-trade-offs) +- [Checklist](#checklist) + +
+ +> Capacity issues in Azure surface in two broad buckets: quota (soft) limits and physical capacity (hard) constraints. Effective designs anticipate both, offer SKU/region flexibility, and automate detection, fallback, and escalation. + +## What are Azure Capacity Challenges? + +- Soft constraints: subscription/resource quotas (per-VM family cores, public IPs, NICs, vCPU per region, AKS node pools, etc.) +- Hard constraints: regional/AZ scarcity of specific SKUs, ephemeral capacity during incidents, or burst demand (e.g., GPUs) +- Scope: region-level, zone-level, cluster/rack-level, or specific hardware features (e.g., Ultra Disk, GPUs, NVMe) + +
+Capacity risk scenarios + +- New region or AZ not yet enabled for a service/SKU +- Hot SKU (e.g., GPUs, Premium SSD v2, Ultra Disk) in short supply +- Highly constrained shapes (large RAM/CPU, confidential computing) +- Scale-out during an incident or global event +- Zonal pinning creating skew (all demand in a single AZ) +- Strict placement policies (PPG/availability sets) limiting allocatable hosts + +
+ +## Why capacity constraints happen + +- Demand spikes: seasonal events, marketing launches, or incident-induced migrations +- Hardware specialization: GPUs/NPUs or Ultra Disk clusters are finite per region/AZ +- Zonal affinity: all workloads targeting one zone +- Fixed regional envelopes: datacenter lead times vs. sudden growth +- SKU features mismatch: requiring features not present in selected region/zone +- Quota not aligned: per-VM family vCPU not raised ahead of scale + +
+Preventable causes and anti-patterns + +- Single-region dependency without failover +- Tightly constrained SKU choices (one exact size) with no fallbacks +- Overuse of proximity placement groups beyond strict latency needs +- Ignoring per-family quotas during IaC rollouts +- Fixed zonal mappings without elasticity +- Manual-only escalation for quota increases + +
+ +## Common signals and error codes + +- AllocationFailure: The requested VM size/zone/region currently cannot be allocated +- OverconstrainedAllocationRequest / ZonalAllocationFailed: constraints prevent placement +- QuotaExceeded: Subscription or per-VM-family quota insufficient +- OperationNotAllowed: Service limit reached (e.g., IPs, NICs, disks) +- SKUNotAvailable: Size not available in selected region/zone +- InsufficientMemory/InsufficientCores (service-specific messages) + +
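The codes listed above can be grouped by the kind of constraint they point to, which is usually the first triage decision. The following is a minimal Python sketch under stated assumptions: the `Constraint` enum and the `classify` helper are hypothetical names, and the grouping is this document's reading of the list, not an official Azure taxonomy.

```python
from enum import Enum


class Constraint(Enum):
    QUOTA = "quota (soft)"
    CAPACITY = "capacity (hard)"
    PLACEMENT = "placement constraints"
    UNKNOWN = "unknown"


# Assumed grouping of the error codes named in this section.
# Keys are lower-cased so matching ignores casing differences
# such as SKUNotAvailable vs. SkuNotAvailable.
_CODE_TO_CONSTRAINT = {
    "allocationfailure": Constraint.CAPACITY,
    "skunotavailable": Constraint.CAPACITY,
    "overconstrainedallocationrequest": Constraint.PLACEMENT,
    "zonalallocationfailed": Constraint.PLACEMENT,
    "quotaexceeded": Constraint.QUOTA,
    "operationnotallowed": Constraint.QUOTA,
}


def classify(error_code: str) -> Constraint:
    """Return the constraint bucket for an error code, or UNKNOWN."""
    return _CODE_TO_CONSTRAINT.get(error_code.strip().lower(), Constraint.UNKNOWN)
```

A runbook could branch on the result: quota findings go to the quota request workflow, while capacity and placement findings go to the SKU/zone/region fallback path.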
+How to confirm and triage + +- Check Service Health and Resource Health for regional advisories +- Query Activity Logs for failed deployments and error codes +- Use What-If before large template rollouts to detect quota gaps +- Attempt allocation in alternate zone or region to isolate scope +- Validate SKU availability programmatically + +
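The last triage step, validating SKU availability programmatically, can be sketched against the JSON printed by `az vm list-skus --location <region> --output json`. This is a hedged Python example: `sku_is_available` is a hypothetical helper, and the sample records below are made up to mimic only the fields it reads (`name`, `locations`, `restrictions`).

```python
import json

# Made-up sample shaped like a small slice of `az vm list-skus` output.
SAMPLE_SKUS = json.loads("""
[
  {"name": "Standard_D4s_v5", "locations": ["eastus"], "restrictions": []},
  {"name": "Standard_E4s_v5", "locations": ["eastus"],
   "restrictions": [{"type": "Location",
                     "reasonCode": "NotAvailableForSubscription"}]}
]
""")


def sku_is_available(skus: list, sku_name: str, region: str) -> bool:
    """True if the SKU is listed for the region without a Location restriction."""
    wanted_region = region.lower()
    for sku in skus:
        if sku.get("name") != sku_name:
            continue
        listed = [loc.lower() for loc in sku.get("locations", [])]
        if wanted_region not in listed:
            continue
        blocked = any(
            r.get("type") == "Location" for r in sku.get("restrictions") or []
        )
        if not blocked:
            return True
    return False
```

Zone-level restrictions are deliberately ignored in this sketch; a stricter probe would also inspect restrictions of type `Zone`.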
+ +## Proactive planning and design + +- Multi-AZ and multi-region ready: design for N+1 regions with active/active or active/passive +- SKU flexibility: define a prioritized list of sizes per workload class +- Region flexibility: primary/secondary/tertiary region matrix, aligned to data residency +- Zonal elasticity: allow any-of AZs unless strict locality is required +- Quota-as-code: pre-raise quotas in pipelines; track as configuration +- Use scale sets with mixed or flexible orchestration modes +- Reservations/Savings Plans for steady base; burst on-demand +- For GPUs or Ultra Disk: pre-provision warm capacity with health checks + +
+Architecture patterns + +- Active/Active with Front Door or Traffic Manager across 2+ paired regions +- VMSS Flexible Orchestration with multiple SKUs in priority order +- AKS multiple node pools with alternative VM sizes and zones +- Stateless app tier and stateful data layer with geo-replication (ZRS/GRS, AG listener, Cosmos DB multi-region) +- Deployment rings: canary → regional → multi-region +- Feature flags to toggle region or SKU at runtime + +
+ +## Operational playbooks (runbooks) + +- Detect: Monitor allocation failures and quota nearing thresholds +- Decide: Auto-select alternative AZ/SKU/region according to policy +- Do: Retry with relaxed constraints; escalate quota requests automatically +- Document: Log incidents, annotate cost/latency impacts, and update the allowlists + +
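The "Detect" step for quotas comes down to comparing usage plus planned growth against the limit. A minimal Python sketch, assuming an 80% escalation threshold (an illustrative default, not an Azure recommendation); `QuotaUsage` and `needs_escalation` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    region: str
    family: str
    used: int   # vCPUs currently consumed
    limit: int  # vCPU quota limit for the family


def needs_escalation(q: QuotaUsage, planned_growth: int = 0,
                     threshold: float = 0.8) -> bool:
    """True when projected usage reaches the threshold share of the limit."""
    if q.limit <= 0:
        # No usable quota at all: always escalate.
        return True
    projected = q.used + planned_growth
    return projected / q.limit >= threshold
```

For example, 150 of 200 vCPUs used is below the assumed threshold, but adding a planned scale-out of 20 vCPUs crosses it and should trigger the quota request before the rollout starts.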
+Runbook examples + +- VMSS scale-out fails with AllocationFailure → retry with next SKU; if repeat, pick next AZ; if repeat, shift to secondary region +- AKS pending pods due to unschedulable nodes → enable/verify Cluster Autoscaler; add alt-size node pool; temporarily taint/cordon and drain +- QuotaExceeded detected in What-If → programmatically raise per-family quota and block merge until approved +- GPU scarcity → shift batch/training to Batch with low-priority or spot VMs in alternate region; queue jobs + +
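The first runbook (next SKU, then next AZ, then secondary region) is an ordered walk over candidate placements. The sketch below is one way to express it in Python; `candidates`, `first_allocatable`, and the `try_allocate` callback are hypothetical names, and the callback stands in for whatever deployment call your tooling makes.

```python
from typing import Callable, Iterator, Optional, Sequence, Tuple

Candidate = Tuple[str, str, str]  # (region, zone, sku)


def candidates(regions: Sequence[str], zones: Sequence[str],
               skus: Sequence[str]) -> Iterator[Candidate]:
    """Yield placements so SKUs vary first, then zones, then regions."""
    for region in regions:
        for zone in zones:
            for sku in skus:
                yield (region, zone, sku)


def first_allocatable(regions: Sequence[str], zones: Sequence[str],
                      skus: Sequence[str],
                      try_allocate: Callable[[Candidate], bool]
                      ) -> Optional[Candidate]:
    """Return the first placement that allocates, or None to queue/escalate."""
    for candidate in candidates(regions, zones, skus):
        if try_allocate(candidate):
            return candidate
    return None
```

Returning `None` rather than raising keeps the "queue, defer, or escalate" decision with the caller, which is where the incident context lives.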
+
+## Automation examples (CLI/PowerShell/Bicep/KQL)
+
+- List available VM sizes/SKUs by region
+
+```powershell
+# PowerShell
+Get-AzVMSize -Location eastus | Sort-Object Name | Select-Object -First 10
+```
+
+- VMSS Flexible with multiple SKUs (illustrative Bicep snippet; adapt to your module style)
+
+```bicep
+param location string = resourceGroup().location
+param skuPrimary string = 'Standard_D4s_v5'
+param skuAlt string = 'Standard_D2s_v5'
+
+resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2024-03-01' = {
+  name: 'web-flex'
+  location: location
+  sku: {
+    name: skuPrimary
+    capacity: 2
+  }
+  properties: {
+    orchestrationMode: 'Flexible'
+    upgradePolicy: { mode: 'Rolling' }
+    // priorityMixPolicy is a scale-set property (not part of the VM profile)
+    // and only takes effect when the profile uses Spot priority
+    priorityMixPolicy: {
+      baseRegularPriorityCount: 2
+    }
+    virtualMachineProfile: {
+      osProfile: {
+        computerNamePrefix: 'web'
+        adminUsername: 'azureuser'
+      }
+      storageProfile: {
+        imageReference: {
+          publisher: 'Canonical'
+          offer: '0001-com-ubuntu-server-jammy'
+          sku: '22_04-lts'
+          version: 'latest'
+        }
+      }
+      networkProfile: {
+        // Illustrative only: a deployable profile also needs networkApiVersion
+        // and a subnet reference in the IP configuration
+        networkInterfaceConfigurations: [
+          {
+            name: 'nic'
+            properties: {
+              primary: true
+              ipConfigurations: [{ name: 'ipconfig' }]
+            }
+          }
+        ]
+      }
+    }
+  }
+}
+
+// Alternate SKU VM resource to join VMSS Flex as instance
+resource vmAlt 'Microsoft.Compute/virtualMachines@2024-03-01' = {
+  name: 'web-alt-001'
+  location: location
+  properties: {
+    virtualMachineScaleSet: {
+      id: vmss.id
+    }
+    hardwareProfile: {
+      vmSize: skuAlt
+    }
+    storageProfile: {
+      imageReference: {
+        publisher: 'Canonical'
+        offer: '0001-com-ubuntu-server-jammy'
+        sku: '22_04-lts'
+        version: 'latest'
+      }
+    }
+    osProfile: {
+      computerName: 'web-alt-001'
+      adminUsername: 'azureuser'
+      linuxConfiguration: { disablePasswordAuthentication: true }
+    }
+    networkProfile: {
+      networkInterfaces: [
+        {
+          id: resourceId('Microsoft.Network/networkInterfaces', 'nic-web-alt-001')
+          properties: { primary: true }
+        }
+      ]
+    }
+  }
+}
+```
+
+- Query allocation failures and quotas in Activity Logs and Azure Monitor
+
+```kql
+// Activity Logs: VM allocation failures last 24h
+AzureActivity
+| where TimeGenerated > ago(24h)
+| where OperationNameValue has 'write' and ActivityStatusValue == 'Failed'
+| where Properties has_any ('AllocationFailure','Overconstrained','SKUNotAvailable','QuotaExceeded')
+| project TimeGenerated, ResourceGroup, Resource, OperationNameValue, ActivityStatusValue, Properties
+```
+
+- Programmatically request quota increases
+
+```powershell
+# Example: Increase vCPU per-VM family quota
+# Note: Use Az.Quota cmdlets when available in your environment
+# Fallback to Azure Portal or REST API for specific providers if needed
+```
+
+- Validate SKU availability via CLI
+
+```powershell
+# Azure CLI in PowerShell shell
+az vm list-skus --location eastus --output table | Select-String D4s_v5
+```
+
+## Error-to-Action mapping
+
+- AllocationFailure
+  - Immediate: Retry with next allowed SKU (same region, any AZ) via VMSS Flex or parameterized IaC
+  - Next: Try alternative AZ; if still failing, try paired/secondary region
+  - Follow-up: Open capacity ticket only if pattern persists across AZs/regions; enrich with Activity Log evidence
+
+- SKUNotAvailable
+  - Immediate: Switch to nearest-performance SKU or adjacent family (e.g., Dv5 ↔ Ev5) from an approved allowlist
+  - Next: Check region availability list; move only burst capacity when possible
+  - Follow-up: Update allowlist; revisit reservations/savings plans to align with observed availability
+
+- QuotaExceeded
+  - Immediate: Re-balance to other families/regions or temporarily cap scale-out
+  - Next: Auto-raise per-VM-family vCPU quota with approval workflow; block rollout until raised
+  - Follow-up: Increase proactive thresholds; embed What-If gates in CI
+
+- OverconstrainedAllocationRequest / ZonalAllocationFailed
+  - Immediate: Relax constraints (allow any-of zones; remove non-critical PPG)
+  - Next: Add alternate SKU or region; retry with wider placement
+  - Follow-up: Document minimal viable constraints in design
+
+## Alerting and auto-remediation
+
+- Alerts to create
+  - Activity log alert: AllocationFailure / SKUNotAvailable / QuotaExceeded events
+  - Metric alerts: VMSS pending instances, AKS Pending pods > N for M minutes
+  - Service Health: Regional capacity advisories for target regions
+
+- KQL alert (activity failures)
+```kql
+AzureActivity
+| where TimeGenerated > ago(15m)
+| where ActivityStatusValue == 'Failed'
+| where Properties has_any ('AllocationFailure','SKUNotAvailable','QuotaExceeded','Overconstrained')
+| summarize failures = count() by bin(TimeGenerated, 5m)
+| where failures > 0
+```
+
+- Auto-remediation patterns
+  - Logic App/Function: on alert, re-deploy with next SKU/AZ, or create quota request; attach incident context
+  - Pipeline gate: block infra rollout if capacity alerts fired in last 30 minutes
+  - Ticketing integration: create/route incident with runbook decision tree
+
+## CI/CD gates and policy guardrails
+
+- Pipeline gates
+  - Deployment What-If on all infra PRs; fail when QuotaExceeded is predicted
+  - SKU availability probe per target region before rollout
+  - Require populated fallback parameters (alt SKUs, secondary region)
+
+- Policy guardrails (examples)
+  - Deny disallowed SKUs; Audit PPG usage unless tag reason is present
+  - Require minRegions >= 2 for tier-X services
+  - Enforce tags: region-priority, sku-allowlist-version
+
+## Quota-as-Code automation
+
+- Desired state approach
+  - Track per-VM-family vCPU quotas by region in config (YAML/JSON)
+  - Pipeline reconciles desired vs actual and raises requests ahead of scale events
+
+- Example outline (PowerShell sketch; uses Get-AzVMUsage from Az.Compute)
+```powershell
+$desired = @(
+  @{ region='eastus';  family='Dsv5'; vcpus=200 },
+  @{ region='eastus2'; family='Dsv5'; vcpus=200 }
+)
+foreach ($q in $desired) {
+  # Get current quota for $q.family in $q.region.
+  # Assumption: the wildcard may match several families; tighten it to the exact usage name.
+  $current = Get-AzVMUsage -Location $q.region |
+    Where-Object { $_.Name.Value -like "*$($q.family)*" }
+  foreach ($u in $current) {
+    if ($u.Limit -lt $q.vcpus) {
+      # If current < $q.vcpus → submit quota increase request and notify approvers
+      Write-Warning "$($u.Name.Value) in $($q.region): limit $($u.Limit) < desired $($q.vcpus)"
+    }
+  }
+}
+```
+
+## Policy config schema (region/SKU/quotas)
+
+- Minimal, repo-friendly schema to drive fallback, placement, and quota reconciliation.
+
+```yaml
+# policy.yaml
+version: 1
+policy:
+  regionPriority:
+    - eastus
+    - eastus2
+    - centralus
+  skuAllowlist:
+    - Standard_D4s_v5
+    - Standard_D2s_v5
+    - Standard_E4s_v5
+  constraints:
+    zones: any
+    requireAcceleratedNetworking: true
+  quotas:
+    compute:
+      Dsv5:
+        eastus: 200
+        eastus2: 200
+    publicIps:
+      eastus: 100
+alerts:
+  allocationFailure:
+    window: PT15M
+    threshold: 1
+```
+
+```json
+{
+  "version": 1,
+  "policy": {
+    "regionPriority": ["eastus", "eastus2", "centralus"],
+    "skuAllowlist": ["Standard_D4s_v5", "Standard_D2s_v5", "Standard_E4s_v5"],
+    "constraints": {
+      "zones": "any",
+      "requireAcceleratedNetworking": true
+    },
+    "quotas": {
+      "compute": { "Dsv5": { "eastus": 200, "eastus2": 200 } },
+      "publicIps": { "eastus": 100 }
+    }
+  },
+  "alerts": { "allocationFailure": { "window": "PT15M", "threshold": 1 } }
+}
+```
+
+Guidance
+- Source-control the schema; bump version when policy changes. Validate in CI before deploys.
+- Feed this into your quota reconciler and your fallback selector to keep behaviors consistent.
+
+## IaC quickstart: Action Group + Alerts + Logic App
+
+- Deploys: Action Group (common schema enabled), Activity Log Alert for capacity errors, KQL Log Alert, and a Logic App (Consumption) that receives the alert via webhook to trigger an automated fallback/runbook.
+- API versions aligned with current schemas: actionGroups@2023-01-01, activityLogAlerts@2020-10-01, scheduledQueryRules@2023-12-01, logic/workflows@2019-05-01.
+
+```bicep
+// params
+param location string = resourceGroup().location
+param actionGroupName string = 'cap-alerts-ag'
+param actionGroupShort string = 'capag'
+param lawResourceId string // Log Analytics workspace resourceId for KQL alert scopes
+
+// Logic App (Consumption) with an HTTP trigger named 'manual'
+// Note: Bicep objects spread over several lines take no commas, and strings use single quotes
+resource wf 'Microsoft.Logic/workflows@2019-05-01' = {
+  name: 'cap-fallback-la'
+  location: location
+  properties: {
+    state: 'Enabled'
+    definition: {
+      '$schema': 'https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#'
+      'contentVersion': '1.0.0.0'
+      'parameters': {}
+      'triggers': {
+        'manual': {
+          'type': 'Request'
+          'kind': 'Http'
+          'inputs': {
+            'schema': {}
+          }
+        }
+      }
+      'actions': {
+        'DecideAndInvoke': {
+          'type': 'Http'
+          'inputs': {
+            'method': 'POST'
+            // TODO: replace with your pipeline/runbook endpoint
+            'uri': 'https://example.com/fallback-run'
+            'headers': {
+              'Content-Type': 'application/json'
+            }
+            'body': {
+              'alert': '@{triggerBody()}'
+            }
+          }
+        }
+      }
+      'outputs': {}
+    }
+  }
+}
+
+// Build Logic App trigger callback URL for Action Group receiver
+var wfTriggerCallback = listCallbackUrl(resourceId('Microsoft.Logic/workflows/triggers', wf.name, 'manual'), '2019-05-01').value
+
+// Action Group with Logic App receiver (Common Alert Schema recommended)
+resource ag 'Microsoft.Insights/actionGroups@2023-01-01' = {
+  name: actionGroupName
+  location: 'global'
+  properties: {
+    enabled: true
+    groupShortName: actionGroupShort
+    logicAppReceivers: [
+      {
+        name: 'cap-fallback-la'
+        resourceId: wf.id
+        callbackUrl: wfTriggerCallback
+        useCommonAlertSchema: true
+      }
+    ]
+  }
+}
+
+// Activity Log Alert for capacity-related failures
+resource ala 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
+  name: 'cap-activity-alert'
+  location: 'global'
+  properties: {
+    enabled: true
+    scopes: [ subscription().id ]
+    condition: {
+      allOf: [
+        { field: 'status', equals: 'Failed' }
+        { field: 'category', equals: 'Administrative' }
+        // Match common capacity errors embedded in properties
+        { field: 'properties', containsAny: [ 'AllocationFailure', 'SKUNotAvailable', 'QuotaExceeded', 'Overconstrained' ] }
+      ]
+    }
+    actions: {
+      actionGroups: [ { actionGroupId: ag.id } ]
+    }
+    description: 'Route capacity allocation/quota failures to Logic App for auto-remediation.'
+  }
+}
+
+// Scheduled Query (KQL) Alert over Activity Logs (or over AzureActivity in LAW)
+resource kql 'Microsoft.Insights/scheduledQueryRules@2023-12-01' = {
+  name: 'cap-kql-alert'
+  location: location
+  properties: {
+    enabled: true
+    displayName: 'Capacity allocation failures (KQL)'
+    description: 'Detect allocation/quota failures via KQL and invoke action group.'
+    severity: 2
+    evaluationFrequency: 'PT5M'
+    windowSize: 'PT15M'
+    criteria: {
+      allOf: [
+        {
+          query: '''
+AzureActivity
+| where TimeGenerated > ago(15m)
+| where ActivityStatusValue == "Failed"
+| where Properties has_any ("AllocationFailure","SKUNotAvailable","QuotaExceeded","Overconstrained")
+| summarize failures = count()
+'''
+          timeAggregation: 'Count'
+          operator: 'GreaterThan'
+          threshold: 0
+        }
+      ]
+    }
+    scopes: [ lawResourceId ]
+    actions: {
+      actionGroups: [ ag.id ]
+      customProperties: {
+        scenario: 'capacity-fallback'
+      }
+    }
+    autoMitigate: true
+  }
+}
+```
+
+Notes
+- If you prefer, use Azure Verified Modules instead of raw resources: action group (avm/res/insights/action-group), activity log alert (avm/res/insights/activity-log-alert), scheduled query rule (avm/res/insights/scheduled-query-rule), logic app (avm/res/logic/workflow).
+- For private Logic App ingress, swap to a function receiver in the Action Group and authorize with an AAD app or MSI.
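Because the Action Group above enables the common alert schema, whatever sits behind the fallback endpoint can read a predictable envelope. The Python sketch below shows one hedged way to decide whether to act; `summarize_alert` and `should_trigger_fallback` are hypothetical names, and it assumes the endpoint has already parsed the alert into a dictionary.

```python
COMMON_SCHEMA_ID = "azureMonitorCommonAlertSchema"


def summarize_alert(payload: dict) -> dict:
    """Pull the fields a fallback runbook typically needs from data.essentials."""
    essentials = payload.get("data", {}).get("essentials", {})
    return {
        "rule": essentials.get("alertRule"),
        "severity": essentials.get("severity"),
        "condition": essentials.get("monitorCondition"),
        "targets": essentials.get("alertTargetIDs", []),
    }


def should_trigger_fallback(payload: dict) -> bool:
    """Act only on common-schema alerts that are firing, not resolving."""
    if payload.get("schemaId") != COMMON_SCHEMA_ID:
        return False
    return summarize_alert(payload)["condition"] == "Fired"
```

Filtering on the monitor condition matters here because the scheduled query rule sets `autoMitigate: true`, so the same receiver is also called when the alert resolves.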
+
+## SKU/Region fallback playbook
+
+- Decision tree
+```
+Start → Try Primary SKU in Primary Region (any AZ)
+  ├─ Success → Done
+  └─ Fail (AllocationFailure/SKUNotAvailable)
+       → Try Alt SKU in Primary Region (any AZ)
+          ├─ Success → Done
+          └─ Fail → Try Primary SKU in Secondary Region (any AZ)
+                 ├─ Success → Done
+                 └─ Fail → Queue/Defer, or escalate (quota/capacity ticket)
+```
+
+- Inputs
+  - sku_allowlist: [D4s_v5, D2s_v5, E4s_v5]
+  - region_priority: [eastus, eastus2, centralus]
+  - constraints: requireAcceleratedNetworking=true, zones=any
+
+- Outputs
+  - Selected deployment tuple: (region, zone, sku)
+  - Incident created if no viable path found
+
+## AKS- and PaaS-specific guidance
+
+- AKS
+  - Multiple node pools with different VM sizes and zones
+  - Use Cluster Autoscaler and Pod PriorityClasses for critical workloads
+  - Consider virtual nodes (ACI) for burst
+  - Pre-pull container images to reduce cold-start contention
+  - For GPUs, pre-create tainted GPU pools and schedule with tolerations
+
+- App Service / Functions
+  - Use multiple worker tiers and regional deployments with Traffic Manager/Front Door
+  - For Premium plans, pre-warm instances; use scale-out rules with headroom
+  - Consumption plans: plan for throttling and cold starts; consider Premium for predictability
+
+- Databases
+  - For SQL MI or Hyperscale, plan capacity with HA/DR replicas in paired regions
+  - Use ZRS/GRS storage where applicable; monitor IO caps
+
+## Testing, drill, and validation
+
+- Pre-deployment What-If on all IaC changes
+- Chaos/scale drills that simulate regional or AZ scarcity
+- Blue/green or canary across regions to validate fallback
+- Regularly rehearse quota raise workflows and SLAs
+- Maintain a sandbox subscription for destructive allocation tests
+
+## Cost, reservations, and risk trade-offs
+
+- Reservations for base capacity (commitment vs. flexibility)
+- Savings Plans for broader compute coverage
+- Premium SKUs vs. Standard with caching and scale-out
+- Cross-region data egress vs. availability objectives
+- Spot/low-priority for non-critical batch
+
+## Checklist
+
+- Defined primary and secondary regions with tested failover
+- SKU fallback list encoded in IaC
+- Quota thresholds monitored and auto-escalated
+- What-If and SKU availability checks in CI
+- VMSS/AKS configured for flexible placement
+- Incident runbooks documented and exercised
+
+
+
+ Total views +

Refresh Date: 2025-08-20

+
+ diff --git a/README.md b/README.md index 8563d65..784e854 100644 --- a/README.md +++ b/README.md @@ -1280,7 +1280,7 @@ From [K8s cluster components](https://kubernetes.io/docs/concepts/architecture/)
- Total views + Total views

Refresh Date: 2025-08-20