Background Information — Aquila Dynamics Labs (ADLs)
Rolling upgrade meltdown
Call — ADLs SEV-1 Bridge (Change Window 22:00–02:00 local)
(Live voice excerpts; timestamps = local time)
22:11:27 – Marco: “We’re T+11 into the window. The web upgrade portal is returning 504s when engineers click ‘Upgrade’. They’re retrying—we’ve got a click storm.”
22:12:04 – Miguel: “Our Flask handler executes the upgrade inline. Gunicorn has a 30s timeout; NGINX surfaces the 504. Those retries likely double-triggered jobs.”
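Miguel's description of the failure mode can be sketched with a stdlib-only toy model (the real service is Flask behind Gunicorn/NGINX; the function names and the 600-second job duration here are illustrative, not from ADLs' codebase). The point is that the HTTP handler blocks for the full job duration, so any upgrade longer than the 30s worker timeout surfaces as a 504:

```python
import time

GUNICORN_TIMEOUT_S = 30  # worker timeout mentioned at 22:12:04

def run_upgrade(device_id: str, duration_s: float) -> None:
    """Stand-in for the real job (image push + device reload), which takes minutes."""
    time.sleep(duration_s)

def handle_upgrade_request(device_id: str, upgrade_duration_s: float = 600.0) -> dict:
    # Synchronous execution: the HTTP request blocks until the job finishes.
    run_upgrade(device_id, upgrade_duration_s)
    return {"status": "done", "device": device_id}

def would_time_out(upgrade_duration_s: float) -> bool:
    # Any job longer than the worker timeout becomes a 504 at NGINX,
    # which is exactly what triggers the engineers' retry clicks.
    return upgrade_duration_s > GUNICORN_TIMEOUT_S
```

Because the 504 arrives while the job is still running on the worker, each retry click launches a fresh job on top of the old one.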
22:13:18 – Marco: “Site-14 just went dark. Both Dist-A and Dist-B reloaded within the same minute. Declaring SEV-1 for the site.”
22:15:21 – Marco: “WAN egress spiked to 4.8 Gbps—each device appears to download the full image from the DC. No peer cache in play.”
22:16:37 – Miguel: “All workers run centrally in the DC. Per-device downloads; no shared cache.”
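A back-of-envelope check shows why per-device pulls from the DC saturate the WAN (the image size and concurrent device count below are assumptions for illustration; only the 4.8 Gbps figure comes from the 22:15:21 update):

```python
IMAGE_GB = 0.75           # assumed image size in gigabytes
CONCURRENT_DEVICES = 100  # assumed devices pulling at the same time
LINK_GBPS = 4.8           # observed WAN egress (22:15:21)

# Total bits that must cross the WAN if every device pulls its own copy.
total_gbits = IMAGE_GB * 8 * CONCURRENT_DEVICES

# Time the link stays pinned at peak just serving image downloads.
seconds_at_peak = total_gbits / LINK_GBPS
```

With a local cache (option-F style), the same image crosses the WAN roughly once per site instead of once per device.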
22:19:05 – You: “Three root causes:
- Synchronous execution in the web tier → 504s → retry storms → duplicate jobs.
- No pair-aware guardrail → both dist switches rebooted.
- Centralized workers with per-device image pulls → WAN surge.”
22:21:42 – Rafael: “Add an idempotency key so retries don’t relaunch the same upgrade.”
22:24:02 – Miguel: “Copy—bringing maintenance page up, wiring queue & pair-lock, and spinning a regional worker now.”
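The pair-lock Miguel mentions can be sketched with a per-pair mutex (names are assumed; a real guardrail would also persist lock state and verify the peer is healthy before releasing). Only one member of a redundant pair can hold the upgrade slot, so Dist-A and Dist-B can never reload in the same window:

```python
import threading

_pair_locks: dict[str, threading.Lock] = {}
_registry_lock = threading.Lock()

def _pair_lock(pair_id: str) -> threading.Lock:
    # Lazily create one lock per redundant pair (e.g., "site14-dist").
    with _registry_lock:
        return _pair_locks.setdefault(pair_id, threading.Lock())

def try_start_upgrade(pair_id: str) -> bool:
    """Non-blocking: returns False if the peer is already mid-upgrade."""
    return _pair_lock(pair_id).acquire(blocking=False)

def finish_upgrade(pair_id: str) -> None:
    # Called only after the upgraded member is back up and passing checks.
    _pair_lock(pair_id).release()
```

Had this been in place, the second job for Site-14 would have been deferred instead of reloading Dist-B alongside Dist-A.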
Which actions will directly neutralize the three root causes called out at 22:19:05 while fitting the live change window?
A. Bump NGINX/Gunicorn timeouts to 300s and ask engineers not to double-click.
B. Keep portal live and auto-retry failed HTTP POSTs until 200 OK.
C. Move execution to cloud serverless; call functions directly from Flask during the request.
D. Pre-stage images on a public CDN and let devices pull from the Internet.
E. Introduce a task queue to decouple HTTP handling from execution, and persist each job with an idempotency token so 504-driven retries don't relaunch the same upgrade.
F. Distribute workers per site/region and enable a local image cache (e.g., branch repo) so devices fetch locally, not across the WAN.
G. Enforce one-at-a-time upgrades on redundant pairs (Dist-A, then Dist-B) via a lock/guardrail.
H. Parallelize all upgrades per site to recover schedule; rely on QoS to protect links.