What the Shopify Webhook Incident Teaches Us About Resilience
By HookSniff Team
Engineering · Published on 2026-04-30
On April 28, 2026, Shopify experienced a significant webhook delivery incident that lasted approximately 8 hours. Webhooks that normally arrived within seconds were delayed by minutes to over an hour. When the issue was resolved, a recovery surge flooded downstream systems with 3x the normal webhook volume.
This post analyzes what happened, what we can learn, and how resilient webhook infrastructure should handle these scenarios.
Timeline of the Incident
```
2026-04-28 Timeline (UTC)
─────────────────────────────────────────────────────
02:15 │ First reports of delayed webhooks in Shopify community forums
02:45 │ Shopify acknowledges increased webhook latency on status page
03:30 │ Latency increases to 15-30 minutes for most event types
05:00 │ Some webhooks delayed by 45+ minutes; order events most affected
07:00 │ Root cause identified: database migration caused queue backlog
08:30 │ Fix deployed; backlog begins clearing
09:00 │ Recovery surge starts: 3x normal webhook volume
09:45 │ Downstream systems start reporting 5xx errors from surge
10:15 │ Shopify throttles recovery delivery to 1.5x normal rate
10:30 │ Incident resolved; all webhooks delivered
─────────────────────────────────────────────────────
```

The Surge Pattern
The most dangerous part of the incident was not the delay; it was the recovery. Here is what the webhook delivery volume looked like:
```
Webhook Volume (events/minute)
│
│                               ╭──╮   Recovery surge
│                              ╭╯  ╰╮  3x normal
│                             ╭╯    ╰╮
│                            ╭╯      ╰╮
├────╮                      ╭╯        ╰──────  Normal
│    ╰╮                    ╭╯  Backlog clearing
│     ╰────────────────────╯   Incident window
│
└──────────────────────────────────────────────  Time
  02:00   04:00   06:00   08:00   10:00   12:00
```

During the incident window (02:15–08:30), webhooks accumulated in Shopify's internal queue. When the fix was deployed, all queued webhooks were released simultaneously, creating a surge that overwhelmed unprepared downstream systems.
Why Recovery Surges Are Dangerous
Most webhook consumers are designed for steady-state traffic. They handle normal volume fine but break under sudden spikes:
- **Connection pool exhaustion**: database connections max out
- **Memory pressure**: queued processing tasks consume all available RAM
- **Rate limit hits**: third-party API rate limits get triggered
- **Cascading failures**: one slow consumer backs up the entire pipeline
The irony: the systems that survived the 8-hour delay just fine were the ones that crashed during the recovery.
Lessons for Webhook Consumers
**1. Design for 3x burst capacity.** Your webhook endpoint should handle 3x your normal peak volume without degradation. This means connection pooling, async processing, and backpressure mechanisms.
**2. Implement circuit breakers.** If your downstream service starts returning 5xx, stop sending and queue locally. A circuit breaker prevents cascading failures during surge events.
**3. Use dead letter queues.** If processing fails after retries, preserve the event. Do not drop webhooks; they contain critical business data.
**4. Monitor p99 latency, not just averages.** During the Shopify incident, average latency was misleading. P99 showed the real story: some webhooks were delayed by over an hour while most arrived within minutes.
**5. Implement idempotent processing.** Recovery surges may deliver events that were partially processed before the incident. Idempotency ensures duplicate processing is safe.
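To make lesson 5 concrete, here is a minimal sketch of idempotent processing keyed on the webhook's event ID. It assumes a Redis instance is available for deduplication and that each event carries a unique ID header (Shopify, for example, sends `X-Shopify-Webhook-Id`); the key names and the `process_order_event` handler are illustrative, not a prescribed implementation.

```python
# Minimal idempotency sketch: skip events we have already processed.
# Assumes Redis is reachable locally; key names are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

def process_order_event(payload: dict) -> None:
    ...  # your business logic (update orders, sync inventory, etc.)

def handle_webhook(headers: dict, payload: dict) -> None:
    event_id = headers["X-Shopify-Webhook-Id"]  # unique per webhook event

    # SET ... NX succeeds only the first time we see this event ID, so duplicate
    # deliveries from a recovery surge become safe no-ops. Keep the key long
    # enough to outlive the provider's retry window (7 days here).
    first_delivery = r.set(f"webhook:seen:{event_id}", 1, nx=True, ex=7 * 24 * 3600)
    if not first_delivery:
        return  # duplicate: acknowledge without reprocessing

    process_order_event(payload)
```

A database unique constraint on the event ID works just as well if you would rather not run Redis for this.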
How HookSniff Handles Incident Recovery
HookSniff was designed with these scenarios in mind. Here is how we handle recovery surges:
**Exponential backoff with jitter.** Failed deliveries retry with increasing delays (10s, 30s, 2m, 10m, 30m) plus random jitter. This spreads retry traffic and prevents thundering herd problems.
"code-comment">// HookSniff retry configuration
const retryPolicy = {
maxAttempts: 5,
backoff: 'exponential',
baseDelay: 10000, "code-comment">// 10 seconds
maxDelay: 1800000, "code-comment">// 30 minutes
jitter: true, "code-comment">// Random ยฑ25% to spread load
};**Circuit breaker per endpoint.** If an endpoint fails 5 consecutive deliveries, we open the circuit for 5 minutes. This prevents us from hammering a struggling service during a surge.
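For intuition, here is a minimal sketch of that behavior in Python. The thresholds mirror the prose above; the class and method names are illustrative, not our actual delivery code.

```python
# Illustrative per-endpoint circuit breaker: open after 5 consecutive
# failures, pause for a 5-minute cooldown, then allow a probe delivery.
import time
from dataclasses import dataclass

FAILURE_THRESHOLD = 5       # consecutive failures before the circuit opens
COOLDOWN_SECONDS = 5 * 60   # how long deliveries stay paused

@dataclass
class EndpointCircuit:
    consecutive_failures: int = 0
    opened_at: float | None = None

    def allow_delivery(self) -> bool:
        if self.opened_at is None:
            return True  # circuit closed: deliver normally
        if time.monotonic() - self.opened_at >= COOLDOWN_SECONDS:
            return True  # cooldown elapsed: let one probe through
        return False     # circuit open: queue the delivery instead

    def record_result(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0
            self.opened_at = None
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_THRESHOLD:
                self.opened_at = time.monotonic()
```

An open circuit surfaces in the endpoint health check like this: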
```
Endpoint Health Check:
┌───────────────────────────────────────┐
│ endpoint: https://shop.example.com/wh │
│ status: OPEN (circuit tripped)        │
│ failures: 5 consecutive               │
│ cooldown: 4m 32s remaining            │
│ last_error: 503 Service Unavailable   │
└───────────────────────────────────────┘
```

**Dead letter queue with batch replay.** Events that exhaust all retries move to the DLQ. When the downstream service recovers, operators can batch-replay all dead-lettered events with a single API call.
"code-comment"># Batch replay all dead-lettered events for an endpointclient = HookSniff(api_key="hs_...")
# Replay all DLQ events for the affected endpoint result = client.dead_letters.replay_all( endpoint_id="ep_shopify_integration", after="2026-04-28T02:00:00Z", before="2026-04-28T10:30:00Z", )
print(f"Replayed {result.count} events") ```
**Per-endpoint throttling.** During recovery, we limit delivery rate per endpoint to prevent overwhelming downstream systems. Default: 100 requests/second per endpoint, configurable.
```rust
async fn apply_throttle(endpoint: &Endpoint, delivery: &Delivery) -> Result<()> {
    let rate = endpoint.throttle_rate.unwrap_or(100); // req/s

    if rate_limiter.check(&endpoint.id, rate, window).await?.is_limited() {
        // Re-queue with delay instead of dropping
        delivery.retry_at(chrono::Utc::now() + chrono::Duration::seconds(1)).await?;
        return Err(Error::Throttled);
    }

    Ok(())
}
```
Monitoring Checklist
After reviewing the Shopify incident, here is what every webhook consumer should monitor (a minimal instrumentation sketch follows the list):
- **Delivery latency p50/p95/p99**: not just the average
- **Queue depth**: how many webhooks are pending delivery
- **Error rate by endpoint**: per-consumer health
- **Retry rate**: spikes indicate downstream issues
- **Circuit breaker state**: open circuits need attention
- **DLQ depth**: a growing DLQ means events are failing and need replay
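As a starting point, here is a minimal sketch of that checklist using the `prometheus_client` Python library. The metric names, label sets, and bucket boundaries are illustrative choices, not HookSniff's actual metrics.

```python
# Illustrative webhook-consumer metrics covering the checklist above.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

DELIVERY_LATENCY = Histogram(
    "webhook_delivery_latency_seconds",
    "Time from event creation to successful processing",
    buckets=(1, 5, 30, 60, 300, 900, 3600),  # wide buckets so delayed deliveries stay visible
)
QUEUE_DEPTH = Gauge("webhook_queue_depth", "Webhooks pending delivery")
DLQ_DEPTH = Gauge("webhook_dlq_depth", "Dead-lettered webhooks awaiting replay")
CIRCUIT_OPEN = Gauge("webhook_circuit_open", "1 if the circuit is open", ["endpoint"])
ERRORS = Counter("webhook_delivery_errors_total", "Failed deliveries", ["endpoint"])
RETRIES = Counter("webhook_retries_total", "Retried deliveries", ["endpoint"])

def record_delivery(endpoint: str, latency_seconds: float, ok: bool, retried: bool) -> None:
    DELIVERY_LATENCY.observe(latency_seconds)
    if not ok:
        ERRORS.labels(endpoint=endpoint).inc()
    if retried:
        RETRIES.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    start_http_server(9102)  # expose /metrics for your Prometheus scraper
```

Alert on queue depth and DLQ depth trending upward and on the latency histogram's p99, rather than on averages.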
The Bigger Picture
The Shopify incident is a reminder that webhook infrastructure is only as resilient as its weakest consumer. The delivery service (Shopify) recovered, but many downstream systems were not prepared for the surge.
Building resilient webhook consumers is not optional; it is a production requirement. Plan for 3x burst capacity, implement circuit breakers, use dead letter queues, and monitor p99 latency.
And if you do not want to build all of that yourself, HookSniff handles it out of the box. Sign up at hooksniff.vercel.app; your first 10,000 webhooks per month are free.