Webhooks That Never Lose an Event: A Technical Guide
Webhooks are the backbone of almost every real integration. A payment clears, an order ships, a subscription renews, and a small HTTP request fires off to your app to keep everything in sync. When they work, the whole business feels automatic. When they fail quietly, you get the worst kind of bug: an order that never got fulfilled, a customer charged twice, a dashboard that drifts out of step with reality and nobody notices for a week.
The hard truth is that webhooks are best-effort by design. Providers send each event at least once, sometimes more than once, sometimes out of order, and occasionally not at all. A reliable integration is not one that hopes those things never happen; it is one that is built assuming they will. This guide walks through the patterns that make a webhook endpoint trustworthy, with code you can adapt to a Vercel route handler or any Node service.
Verify the signature before you trust anything
A webhook endpoint is a public URL that performs privileged actions, so the first job is proving the request actually came from the provider and was not forged or replayed. Every serious provider signs its payloads, usually with an HMAC of the raw body using a shared secret.
Two details trip people up. First, you must verify against the raw request body, exactly as received, not the re-serialized JSON, because re-encoding changes bytes and breaks the signature. Second, compare digests with a timing-safe function so that subtle response-time differences do not reveal, byte by byte, what the expected signature is.
import crypto from "node:crypto";
// Returns true only when the signature matches an HMAC-SHA256 of the raw body.
function verify(rawBody: string, signature: string, secret: string): boolean {
  const expected = crypto.createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // length check first (timingSafeEqual throws on unequal lengths), then a constant-time compare
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
If the provider includes a timestamp in the signature, reject anything older than a few minutes. That one check stops an attacker from capturing a valid request and replaying it later.
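
The staleness check can be folded into the same verification step. What follows is a minimal sketch, not any specific provider's scheme: it assumes the timestamp arrives as Unix seconds and that the provider signs the string `${timestamp}.${rawBody}`. The function name `verifyWithTimestamp`, the five-minute tolerance, and the signed-string layout are all assumptions to check against your provider's documentation.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical tolerance; choose a window that matches your provider's guidance.
const TOLERANCE_SECONDS = 300;

// Assumes the provider signs `${timestamp}.${rawBody}` with HMAC-SHA256 and
// sends the timestamp as Unix seconds. Real providers differ in the details.
function verifyWithTimestamp(
  rawBody: string,
  timestamp: string,
  signature: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  const sentAt = Number(timestamp);
  if (!Number.isFinite(sentAt)) return false;
  // Reject anything outside the window, in either direction.
  if (Math.abs(nowSeconds - sentAt) > TOLERANCE_SECONDS) return false;

  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`)
    .digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Because the timestamp is part of the signed string, an attacker cannot simply rewrite it to make an old capture look fresh.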
Acknowledge fast, then process later
Providers give you a short window, often only a few seconds, to return a 2xx response. If your handler does real work first (calling other APIs, writing to several tables, sending email), it can blow past that timeout. The provider then assumes failure and retries, and now you have duplicates and a backlog.
The fix is to separate receiving from processing. Verify the signature, hand the raw event to a queue, and return 200 immediately. A background worker does the slow work where it can take its time and retry safely.
// `queue` is a placeholder for your queue client (a managed queue or a job table).
declare const queue: { enqueue(raw: string): Promise<void> };
export async function POST(req: Request) {
  const raw = await req.text(); // raw body for the signature
  const sig = req.headers.get("x-signature") ?? ""; // header name varies by provider
  if (!verify(raw, sig, process.env.WEBHOOK_SECRET!)) {
    return new Response("bad signature", { status: 401 });
  }
  await queue.enqueue(raw); // hand off the work
  return new Response("ok", { status: 200 }); // ack inside the timeout
}
On Vercel this matters even more, because function execution time is limited and you do not want a webhook waiting on a slow third-party call. A managed queue like Upstash QStash, or a simple database-backed job table, is enough to decouple the two halves.
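
To make the "database-backed job table" idea concrete, here is a rough sketch that uses an in-memory array as a stand-in for the table. Every name in it (`JobTable`, `enqueue`, `claim`, `complete`, `release`) is hypothetical; a real version would be a database table with the same columns and the same four operations.

```typescript
// In-memory stand-in for a job table. All names here are hypothetical.
type JobStatus = "pending" | "running" | "done";

interface Job {
  id: number;
  payload: string; // the raw webhook body, stored untouched
  status: JobStatus;
  attempts: number;
}

class JobTable {
  private jobs: Job[] = [];
  private nextId = 1;

  // Called by the webhook handler: store the raw event and return right away.
  enqueue(payload: string): number {
    const job: Job = { id: this.nextId++, payload, status: "pending", attempts: 0 };
    this.jobs.push(job);
    return job.id;
  }

  // Called by the worker: take the oldest pending job, if there is one.
  claim(): Job | undefined {
    const job = this.jobs.find((j) => j.status === "pending");
    if (job) {
      job.status = "running";
      job.attempts += 1;
    }
    return job;
  }

  // Mark a job finished so it is never claimed again.
  complete(id: number): void {
    this.setStatus(id, "done");
  }

  // Put a failed job back so a later worker pass retries it.
  release(id: number): void {
    this.setStatus(id, "pending");
  }

  private setStatus(id: number, status: JobStatus): void {
    const job = this.jobs.find((j) => j.id === id);
    if (job) job.status = status;
  }
}
```

In a real database the claim step has to be atomic so two workers cannot take the same row. In PostgreSQL, for example, `SELECT ... FOR UPDATE SKIP LOCKED` is one common way to do that.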
Make every handler idempotent
Because delivery is at-least-once, you will receive the same event more than once, and your processing has to produce the same result whether it runs one time or five. That property is called idempotency, and it is the single most important thing to get right.
The reliable pattern is to dedupe on the provider's event id with a unique constraint, so even two deliveries arriving at the same moment cannot both proceed.
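// `db` and `applyEffects` are placeholders for your own data layer and business logic.
async function handleEvent(event: { id: string }) {
  // a unique index on event.id makes this race-safe across concurrent workers
  const fresh = await db.processedEvents.insertIfNew(event.id);
  if (!fresh) return; // already handled, skip silently
  // Where possible, run the insert above and the effects below in one transaction:
  // if applyEffects throws after the marker is committed, a retry would skip the event.
  await applyEffects(event); // the real work, applied once per event id
}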
For the writes themselves, prefer operations that are naturally idempotent: upserts keyed on a stable id rather than blind inserts, and "set status to shipped" rather than "increment a counter." If a step must call another system that charges money, pass that system an idempotency key too, so the safety extends across the whole chain.
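
The two ideas in that paragraph can be sketched as follows. This is an illustration under assumptions, not a real payment integration: the `orders` map stands in for a database table, and `FakePayments` stands in for a downstream API that honours idempotency keys. All names are hypothetical.

```typescript
type OrderStatus = "pending" | "shipped";

interface Order {
  id: string;
  status: OrderStatus;
}

// Stand-in for an orders table keyed on a stable id.
const orders = new Map<string, Order>();

// Upsert plus state set: running this once or five times leaves the same row.
function markShipped(orderId: string): void {
  orders.set(orderId, { id: orderId, status: "shipped" });
}

// Stand-in for a downstream payment API that dedupes on an idempotency key.
class FakePayments {
  private seen = new Map<string, string>();
  charges = 0;
  totalCents = 0;

  charge(amountCents: number, idempotencyKey: string): string {
    const prior = this.seen.get(idempotencyKey);
    if (prior !== undefined) return prior; // same key, same result, no second charge
    this.charges += 1;
    this.totalCents += amountCents;
    const chargeId = `ch_${this.charges}`;
    this.seen.set(idempotencyKey, chargeId);
    return chargeId;
  }
}
```

Deriving the key from the webhook's event id (for example `charge:${event.id}`) means a redelivered event reuses the same key automatically.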
Retries, ordering, and out-of-order events
When your endpoint returns an error or times out, the provider retries, usually with exponential backoff over minutes or hours. That is good, because it is how a brief outage heals itself, but it means a single event can span a long window, and events can arrive in an order that does not match when they happened.
Never assume order. If you receive an "updated" event before the "created" one, your code should not crash or create a duplicate. Lean on the data instead of the arrival sequence: every payload carries an id and usually a version or updated-at timestamp, so apply an event only if it is newer than what you already have, and treat a missing parent record as a signal to fetch the current state rather than to fail.
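
A minimal sketch of "apply only if newer", assuming each payload carries a monotonically increasing `version` and a full snapshot of the record. The field names and the `applyIfNewer` helper are hypothetical, and the `rows` map stands in for your own table.

```typescript
interface OrderEvent {
  orderId: string;
  version: number; // assumed to increase with every change at the provider
  status: string;
}

interface OrderRow {
  version: number;
  status: string;
}

// Stand-in for the local orders table.
const rows = new Map<string, OrderRow>();

// Returns true if the event changed local state, false if it was stale or a duplicate.
function applyIfNewer(event: OrderEvent): boolean {
  const current = rows.get(event.orderId);
  if (current && current.version >= event.version) return false;
  // No row yet, or an older one: take the event's snapshot as the new state.
  rows.set(event.orderId, { version: event.version, status: event.status });
  return true;
}
```

This sketch treats a missing row as "create it from the payload", which only works when the payload is a complete snapshot. If your provider sends partial updates, fetch the current record from its API at that point instead.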
When the webhook never arrives: reconciliation
Even perfect endpoints miss events, because the provider had an outage, a deploy dropped a request, or a queue hiccuped. Webhooks alone are not a guarantee, so the most robust integrations pair them with a periodic reconciliation job.
The idea is simple: on a schedule, pull the recent state from the source of truth (the provider's list API) and compare it to your own records, then fill any gaps. Webhooks keep you in sync in real time, and the reconcile pass catches whatever slipped through. This is also your recovery plan after an incident: replay the window you missed instead of hand-fixing rows.
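
The compare-and-fill step might look something like the sketch below. It assumes you have already fetched the recent records from the provider's list API into an array; the `RemoteOrder` shape, `findGaps`, and `reconcile` are hypothetical names, and the `local` map stands in for your own table.

```typescript
interface RemoteOrder {
  id: string;
  updatedAt: number; // assumed: Unix seconds of the last change at the provider
  status: string;
}

// Records that are missing locally, or older locally than at the provider.
function findGaps(remote: RemoteOrder[], local: Map<string, RemoteOrder>): RemoteOrder[] {
  return remote.filter((r) => {
    const mine = local.get(r.id);
    return !mine || mine.updatedAt < r.updatedAt;
  });
}

// Fill the gaps and report how many records needed repair.
function reconcile(remote: RemoteOrder[], local: Map<string, RemoteOrder>): number {
  const gaps = findGaps(remote, local);
  for (const r of gaps) local.set(r.id, r);
  return gaps.length;
}
```

The returned count is worth logging on every run: a number that is usually zero and suddenly is not tells you webhooks were dropped.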
Observability: logs, dead-letter, and alerts
You cannot fix what you cannot see, so treat every event as something to record. Log each delivery with its id, type, and outcome, keep failed jobs rather than discarding them, and route messages that fail repeatedly to a dead-letter queue where a human can inspect them instead of letting them retry forever.
Then alert on the things that mean trouble: a rising failure rate, a growing backlog, a dead-letter queue that is no longer empty. A webhook pipeline that is silent is not necessarily healthy. It might just be failing quietly, which is exactly the situation good logging and alerting exist to prevent.
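
One way to picture the log, dead-letter, and alert pieces together is the sketch below. It is deliberately simplified: the arrays stand in for a log sink and a dead-letter queue, and `MAX_ATTEMPTS`, `attempt`, and `shouldAlert` are hypothetical names rather than any library's API.

```typescript
// Hypothetical retry budget before a message is parked for a human.
const MAX_ATTEMPTS = 5;

interface Delivery {
  eventId: string;
  type: string;
  attempts: number;
}

const deadLetters: Delivery[] = []; // stand-in for a dead-letter queue
const log: string[] = []; // stand-in for structured logging

// Runs one processing attempt and records the outcome with id, type, and result.
function attempt(d: Delivery, work: (d: Delivery) => void): "done" | "retry" | "dead" {
  d.attempts += 1;
  try {
    work(d);
    log.push(`${d.eventId} ${d.type} ok attempt=${d.attempts}`);
    return "done";
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log.push(`${d.eventId} ${d.type} failed attempt=${d.attempts} reason=${reason}`);
    if (d.attempts >= MAX_ATTEMPTS) {
      deadLetters.push(d); // stop retrying and keep it for inspection
      return "dead";
    }
    return "retry";
  }
}

// Simplest possible alert rule: a non-empty dead-letter queue needs attention.
function shouldAlert(): boolean {
  return deadLetters.length > 0;
}
```

A production setup would add rules for failure rate and backlog size as well, but the dead-letter check alone catches the poison-message case.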
A reliable-webhook checklist
- Verify every request with a timing-safe HMAC check against the raw body, and reject stale timestamps.
- Acknowledge with a 2xx inside the provider timeout, then process in a background worker or queue.
- Dedupe on the event id with a unique constraint so duplicates are harmless.
- Use idempotent writes (upserts, state sets) and pass idempotency keys to downstream APIs.
- Do not depend on delivery order; apply events using their version or timestamp.
- Add a scheduled reconciliation job to catch missed events and to replay after incidents.
- Log every event, dead-letter poison messages, and alert on failure rate and backlog.
FAQ
Why am I getting duplicate webhook events?
Because providers guarantee at-least-once delivery, not exactly-once. Network blips, slow responses, and retries all cause the same event to arrive more than once, which is expected behavior. The solution is not to stop the duplicates but to make processing idempotent: dedupe on the provider's event id with a unique constraint, and use writes that produce the same result no matter how many times they run.
How do I verify a webhook is really from the provider?
Check the signature the provider sends, which is normally an HMAC of the raw request body using a shared secret. Compute the same HMAC on your side, compare it with a timing-safe function, and reject anything that does not match. Always verify against the exact raw bytes you received rather than re-serialized JSON, and if a timestamp is included, reject old requests to block replays.
Should I process the webhook synchronously in the handler?
No. Providers expect a fast 2xx, often within a few seconds, so doing slow work inline risks timing out and triggering retries. Verify the signature, push the event onto a queue or job table, and return 200 immediately. A background worker then does the real processing where it can take its time and retry safely without the provider re-sending the event.
What happens if my endpoint is down when an event fires?
Most providers retry failed deliveries with backoff for a while, so a short outage usually heals itself once you are back up. For anything longer, or for the rare event that is never redelivered, a scheduled reconciliation job that compares your records against the provider's list API will find and fill the gaps. Relying on retries plus reconciliation is what makes the integration genuinely durable.
Do I really need a queue for webhooks?
For low volume you can sometimes process inline, but a queue is what lets you acknowledge fast, absorb spikes, retry failures without the provider re-sending, and keep a dead-letter record of poison messages. On a serverless platform like Vercel, where execution time is capped, decoupling receipt from processing with a managed queue is the difference between an integration that holds up under load and one that drops events when it matters.
If you are wiring up payments, orders, or any integration where a missed event costs real money, tell me what you are connecting and I will give you a straight technical read on how to make the webhook side bulletproof.