Webhooks are great, until your destination server crashes or the network connection drops. By nature, webhooks work almost on a “fire and forget” basis, but in critical business processes (like receiving payments), “forgetting” is not an option.
In this article, we will examine how you can make your webhook architecture more resilient and bring data loss as close to zero as possible.
The service sending the webhook does not wait for a response forever. There is usually a timeout period of 5-10 seconds. If your endpoint does not respond within this time, the request is considered failed.
Solution: Never perform long-running operations (creating PDFs, sending emails, database reporting) on the endpoint that receives the webhook. The only thing the endpoint needs to do is accept the request, hand the payload off to a queue for later processing, and return 200 OK to the sender.
Every system can crash. Your server might be under maintenance, or there might be a momentary network error. In these cases, the webhook should not be lost.
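The acknowledge-first advice above (do no slow work on the endpoint, just queue the payload and return 200 OK) can be sketched roughly as follows. This is a minimal illustration, not a production design: `handle_webhook`, `slow_processing`, and the in-process `queue.Queue` are hypothetical names chosen for the example, and a real system would typically use a durable message broker so that queued events survive a restart.

```python
import queue
import threading

# Hypothetical in-process queue (assumption for this sketch); a durable
# broker would normally take its place.
events = queue.Queue()
processed = []

def handle_webhook(body: bytes) -> int:
    """Endpoint logic: store the raw payload and acknowledge immediately."""
    events.put(body)
    return 200  # respond before any slow work happens

def slow_processing(body: bytes) -> None:
    """Placeholder for the long-running work (PDFs, emails, reports)."""
    processed.append(body)

def worker() -> None:
    """Background worker: drains the queue outside the request path."""
    while True:
        body = events.get()
        try:
            slow_processing(body)
        finally:
            events.task_done()

threading.Thread(target=worker, daemon=True).start()
```

The endpoint's response time now depends only on the cost of enqueueing, so it stays well inside the sender's timeout regardless of how long the actual processing takes.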
Exponential Backoff: Instead of retrying a failed request immediately, wait longer before each subsequent attempt (for example 1 s, 2 s, 4 s, 8 s). This gives a struggling server time to recover instead of flooding it with retries.
Review the retry policy of your webhook provider (e.g., Stripe), and set up a retry strategy for the webhooks you send yourself.
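As a rough illustration of exponential backoff, the sketch below computes the wait time before each retry. The function name `backoff_delays` and its parameters (`base`, `factor`, `cap`, `jitter`) are assumptions made for this example; the actual schedule used by any given provider will differ.

```python
import random

def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0, jitter=0.0):
    """Return the wait time in seconds before each retry attempt."""
    delays = []
    for attempt in range(attempts):
        # Grow the delay exponentially, but never beyond the cap.
        delay = min(cap, base * factor ** attempt)
        # Optional jitter spreads retries out so that many senders
        # do not retry in lockstep (0.0 disables it).
        delay += random.uniform(0, jitter * delay)
        delays.append(delay)
    return delays

print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Capping the delay keeps late retries from drifting hours apart, and a small amount of jitter is commonly added so that many failed deliveries do not all come back at the same moment.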
The retry mechanism is great but has a dangerous side effect: the same request can arrive more than once.
If the 200 OK response does not reach the sender due to a network error, the sender sends the same webhook again. If your code is not prepared for this, you might charge the customer twice or create duplicate records in the database.
Solution: Webhook requests typically carry a unique ID (Event ID). Keep this ID in your database or Redis, and check it before processing each request.
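// Check whether this event ID has been seen before.
if redis.exists(event_id) {
    return 200 OK; // Already processed, don't do it again.
}
process_payment();
redis.save(event_id); // Remember the ID so that retries are ignored.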
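One caveat with the check-then-save pseudocode above is that two deliveries of the same event arriving at nearly the same time could both pass the `exists` check. A possible way around this is to claim the ID atomically before processing. The sketch below uses an in-memory set guarded by a lock purely as a stand-in; `ProcessedEvents`, `claim`, `release`, and `handle_event` are hypothetical names. With Redis, the equivalent atomic step would usually be a `SET` with the `NX` option, and with a database, an insert into a column with a unique constraint.

```python
import threading

class ProcessedEvents:
    """In-memory stand-in for Redis or a table with a unique key."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, event_id):
        """Atomically record event_id; return False if already recorded."""
        with self._lock:
            if event_id in self._seen:
                return False
            self._seen.add(event_id)
            return True

    def release(self, event_id):
        """Forget event_id so that a later retry can process it again."""
        with self._lock:
            self._seen.discard(event_id)

store = ProcessedEvents()
charges = []

def process_payment(event_id):
    charges.append(event_id)  # placeholder for the real side effect

def handle_event(event_id):
    if not store.claim(event_id):
        return 200  # duplicate delivery: acknowledge, do nothing
    try:
        process_payment(event_id)
    except Exception:
        store.release(event_id)  # let the sender's retry try again
        return 500
    return 200
```

Releasing the claim on failure is a design choice in this sketch: it lets the sender's next retry process the event instead of being silently treated as a duplicate.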
Since your endpoint is public, malicious actors can send fake webhook requests. To prevent this, verify the HMAC (Hash-based Message Authentication Code) signature of every request. The sender computes a keyed hash of the payload with a shared secret and sends the resulting signature in a header. You perform the same computation on the raw request body and check that the two values match, ideally with a constant-time comparison.
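A minimal sketch of this check, using Python's standard `hmac` and `hashlib` modules, might look like the following. It assumes HMAC-SHA256 with a hex-encoded signature; the exact algorithm, encoding, and header name vary by provider, and `sign` and `is_valid_signature` are hypothetical helper names.

```python
import hashlib
import hmac

def sign(secret: bytes, payload: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature of the raw payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def is_valid_signature(secret: bytes, payload: bytes, received: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, payload)
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The signature should be computed over the raw request bytes exactly as received; re-serializing parsed JSON can change whitespace or key order and cause valid requests to fail verification.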
What happens if all retry attempts fail? Instead of deleting the data, move it to a separate area called a “Dead Letter Queue” (DLQ). You can manually examine the failed records there later and process them again after resolving the problem.
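The dead letter flow described above can be sketched as follows. This is an illustrative outline only: `deliver_with_dlq`, `replay`, and the plain list used as the dead letter store are assumptions for the example, and the wait between attempts is omitted to keep it short.

```python
def deliver_with_dlq(event, handler, dead_letters, max_attempts=3):
    """Try handler up to max_attempts times; park the event on failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception as exc:
            # A real system would wait (with backoff) before retrying.
            last_error = exc
    # All attempts failed: keep the event and the reason for inspection.
    dead_letters.append(
        {"event": event, "error": repr(last_error), "attempts": max_attempts}
    )
    return False

def replay(dead_letters, handler):
    """Reprocess parked events once the underlying problem is fixed."""
    still_failing = []
    for entry in dead_letters:
        try:
            handler(entry["event"])
        except Exception as exc:
            entry["error"] = repr(exc)
            still_failing.append(entry)
    # Keep only the entries that failed again.
    dead_letters[:] = still_failing
```

Storing the last error alongside the event makes manual inspection easier, since the record shows why delivery failed as well as what was being delivered.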
Building a reliable webhook architecture requires queue management, retry strategies, and security measures.
If you don’t want to build all this infrastructure (Queue, Retry, DLQ, Logging) from scratch, you can use a Webhook Gateway like WebhookIO. WebhookIO receives all incoming requests for you, queues them, and securely delivers them to you when your endpoint is ready.