How We Implemented Webhooks

Webhooks are a big deal and are used by literally thousands of companies to communicate between systems. Since OpenTHC is web-enabled software that communicates with external systems, we also have to do webhooks. Incoming and outgoing webhooks.

One nice thing about webhooks is that they are not (typically) in the critical request path. That is, they are treated as a fire-and-forget type of process. Rarely do we see systems where a response body, with some logic applied, is required (but we do see them, and would argue those aren't webhooks but RPC).

Recently on Hacker News there were some posts about companies that provide services built around one side or the other of this process. Take a look at Svix for outgoing and HookDeck for incoming.

Incoming Webhooks

Incoming webhooks are when other systems want to yell data at us. This includes external providers such as PayPal, Plivo, Stripe, Twilio and WeedMaps. Some of our other projects (B2B, BONG, PIPE) also communicate with each other via webhooks, which we need to consume.

For incoming webhooks we need to make sure these systems are highly available and reliable. Dropping an incoming message could mean we miss sale information.

Our infrastructure for incoming webhooks ("WHIMP") is a geo/net-diverse HTTPS listener and router. A service runs on *:443 to accept these messages and is configured with the incoming routes and an internal routing table that tells it where to forward those requests. This allows our internal infrastructure to move around (pods-up, pods-down) and the outside world won't notice a thing.

WHIMP is configured (YAML) with an incoming route like /$SERVICE/$HASH, eg: /paypal/01FD5BKAC5EEJ6HA8KPXT5QV4Y. The external service is then configured to call this route, eg: https://whimp.example.com/paypal/01FD5BKAC5EEJ6HA8KPXT5QV4Y. The primary reason we make these paths hard to guess is to reduce noise in our logs. WHIMP accepts and stores the incoming request in plain file-system based storage (ie maildir), then looks up the backend route and forwards the request to that system with retry.
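To make that concrete, here is a minimal sketch in Go of what an accept-and-spool listener like this could look like. The route table, spool paths and backend URL are invented for illustration; the real WHIMP configuration lives in YAML and the forward-with-retry step runs as a separate process.

    // Minimal sketch of a WHIMP-style listener; route table, spool paths and
    // backend URL are invented for illustration, not the OpenTHC implementation.
    package main

    import (
        "fmt"
        "net/http"
        "net/http/httputil"
        "os"
        "path/filepath"
        "time"
    )

    // routes maps the hard-to-guess incoming path to an internal backend URL.
    // In the real service this table comes from the YAML configuration.
    var routes = map[string]string{
        "/paypal/01FD5BKAC5EEJ6HA8KPXT5QV4Y": "https://backend.internal.example/webhook/paypal",
    }

    func handle(w http.ResponseWriter, r *http.Request) {
        backend, ok := routes[r.URL.Path]
        if !ok {
            http.NotFound(w, r) // unknown path; this is just log noise
            return
        }

        // Capture the full request (method, headers, body) exactly as received.
        raw, err := httputil.DumpRequest(r, true)
        if err != nil {
            http.Error(w, "bad request", http.StatusBadRequest)
            return
        }

        // Maildir-style spool: write to tmp/ then rename into new/ so the
        // forwarder never sees a partially written file.
        name := fmt.Sprintf("%d.hook", time.Now().UnixNano())
        tmp := filepath.Join("spool", "tmp", name)
        if err := os.WriteFile(tmp, raw, 0o600); err != nil {
            http.Error(w, "spool failure", http.StatusInternalServerError)
            return
        }
        _ = os.Rename(tmp, filepath.Join("spool", "new", name))

        // A separate forwarder relays the spooled request to backend with retry;
        // the external sender only needs to know we accepted the message.
        _ = backend
        w.WriteHeader(http.StatusAccepted)
    }

    func main() {
        http.HandleFunc("/", handle)
        _ = http.ListenAndServe(":443", nil) // TLS termination omitted for brevity
    }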

WHIMP runs on a network isolated from the rest of our infrastructure; there are no VPNs or other privileged tunnels. Since we know its public IP, that's what is allowed through our firewall to the necessary systems, over HTTPS. The logs are sent to syslog (which goes to the centralised logging tools) and metrics are exposed for Prometheus.
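For the logging and metrics piece, the Go standard library plus the Prometheus client cover most of it. A rough sketch, with the syslog tag, counter name and port all illustrative rather than the actual WHIMP internals:

    // Rough sketch of wiring syslog logging and a Prometheus metrics endpoint
    // into a Go service; the tag, counter and port are illustrative only.
    package main

    import (
        "log"
        "log/syslog"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var hooksAccepted = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "whimp_hooks_accepted_total",
        Help: "Incoming webhook requests accepted and spooled.",
    })

    func main() {
        // Send application logs to the local syslog daemon, which relays
        // them on to the centralised logging tools.
        if w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "whimp"); err == nil {
            log.SetOutput(w)
        }

        prometheus.MustRegister(hooksAccepted)
        http.Handle("/metrics", promhttp.Handler())
        log.Print("metrics listener starting")
        _ = http.ListenAndServe(":9102", nil)
    }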

Outgoing Webhooks

Outgoing webhooks are when we are yelling data at other systems. The most popular is the webhook fired on sale transactions, but inventory updates and harvests are interesting too.

Similar issues with privileged access exist here, but we have no control over the availability of the remote endpoint. This means we must take responsibility for retry, which is the easy part. We also have to own logging the errors and making sure those can bubble up to the webhook subscriber.
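The retry loop really is the easy part. Here is a sketch of a delivery attempt with exponential backoff; the attempt count, backoff schedule and target URL are arbitrary, and in practice each failure also gets recorded so it can be surfaced to the subscriber:

    // Sketch of a retrying webhook delivery with exponential backoff; the
    // schedule is arbitrary and the target URL below is hypothetical.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    func deliver(url, contentType string, body []byte) error {
        backoff := time.Second
        for attempt := 1; attempt <= 5; attempt++ {
            resp, err := http.Post(url, contentType, bytes.NewReader(body))
            if err == nil {
                resp.Body.Close()
                if resp.StatusCode < 300 {
                    return nil // delivered
                }
                err = fmt.Errorf("endpoint returned %s", resp.Status)
            }
            // Log the failure so it can bubble up to the webhook subscriber.
            fmt.Printf("attempt %d to %s failed: %v\n", attempt, url, err)
            time.Sleep(backoff)
            backoff *= 2
        }
        return fmt.Errorf("giving up on %s after 5 attempts", url)
    }

    func main() {
        err := deliver("https://remote.endpoint.example.net/path/to/webhook?foo=bar",
            "application/json", []byte(`{"event":"sale"}`))
        if err != nil {
            fmt.Println(err)
        }
    }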

We use a tool called WHOMP for this. Our internal systems execute webhooks like https://whomp.example.com/https://remote.endpoint.example.net/path/to/webhook?foo=bar, sending the full HTTP message that should go out, headers and body. WHOMP interprets this as a request to https://remote.endpoint.example.net/path/to/webhook?foo=bar and will handle the retry process. WHOMP also pre-flight checks these requests to make sure they resolve in an expected way. There is a configuration (YAML) that provides a reject list of hostnames and IP addresses. This way we're not unintentionally sending hooks to, eg: PUT http://169.254.169.254/latest/api/token.
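A rough sketch of that pre-flight check in Go, with a made-up reject list standing in for the YAML configuration:

    // Sketch of a WHOMP-style pre-flight check: resolve the target host and
    // refuse anything on the reject list or in private/link-local/loopback space.
    // The reject list here is invented; the real one comes from YAML config.
    package main

    import (
        "fmt"
        "net"
        "net/url"
    )

    var rejectHosts = map[string]bool{
        "169.254.169.254": true, // cloud metadata service
        "localhost":       true,
    }

    func preflight(target string) error {
        u, err := url.Parse(target)
        if err != nil {
            return err
        }
        host := u.Hostname()
        if rejectHosts[host] {
            return fmt.Errorf("host %q is on the reject list", host)
        }
        addrs, err := net.LookupIP(host)
        if err != nil {
            return fmt.Errorf("cannot resolve %q: %w", host, err)
        }
        for _, ip := range addrs {
            if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() {
                return fmt.Errorf("%q resolves to internal address %s", host, ip)
            }
        }
        return nil
    }

    func main() {
        // Everything after the WHOMP hostname is the URL to forward to.
        fmt.Println(preflight("https://remote.endpoint.example.net/path/to/webhook?foo=bar"))
        fmt.Println(preflight("http://169.254.169.254/latest/api/token"))
    }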

We run multiple WHOMP services, but they are on isolated public networks, that is, no VPNs or tunnels to our core infrastructure. The logs are sent to syslog (which goes to the centralised logging tools) and metrics are exposed for Prometheus.

Why Not Hosted Services?

The primary reason we are not using the hosted solutions mentioned at the start is that we had these systems in place already. If it ain't broke, fix it 'til it is.

Another key benefit of our self-hosted solutions is that they make sure we're treating PII according to the requirements of the various regulatory agencies we have to work with (our line of business has lots of government oversight).

There are also moderate savings on an ongoing cost basis. WHOMP and WHIMP run on bog-standard systems, eg: Debian 9+, FreeBSD. And because they are written in Go, each is a single binary to drop in and run, so deployment is trivial to a VM or Pod or whatever. Basically an HA environment can be created for $10/mo in hosting.

For our team deploying a new box is pretty easy, and there are loads of tools to help with this, from Bash to K8s. It's fairly easy for even small teams to deploy narrowly-scoped services using tools they are already comfortable with. Self-hosting also provides tighter integration with the main-line applications and perhaps easier testing (if you do it the "right way").

A popular virtue of paid-hosted is that it's "not your responsibility", but when (not if) a system is having an outage, I like to have control. With WHOMP and WHIMP we can move quickly between $PROVIDER using existing deploy tools. And when these paid-hosted providers are simply configuring AWS the same way we are for HA, it feels like added complexity for the same service level.

It's possible to talk circles about the virtues of self-hosted vs paid-hosted. It should be clear from this post that we stan self-hosted. Please forgive me if I didn't touch on your favourite talking point.