LeafData Outage July 12th, 2021
On July 12th, 2021 the LeafData system in Washington suffered a severe outage. The system was offline for fourteen (14) hours.
Times are given in UTC, followed in parentheses by the local time (America/Los_Angeles).
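For anyone who wants to check the conversions, here is a minimal sketch using Python's standard library. The helper name `utc_to_local` is ours, for illustration only; it assumes the timestamps are written as `YYYY-MM-DD HH:MM:SS`.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# The local zone used throughout this write-up.
LOCAL_TZ = ZoneInfo("America/Los_Angeles")


def utc_to_local(stamp: str) -> str:
    """Convert a 'YYYY-MM-DD HH:MM:SS' UTC timestamp to local time."""
    utc_time = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(LOCAL_TZ).strftime("%Y-%m-%d %H:%M:%S")


print(utc_to_local("2021-07-12 05:59:56"))  # 2021-07-11 22:59:56
```

Note that an early-morning UTC time lands on the previous calendar day locally, which is why the first event below shows two different dates.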
A Redis Blip
At 2021-07-12 05:59:56 (2021-07-11 22:59:56 local) we saw a small blip on Redis:
Error: Error while reading line from the server. [tcp://wa-redis-prod-0313.xzqqn1.0001.usgw1.cache.amazonaws.com:6379]
But that is not too uncommon, and it was only observed once. It's common on AWS for the underlying network to "wiggle". This can be caused by AWS management magic (failover, HA, automatic provisioning, etc.) or by intended configuration changes such as a migration.
In short, this is a non-issue.
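Part of why a one-off read error is harmless is that clients typically just try again. The following is only a sketch of that idea, not anything taken from LeafData or from our own integration; the names `retry` and `flaky_read` are hypothetical, and the simulated failure stands in for a real Redis call.

```python
import time


def retry(operation, attempts: int = 3, base_delay: float = 0.1):
    """Call operation(); on ConnectionError, wait a little and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts, surface the error
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated "blip": the first call fails, the second succeeds.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("Error while reading line from the server.")
    return "value"


print(retry(flaky_read))  # value
```

A single failure is absorbed by the second attempt. A failure that persists across every attempt, like the database error later that night, is a different matter.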
No Database Connection
At 2021-07-12 09:27:49 (02:27:49 local) the database connection started failing. The system was offline.
Error: SQLSTATE[HY000] [2003] Can't connect to MySQL server on 'wa-prod-post-1.cljbi63ajzfp.us-gov-west-1.rds.amazonaws.com' (111) (SQL: select * from `users` where `api_key` is not null and `users`.`deleted_at` is null)
The critical part, cleaned up a little, is: Can't connect to MySQL server on wa-prod-post-1.cljbi63ajzfp.us-gov-west-1.rds.amazonaws.com. It means the web application cannot connect to its database server. The (111) is the operating system's error code, which on Linux means "connection refused". The hostname also shows us that they are using RDS in AWS GovCloud (us-gov-west-1).
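The same condition can be checked from the outside with a plain TCP probe. This is a rough sketch under our own assumptions: `probe_tcp` is a hypothetical helper, port 3306 is simply the MySQL default, and we have no knowledge of what monitoring the vendor actually runs.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Try to open a TCP connection; return (reachable, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except ConnectionRefusedError:
        # errno.ECONNREFUSED is 111 on Linux, the number seen in the error above.
        return False, f"connection refused (errno {errno.ECONNREFUSED})"
    except socket.timeout:
        return False, "timed out"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on.
        return False, f"network error: {exc}"


if __name__ == "__main__":
    # Placeholder target; substitute the database host you care about.
    reachable, detail = probe_tcp("127.0.0.1", 3306)
    print(reachable, detail)
```

Run on a schedule with an alert attached, a check like this reports a refused connection within minutes.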
Status Update (T + 06:24)
A little more than six hours after the outage started, at 2021-07-12 15:51 UTC (08:51 local), we received a message from the LCB.
Good Morning. LCB and MJ Freeway are aware that there is a issue with the Leaf application.
MJ Freeway is responding. LCB will send further updates at 1 pm if a resolution is not achieved before then.
Another at (T + 07:02)
At 2021-07-12 16:30:00 (09:30 local) we received a second message from the LCB.
Licensees are reporting that they experiencing problems with the Leaf traceability system this morning. This brief message is to notify you that LCB and MJ Freeway are aware of the issue and working to resolve ASAP.
LCB or Leaf will provide an update later today if the issue continues. Thank you for your patience as we work to resolve the problem quickly.
Fixed! (T + 14:04)
At 2021-07-12 23:31:11 (16:31:11 local) the system appeared to be back online and responding as expected.
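The total downtime follows directly from the two timestamps we logged. A small sketch of the arithmetic; `elapsed` is just an illustrative helper name, and both times are the UTC values given above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def elapsed(start: str, end: str) -> str:
    """Return the time between two timestamps as HH:MM:SS."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total = int(delta.total_seconds())
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


outage_start = "2021-07-12 09:27:49"  # first failed database connection (UTC)
restored = "2021-07-12 23:31:11"      # system responding again (UTC)
print(elapsed(outage_start, restored))  # 14:03:22
```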
After-Thoughts
It's confounding how a provider in AWS could have an outage this long. What was going on that it took so long for anyone to even notice the outage? It went offline at roughly 2:30 AM Pacific time and it took six (6!) hours for any notification to be sent! Then it took another seven (7!) hours for the service to be restored!
This is a critical system for 100% of the cannabis industry in Washington State. Imagine if your business was blocked from any work for this long.
It's really unfortunate that after this event there was no post-mortem provided by either the vendor or the agency. Perhaps it works in the LCB's favour that they've cancelled their monthly technology-integrator meeting -- it's easier for them to ignore their responsibility to the citizens.
This is one of the primary reasons we push for government to run internally hosted, open source solutions. Open source lets citizens observe what the systems are doing, a requirement for trust. And the agency that has responsibility for the system can actually own it and fulfill its obligations to the citizens. With closed source we get "there's nothing we can do, oh well" -- exactly the kind of apathy we don't want.