I’m currently preparing the spratshop.com architecture for a truly distributed set-up with multi-master database replication and automatic load-balancing between the nodes. The spratshop system uses C++ and Java for the back-end services and Rails for the front-end. We have a set of rake tasks that run hourly or daily to monitor, maintain and clean up the system.
In the intended multi-node deployment, these rake tasks will still need to be executed, but I cannot pin them all to a single node because otherwise the system would break as soon as that node has problems. As such, the best way is to run these maintenance cronjobs on every node, to make sure they will continue to execute even when nodes fail.
However, certain maintenance tasks, such as invoice generation, should never run on two different nodes simultaneously, as that might lead to one customer getting multiple invoices. The easy approach would probably be to store a “running” flag inside the database for each cronjob and to make the cronjob abort if the running flag is already set. This, however, will break if one node crashes while the cronjob is running, because the flag may never be reset and the job would then be blocked on every node. Also, some of my daily cronjobs are used to monitor and back up the database, so I’d like to keep the locking separate from the database itself.
Current solutions
I looked around to see if I could find a good solution for distributed cronjobs and I found Chronos by airbnb, which aims to provide a complete distributed cron replacement. I also found a cronjob locking tool named get_zk_lock by Parse. However, it always enforces exclusive execution for a fixed time of 30 seconds. This means that if my cronjob takes longer than those 30 seconds, two cronjobs might still run simultaneously because the first lock has expired. Some of my cronjobs take 15 to 30 minutes, so this solution is not usable for me.
Introducing zoo-locked
To solve this problem, I built a very simple tool out of the ZooKeeper locking recipe which will acquire an exclusive lock using ZooKeeper and then execute an arbitrary program/script through popen. While the task is running, the lock will be held. Should the tool crash for whatever reason, the ZooKeeper connection will time out, which will release the lock.
The tool is written in C and has no dependencies apart from the zookeeper_mt lib.
You can find the source code on github.com/fxtentacle/zoo-locked.