30 January 2019 / DEVOPS, AWS

Self healing systems

Many years ago, I was on a team that supported an application that required a weekly restart, according the vendor who was unable or unwilling to fix the slow memory leak that plaugued the sytem. It was a scheduled maintenance window in which a human ran a script that restarted the application. Every Wednesday at 4pm whoever was on-call that week fired off an email to users reminding them of the scheduled outage beginning at 5pm. Then at 5pm, the restart script was manually run. The on-call person waited for the script to complete, verified the app was up and sent an email saying the outage was over. What a waste of time!

This work was done in order to proactively heal the system before the known memory leak became a problem. Awsome idea! Everyone likes being proactive! This manual process went on for years before I convinced folks that this needed to be automated. But automating this process was only the first step.

The self healing system is monitored In order for a system to heal itself, there needs to be monitoring. How do we know if the system needs to be restarted? Well, it’s been a week, so it probably need to be restarted. I guess that’s a start, so we automated the weekly restart. Ok, done! Self healing achieved! Not quite. This is just automation running on a schedule. A much better solution would be one that knows when to restart based on real data. Setting up monitoring that collects data like JVM heap usage, active thread count, garbage collection time spent, CPU usage, etc, would allow us to trigger the automation when certain conditions are present. This is much better because the once a week restart was just a guess at what it would take to keep the app running healthy. The reality was that we would sometimes need to restart the application more than once a week. These were always emergencies, and they were very disruptive to both users and the ops team. Having something in place that could perform the proactive restart as needed would be so much better. Combine that with a high availability solution to eliminate downtime and we have a self healing system!

Steps to get to a self healing system…

If humans are doing it over and over, or on a schedule, then it’s a good candidate for automation
Collect data through monitoring tools and look for patterns that align with the symptoms.
Trigger the automation based on conditions identified via monitoring
Adjust thresholds and triggers

In the modern world, the use of application load balancers, auto scaling groups, and ephemeral systems makes it much easier to build a self healing system. Ephemeral systems eliminate the drift that was inherent to old long lived servers in the data center. If the system starts misbehaving, you can kill it and let the ASG fire up a new one at a known configuration. You’ll want to try to recreate the conditions that caused the issue in your test environment, in order to get to the root cause of the issue. If it’s not clear what conditions may have caused the problem, you’ll want to add more logging or monitoring to help lead you to the answer.

Self healing systems

Streaming TV while traveling the world

Why I switched from Wordpress to Jekyll