Bogdan Katishev works as an Open Source consultant for the VRT CORE team, which is part of the Digital Production Center (DPC). The goal of this team is to support product delivery, product stability and security. The VRT CORE team provides all DPC teams with stable, market-compliant and cost-efficient streams to integrate into an end product.
Continuous updates: a buzzword that overlaps with many other popular terms, such as rolling release, rolling update, or continuous delivery. All of these describe the same concept: the frequent delivery of updates to software or applications. In this article you can read how we reduced our update times from days to a couple of hours.
In reality there are a lot of practical issues involved when trying to keep software up to date.
Human error is one of those issues. Manual deployments require knowledge and the involvement of one or more people. Mistakes are easily made when deploying updates in large environments, and they range from a lack of testing to simply overlooking something.
Another issue: a deployment fails and there is no “plan B”. Even a solid, fully tested deployment can break down. There is never a 100% guarantee of success when deploying updates, so you will always need a backup plan.
On top of that, an update involves many steps, and each of those steps can go terribly wrong, which could result in downtime.
The amount of time and manual work that is required made us rethink the process and look at how all of these steps could be automated instead.
At the CORE team, we re-evaluated each of the steps we used in the past and came up with the following building blocks to safely automate our updates:
- Status viewer
- Type of servers
- Rollback strategy
- Deployment strategy
Splitting up the automation procedure
The first question we had was: “How will we know when a new update is available?” Of course we want an overview of all the updates that need to be done. After an update we also want to store this “new” latest version number somewhere and keep a history of all the previous versions.
As an answer to the first question, we wrote a Jenkins Pipeline which fetches the latest versions of all the software we use from the GitHub API. At the end of the Pipeline, we send the output to our team Slack channel. We also scheduled this Pipeline to run every week, so we can keep track of newly available updates.
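To illustrate the idea, here is a minimal Python sketch of such a version check. The repository names are placeholders; the real implementation is a Jenkins Pipeline that also posts the result to Slack:

```python
import json
import urllib.request

def fetch_latest_release_tag(repo: str) -> str:
    """Fetch the latest release tag of a GitHub repo, e.g. 'prometheus/prometheus'."""
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["tag_name"]

def find_outdated(current: dict, latest: dict) -> dict:
    """Return {software: (current, latest)} for every package that is behind."""
    return {
        name: (current[name], latest[name])
        for name in current
        if name in latest and current[name] != latest[name]
    }
```

In this sketch, the output of `find_outdated` would be what ends up in the weekly Slack message.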
For the second problem, where to store this ‘new’ latest version, we researched many tools (SSM Patch Manager, SSM Inventory, Foreman, AWX, Prometheus/Grafana, etc.) and in the end we decided to use the AWS SSM Parameter Store: we (over)write the new version to the Parameter Store before each update. This way we also keep a history of all the stored versions.
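A sketch of that write, assuming boto3 and a parameter naming scheme invented for this example (the real parameter paths differ). Parameter Store versions every overwrite, which is exactly the history we are after:

```python
def parameter_name(software: str, team: str = "core") -> str:
    """Build a (hypothetical) parameter path, e.g. /core/versions/nginx."""
    return f"/{team}/versions/{software}"

def record_new_version(software: str, version: str, team: str = "core") -> str:
    """(Over)write the latest known version to the SSM Parameter Store."""
    import boto3  # imported lazily so the sketch stays importable without AWS

    name = parameter_name(software, team)
    ssm = boto3.client("ssm")
    ssm.put_parameter(Name=name, Value=version, Type="String", Overwrite=True)
    # Earlier values stay retrievable via ssm.get_parameter_history(Name=name)
    return name
```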
Type of servers
We use 3 kinds of servers in our team: immutables, Puppet-managed and Elastic Beanstalks. This diversity makes the entire setup more complex because each ‘type’ of server has its own life cycle.
Updating immutable instances
To update an immutable server, we have a Jenkins Pipeline in place. This Pipeline builds a new AMI (Amazon Machine Image). We then use Puppet to bootstrap everything and to fetch the latest value from the SSM Parameter Store, so the latest version of the software we need gets installed. After the AMI build we run some final tests on the image itself: we check whether certain packages are available and run some commands to verify that they work.
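Those final image tests can be as simple as the following sketch (the package and command names are illustrative):

```python
import shutil
import subprocess

def commands_available(commands) -> bool:
    """Check that every expected executable is on the PATH of the new image."""
    return all(shutil.which(cmd) is not None for cmd in commands)

def command_works(argv) -> bool:
    """Run a command, e.g. ['nginx', '-t'], and report whether it exits cleanly."""
    return subprocess.run(argv, capture_output=True).returncode == 0
```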
Updating Puppet-managed instances
To update a Puppet-managed instance, we have a Jenkins Pipeline in place which first writes the new version we want to install to the SSM Parameter Store for the staging instance(s). Then we trigger a Puppet run on those instances. After that, we perform health and readiness checks on the staging instance(s) to verify that the service responds to certain requests. We then repeat those three steps for our production instance(s).
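In a sketch, the per-instance part of that Pipeline looks like this (the health check URL is a placeholder, and the real orchestration lives in Jenkins). Note that `puppet agent -t` uses detailed exit codes, so exit code 2 is also a success:

```python
import subprocess
import urllib.request

# `puppet agent -t` uses detailed exit codes:
# 0 = no changes, 2 = changes applied successfully, 4/6 = failures occurred.
PUPPET_OK = (0, 2)

def puppet_run_succeeded(exit_code: int) -> bool:
    return exit_code in PUPPET_OK

def update_instance(health_url: str) -> bool:
    """Trigger a Puppet run, then verify the service still responds (sketch)."""
    result = subprocess.run(["puppet", "agent", "-t"])
    if not puppet_run_succeeded(result.returncode):
        return False
    with urllib.request.urlopen(health_url, timeout=5) as resp:
        return resp.status == 200
```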
Updating Elastic Beanstalks
Luckily for us, Elastic Beanstalks have an option called: managed platform updates. This service schedules automatic platform updates, in our case every week, on all of our Elastic Beanstalk environments. On top of that, we have monitoring that tells us if a stack has been behind a version for more than 7 days. This could indicate a misconfiguration in one of the Beanstalks (the managed platform updates option not being enabled) or that the platform updates are simply failing.
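That monitoring check boils down to a date comparison, sketched here with made-up environment names; the real check reads the platform version data from AWS:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=7)

def stacks_behind(superseded_since: dict, today: date) -> list:
    """Flag environments whose platform version was superseded more than
    7 days ago, a sign that managed updates are disabled or failing."""
    return sorted(
        env for env, since in superseded_since.items()
        if today - since > MAX_AGE
    )
```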
Deployment strategy
The third thing to take into consideration is choosing the right strategy to roll out new software. This is important because each application or server has different requirements, one of them being availability.
There are a lot of different deployment strategies that can be used. One of the most basic and most commonly used is the rolling deployment strategy.
A rolling deployment gradually replaces old servers/applications with the newer version of the software. It usually waits until the new instances are ready, verified via a health or readiness check, before it starts removing the old instances.
We use the rolling deployment strategy when updating our immutable instances that are in AWS Autoscaling Groups.
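The core of a rolling deployment can be sketched in a few lines (this is a simplification; in our case the Autoscaling Group performs these steps for us, gated by health checks):

```python
def rolling_deploy(old_instances, launch_new, is_healthy, terminate):
    """Replace old instances one at a time; stop as soon as a new
    instance fails its health/readiness check."""
    for old in old_instances:
        new = launch_new()
        if not is_healthy(new):
            terminate(new)   # abort: remaining old instances stay untouched
            return False
        terminate(old)       # new instance is ready, old one can go
    return True
```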
We have a mix (yes, welcome to reality) when it comes to deploying updates to our Puppet-managed instances.
Some of the services that we use do not support a HA (high-availability) setup. For these servers, we usually calculate and announce 5–10 minutes of expected downtime.
Some of our Puppet-managed instances, that run services with support for a HA setup, are in ASGs (Autoscaling Groups). Here we also use the rolling deployment strategy.
We have a mix when it comes to deploying updates to our Elastic Beanstalks: every team at VRT can choose its own preferred deployment policy, since every Elastic Beanstalk runs a different technology stack.
Each deployment policy also has its own pros and cons.
Some of our Beanstalks use the immutable deployment policy. The benefits of this strategy are: the impact of failed deployments is minimal, there is zero downtime, and code is deployed to new instances only.
Some of our other Beanstalks use the all-at-once deployment policy. The benefits of this strategy are: deployments are simple and fast, and all versions are immediately in sync.
Rollback strategy
Having a rollback strategy is very important, because in reality every update or update attempt can go completely wrong. Sometimes, when installing the latest version of a specific software package, unforeseen bugs or breaking changes appear, which result in a broken application or server. When this happens, you need some kind of rollback strategy in place to avoid downtime.
We, at the CORE team, already experienced these kinds of situations in the past and it was not a pretty sight.
Most of our immutables are in AWS Auto Scaling groups (ASGs), which support the rolling deployment strategy. We also attach lifecycle hooks to these ASGs: custom health checks which we use to determine the health of the service during a rolling deployment. As explained in the previous section, if a significant issue occurs, the rolling deployment is aborted and the old instances remain available and untouched.
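In boto3 terms, a lifecycle hook check ends by reporting a verdict through `complete_lifecycle_action`; the hook and ASG names below are placeholders:

```python
def lifecycle_result(service_healthy: bool) -> str:
    """Map our custom health check outcome to the lifecycle hook verdict."""
    return "CONTINUE" if service_healthy else "ABANDON"

def complete_hook(instance_id: str, healthy: bool,
                  asg: str = "my-asg", hook: str = "health-check-hook") -> None:
    """Report the health check result back to the ASG (sketch)."""
    import boto3  # imported lazily so the sketch stays importable without AWS

    boto3.client("autoscaling").complete_lifecycle_action(
        LifecycleHookName=hook,
        AutoScalingGroupName=asg,
        InstanceId=instance_id,
        LifecycleActionResult=lifecycle_result(healthy),
    )
```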
We try to limit the impact of failed updates in production by testing them properly on our staging servers first. In our “software-rollout” Pipeline, we first roll out the update on the staging server(s). After that we have an extra Pipeline stage called: acceptance testing.
In this stage we execute some basic and custom health checks and tests. A few examples: check whether the applied Puppet run returned without errors, and check whether certain endpoint(s) of the service respond. In this stage we can be extremely flexible and specific with our tests, because every software/service is different; this way we can focus our tests on the most critical parts of the specific software we are testing.
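A sketch of what such a stage can look like; the individual checks are deliberately tiny, pluggable functions, and the endpoint check shown here is an illustration:

```python
import urllib.request

def endpoint_responds(url: str, expected_status: int = 200) -> bool:
    """Check that a service endpoint answers with the expected status."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == expected_status
    except OSError:
        return False

def run_acceptance_tests(checks: dict) -> list:
    """Run named checks (zero-argument callables) and return the failures."""
    return [name for name, check in checks.items() if not check()]
```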
As mentioned before, we use the managed platform updates option for our Elastic Beanstalks. This option also has a rollback mechanism: it relies on enhanced health reporting to determine whether the application is healthy enough for the update to be considered successful. We use so-called ebextensions to achieve this: if one of our custom ebextensions fails (exit code > 0) during a deployment, a rollback is triggered automatically.
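As an illustration, such an ebextension can be as small as a single container command; the file name and health endpoint below are made up. A non-zero exit code fails the deployment, which makes the managed update roll back:

```yaml
# .ebextensions/01-healthcheck.config (illustrative)
container_commands:
  01_service_responds:
    command: "curl -fsS http://localhost/health"
```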
Before, in the stone age, getting software to the latest version took us approximately one day of work per person, and that is for simple single-instance setups. More complex setups took about 2–4 days of work per person, sometimes even more.
Now, with the re-evaluated steps and automated processes that we use, it takes us about 1–4 hours of work per person to keep our software up to date. We also trust our automated processes enough to let them run on a weekly schedule.
This way we have time to work on more important tickets in our sprint, rather than spending all of our time keeping software up to date.
Next steps
Nobody is perfect, not even automated processes. We are always trying to improve our automation with new ideas and new ways of working.
For the future, we are aiming to make more of our servers immutable. Of course not every technology stack can be made into an immutable server, so we need to be reasonable there.
We are also aiming to make our tests more solid and generic to achieve proper testing outcomes for our different technology stacks. Updates about our progress in the next post…