In the previous article, “Zero Downtime Deployment”, I introduced a method called ‘side resource deployment’. It is not easy to perform and not as elegant as methods like Blue/Green or Canary deployments, but we have to apply it in many real projects. You can also consider feature flags to further enhance these methods. In this article, I will show you how this method solves the problem of replacing immutable servers in real projects.
Use case
Imagine that you design your system as an N-tier architecture on the AWS cloud, as described below.
The diagram above shows a 2-tier architecture: one instance runs the application and one server hosts the database. No load balancer or Auto Scaling group is applied, so it is impossible to scale this server horizontally. In other words, only one instance runs the application and serves traffic at any given time.
This architecture looks simple and is easy to manage with Terraform. However, replacing this instance with a new one while end users can still access the application is another story. To be honest, this is one of the most difficult implementations.
Imagine an instance is running in a production environment and the dev team asks you to install Redis and some PHP extensions on it. What is your solution for this case?
- Access the server and run apt-get install redis? Yes, it works perfectly, until the server unfortunately encounters a problem and you have to replace it with a new one.
- My approach is to define all package installation as code, whether a shell script, Ansible, or similar, to have a fully automated deployment. I would then replace the running server with a new one that carries these settings.
Next, I will demonstrate a deployment method that replaces a server with a new one with minimal downtime. It mainly focuses on the deployment method, so it does not explain the Terraform code in much detail.
How to migrate an immutable server with Terraform
Below is my current module to deploy the instance:
module "ec2-instance" {
source = "../../modules/ec2-instance"
name = "ec2-instance"
instance_type = "t3.small"
ami = data.aws_ami.debian.id
user_data = data.template_file.user-data.rendered
}
The user data of this instance deploys the components below (a sketch follows the list).
- Nginx as a proxy server that forwards traffic to the frontend and backend.
- Certbot to request an SSL certificate for the domain.
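For reference, here is a minimal sketch of what such a user-data script might contain. The domain and email are placeholders, and the real template is rendered through data.template_file.user-data:

#!/bin/bash
# Install the proxy stack (assumes a Debian-based AMI).
apt-get update
apt-get install -y nginx certbot python3-certbot-nginx

# Placeholder domain and email; the real values come from template variables.
certbot --nginx --non-interactive --agree-tos \
  -m admin@example.com -d example.com

systemctl enable --now nginx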
Server replacement with minimal application downtime, step by step
- Duplicate the deployment module and change the resource name to “ec2-instance-new”:
module "ec2-instance-new" {
source = "../../modules/ec2-instance"
name = "ec2-instance"
instance_type = "t3.small"
ami = data.aws_ami.debian.id
user_data = data.template_file.user-data.rendered
}
- Apply Terraform to create the new instance.
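Reviewing the plan first is a cheap safety net; only the resources of the new module should appear as additions:

terraform plan   # only module.ec2-instance-new resources should be added
terraform apply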
When this instance starts, the nginx service fails because the certificate is missing: the domain record has not been updated to the new instance’s public IP, so Certbot cannot authenticate the domain owner and no certificate is issued.
In my case, I used an Elastic IP to give the instance a fixed public IP address, so instead of modifying the record on Route 53, I point the Elastic IP at the new instance:
resource "aws_eip" "public_ip" {
vpc = true
instance = module.ec2-instance-new.id
}
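As a variation (not what my module does), you can keep the EIP resource itself untouched and switch only the association; this sketch assumes the modules export an id output:

resource "aws_eip" "public_ip" {
  vpc = true
}

# Re-pointing the EIP is then a matter of changing instance_id only.
resource "aws_eip_association" "public_ip" {
  allocation_id = aws_eip.public_ip.id
  instance_id   = module.ec2-instance-new.id
}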
Run terraform apply again; this time Certbot is able to request a certificate on the server, and nginx works perfectly. When the cloud-init job is done, the application starts and serves traffic from the new instance.
To be honest, some downtime already exists during the launch of the new instance, and its length depends entirely on how you install tools, configure them, and start the applications. If the application needs lots of tools and you put all of the installation in user data, it will inevitably take several minutes before the application starts. So I highly recommend using a custom AMI with all the needed tools pre-installed: reducing instance startup time allows your application to start quickly.
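As an illustration, here is a minimal Packer (HCL2) sketch for baking such an AMI; the region, AMI name, and package list are assumptions to adapt to your project:

packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.0"
    }
  }
}

locals {
  # Timestamp suffix keeps AMI names unique between builds.
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "debian" {
  region        = "eu-west-1"        # assumption: pick your region
  instance_type = "t3.small"
  ami_name      = "app-base-${local.timestamp}"
  ssh_username  = "admin"

  source_ami_filter {
    filters = {
      name                = "debian-11-amd64-*"
      virtualization-type = "hvm"
    }
    owners      = ["136693071363"]   # official Debian AMI owner
    most_recent = true
  }
}

build {
  sources = ["source.amazon-ebs.debian"]

  # Pre-install everything the user data used to install at boot.
  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx certbot python3-certbot-nginx redis-server",
    ]
  }
}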
A good deployment practice always comes with a quick rollback to the previous version. Typically, if the server initialization process is good enough, you won’t have issues with this new deployment. But errors can happen; we cannot eliminate them, only prevent them. Rolling back with this deployment method is easy: in the step above we pointed the Elastic IP at the new instance, so we just point it back to the previous instance and re-run terraform apply.
resource "aws_eip" "public_ip" {
vpc = true
instance = module.ec2-instance.id
}
- Migrate the Terraform state and clean up the code
When the deployment has succeeded, the old server needs to be destroyed to save cost.
Since we are about to modify the Terraform state, back up the state file first so you can revert if something goes wrong. Believe me, you will cry if you don’t have a backup and the Terraform state ends up conflicted.
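A local copy is enough; for example, pull the current state into a backup file:

terraform state pull > terraform.tfstate.backup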
Next, rename the resources with terraform state mv. Start with the old EC2 instance: change its module name in the code, then move the state entry to match:
terraform state mv module.ec2-instance module.ec2-instance-backup
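In the code, the renamed block for the old instance would look like this (same arguments, new module name):

module "ec2-instance-backup" {
  source        = "../../modules/ec2-instance"
  name          = "ec2-instance"
  instance_type = "t3.small"
  ami           = data.aws_ami.debian.id
  user_data     = data.template_file.user-data.rendered
}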
Do the same for the new EC2 instance:
terraform state mv module.ec2-instance-new module.ec2-instance
You also need to update the Elastic IP so it points to the correct instance; after the renames, the reference for the new instance is module.ec2-instance.id again.
Run terraform apply to refresh the state, then remove the ec2-instance-backup module block and apply again to destroy the old instance.
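Sketched as commands, the end of the procedure looks like this; the first apply should replace nothing if the state renames match the code:

terraform apply   # refresh: no instances should be replaced here

# After deleting the "ec2-instance-backup" module block from the code:
terraform apply   # destroys the old instance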
Conclusion
This deployment method is useful for standalone instances without an Auto Scaling group or load balancer. It makes it easy to switch to a new instance and to roll back if any errors occur. As I mentioned above, this approach still has downtime due to instance startup time and the Let’s Encrypt SSL certificate. We will address this issue in the next part of the Zero Downtime Deployment series.
Would you like to read more articles by Tekos’s Team? Everything’s here.