The blog post you’re reading is hosted on a private Kubernetes cluster that runs inside my home. Another workload running on the same cluster is atuf.app – Amsterdam Toilet & Urinal Finder (blog post), a web app I created to help keep Amsterdam’s streets clean by helping folks find nearby toilets & urinals.
Since its release, atuf.app has been featured on popular websites like DutchNews.
As the availability & uptime of the ATUF app, this blog and a few other private workloads became even more crucial, I kept thinking about how to improve the reliability of these services.
Problem of running apps on a private Kubernetes cluster?
I created my private cloud & Kubernetes cluster with reliability in mind. It’s running on 3 RPI nodes, so even if one RPI node died, it would continue running as if nothing happened. And in case all 3 nodes went out, I created the rpi-microk8s-bootstrap project, which can be used for automated provisioning & setup of Ubuntu server & MicroK8s on a Raspberry Pi (RPI) node using Terraform. This allows me to re-create and recover my cluster in a matter of minutes.
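To make the "one node can die" claim concrete, here is a minimal sketch of the arithmetic behind it. It assumes the cluster datastore relies on majority-quorum consensus (as MicroK8s does in its high-availability mode); the function name is purely illustrative.

```python
def tolerated_failures(voting_nodes: int) -> int:
    """Return how many voting nodes can fail while a majority quorum survives."""
    if voting_nodes < 1:
        raise ValueError("a cluster needs at least one voting node")
    quorum = voting_nodes // 2 + 1  # smallest strict majority
    return voting_nodes - quorum


for nodes in (1, 2, 3, 5):
    print(f"{nodes} node(s): survives {tolerated_failures(nodes)} failure(s)")
```

Under that assumption, 3 nodes is the smallest cluster size that survives the loss of a node, and a fourth node would not add any extra tolerance.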
Data is stored on a Synology DS920+ NAS with 32TB of storage (4x8TB with SHR + Btrfs). In this setup, even if one of the disks completely failed, it would continue running as if nothing happened. All data is backed up to another NAS, a Synology DS415+ with a 4-disk array and 1-disk redundancy.
Even in case of a total power outage at my home, a UPS unit would keep the cluster online for some time. I set up monitoring and I’m alerted by both email & SMS in case any of my sites go down.
But even with all these actions taken, accidents still happen and are a great way to ruin your weekend. If you’re not home or a laptop isn’t close by, SSH-ing to your cluster from a smartphone to troubleshoot your app or K8s isn’t the best experience. Not to mention that you might not even be able to do this at the time, e.g. if you’re at the park with your kids, where it’s much more important to take care of them than your K8s cluster and its workloads.
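As a rough sketch of what 1-disk redundancy costs in space: reading the 32TB figure as raw capacity (4 x 8TB) and assuming equally sized disks, the capacity of one disk goes to redundancy. The helper below is illustrative only and ignores filesystem overhead.

```python
def raw_and_usable_tb(disk_count, disk_size_tb, redundancy_disks=1):
    """Return (raw, usable) capacity in TB for an array of equally sized disks."""
    if not 0 <= redundancy_disks < disk_count:
        raise ValueError("redundancy must be smaller than the number of disks")
    raw = disk_count * disk_size_tb
    usable = (disk_count - redundancy_disks) * disk_size_tb
    return raw, usable


raw, usable = raw_and_usable_tb(4, 8)
print(f"raw: {raw}TB, usable with 1-disk redundancy: about {usable}TB")
```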
Hence, lately I’ve been thinking a lot about what would be the ideal failover and/or DR (disaster recovery) solution that would allow me to recover with minimal time & effort, or maybe even move to it completely. One solution checked all the set criteria: Cloud Run.
Cloud Build & Cloud Run: A perfect solution for K8s cluster failover & DR?
Cloud Run is a serverless platform offered by GCP (Google Cloud Platform), with an execution environment based on Knative. As such, it won’t incur any costs while it’s idle, i.e. as long as there’s no incoming traffic. It allows you to deploy and run containers without managing the underlying infrastructure, and it automatically scales your workloads to meet demand. That sounds like a perfect fit for a containerized app that’s utilizing Kubernetes & HPA (horizontal pod autoscaler) for scalability purposes.
Cloud Build, on the other hand, is a (serverless) CI/CD service on GCP which allows you to automate building, testing, and deployment of containerized apps. It’s something I couldn’t pass on, considering how seamless an experience it is to build code with Cloud Build and then (automatically) deploy it to Cloud Run.
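One practical detail when moving a container from Kubernetes to Cloud Run: the container is expected to listen on the port passed in the PORT environment variable (8080 by default). A minimal sketch using only the Python standard library follows; the handler is a hypothetical stand-in, not the actual atuf.app Flask code.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def resolve_port(environ=os.environ, default=8080):
    """Read the listening port from the PORT variable that Cloud Run injects."""
    return int(environ.get("PORT", default))


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler standing in for the real web app."""

    def do_GET(self):
        body = b"Hello from a container\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve():
    """Listen on all interfaces so that incoming traffic can reach the container."""
    server = HTTPServer(("0.0.0.0", resolve_port()), HelloHandler)
    server.serve_forever()
```

Calling `serve()` from the container’s entrypoint starts the server. A Flask app would read the same variable when starting its own server.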
Automate app deployments (GitHub to Cloud Run) with Cloud Build
Since I couldn’t find these end-to-end workflows anywhere on the internet, I created this blog post & YouTube video, hoping they could serve as an ultimate guide if you’re thinking of connecting your app hosted in a GitHub repository with Cloud Build & Cloud Run.
For those who don’t want to watch the video and are only interested in the Terraform & code part, I’ve created a reference atuf.app-deployment repository.
In the video above I go through 2 different scenarios for how to do this. For context, the app I’ll be using, atuf.app, is a Python Flask web app hosted in a GitHub repo.
Please note that for each scenario there is a list of requirements, which I go through in detail in the video above. While it may seem like a lot needs to be done for an automated procedure (especially in scenario 2), most of the steps need to be done only once, during the initial project setup.
1. Automatically deploy app hosted in GitHub to Cloud Run using cloudbuild.yaml (Cloud Build)
Requirements:
- 1. GCP project
- 2. Containerized app with repo in GitHub (or elsewhere)
- 3. Clone the atuf.app-deployment repository, which contains all the code and can be used as a reference.
- 4. Enable APIs & services:
- 5. Increase Cloud Build API quotas
If this wasn’t done, I would run into the following error in the europe-west4 region:
“Failed to trigger build: failed precondition: due to quota restrictions, cannot run builds in this region. Please contact support”
- 1. IAM & Admin > Quotas
- 2. Filter on: Cloud Build API
- 3. Filter on: Concurrent Build CPUs (Regional Public Pool) per region per build_origin (default)
- 4. Cloud Build API Limit was set to 0
- While filtered quota is selected:
- 5. Click on the “Edit Quotas” button
- 6. Fill out the request, example:
After the quota increase is requested you’ll get a confirmation email, and it may take up to 2 days for the quotas to be increased. In my case they were raised in less than a day, but your mileage may vary.
- 6. Create “cloud-sa” service account
- add the roles listed under “Define additional roles to assign to cloud-sa” in the Terraform code:
- roles/iam.serviceAccountUser – role needed for Cloud Build & Cloud Run
- roles/logging.logWriter – needed for logging
- roles/artifactregistry.admin – needed for “Authenticate with GCP Artifacts Registry” with cloudbuild.yaml to work
- roles/run.developer – needed for Cloud Run to work
- roles/run.admin – needed for “Allow public (unauthenticated) access” with cloudbuild.yaml to work
- 7. Create Artifact repository
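If you’d rather grant the cloud-sa roles listed above from a terminal, a small sketch like the following can generate the commands. It only prints `gcloud projects add-iam-policy-binding` invocations and doesn’t run anything; the project ID and service account email are placeholders, and the helper name is illustrative.

```python
import shlex

# Roles listed above for the "cloud-sa" service account.
CLOUD_SA_ROLES = [
    "roles/iam.serviceAccountUser",
    "roles/logging.logWriter",
    "roles/artifactregistry.admin",
    "roles/run.developer",
    "roles/run.admin",
]


def binding_commands(project_id, service_account_email, roles=CLOUD_SA_ROLES):
    """Build one `gcloud projects add-iam-policy-binding` command per role."""
    member = f"serviceAccount:{service_account_email}"
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role}"]
        for role in roles
    ]


# Placeholder project ID and email; substitute your own values.
for cmd in binding_commands("my-project", "cloud-sa@my-project.iam.gserviceaccount.com"):
    print(shlex.join(cmd))
```

Printing instead of executing keeps the sketch safe to run and lets you review each binding before applying it.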
2. Automatically deploy app hosted in GitHub to Cloud Run using Cloud Build (cloudbuild.yaml) & avoid any “ClickOps” with Terraform
Requirements:
- 1. GCP project
- 2. Containerized app with repo in GitHub (or elsewhere)
- 3. Clone the atuf.app-deployment repository, which contains all the code and can be used as a reference.
- 4. Enable APIs & services:
- 5. Increase Cloud Build API quotas (explained in detail in requirement 5 of the previous scenario)
If this wasn’t done, I would run into the following error in the europe-west4 region with the Terraform resource “null_resource” “build_trigger_run”:
ERROR: (gcloud.builds.triggers.run)
FAILED_PRECONDITION: due to quota restrictions, cannot run builds in this region. Please contact support
Since the google_cloudbuild_trigger Terraform resource doesn’t have the ability to automatically run the newly created Cloud Build trigger, I had to resort to some hacks: run it manually with the “gcloud” command, and then, using some clever engineering, create a polling mechanism that waits until the “atuf-tf-trigger” Cloud Build run has completed successfully and there’s a running Cloud Run service. This whole procedure is explained in detail in the video above.
- 6. Manually connect and authenticate the GitHub repository to Cloud Build Repositories, for which the Cloud Build API has to be enabled first. Unfortunately, this can’t be done automatically with Terraform, as explained in a closed terraform-provider-google bug.
- 7. Create “tf-deploy” service account with the following roles:
- Editor
- Project IAM Admin
- Service Account Admin
- Service Usage Admin
- Eventarc Event Receiver
- 8. Create and download the tf-deploy service account (json) key to:
~/.credentials/tf-deploy.json
- 9. For the project_services Terraform module to work, add the following line to ~/.zshrc or ~/.bashrc (or the config of your Unix shell of choice):
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.credentials/tf-deploy.json"
and load it by running e.g.: source ~/.zshrc
- 10. Authenticate as the “tf-deploy” service account with the “gcloud” command, so that e.g. the polling mechanism for the automated Cloud Build trigger run mentioned above works:
gcloud auth activate-service-account tf-deploy@fooctrl-312814.iam.gserviceaccount.com --key-file=/home/ahodzic/.credentials/tf-deploy.json --project=fooctrl-312814
Verify by running: gcloud auth list
- 11. (optional) Add the principal service account (tf-deploy) to the domain name, which is needed if custom domains will be mapped to Cloud Run services.
- 1. The full tf-deploy service account email needs to be added as an “owner” of the domain in Google Search Console.
- 2. Google Search Console > Settings > Users and Permissions > Add User
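A common pitfall with requirement 9 is a GOOGLE_APPLICATION_CREDENTIALS value that is unset, not exported, or that starts with a literal “~” (the shell doesn’t expand a tilde inside quotes). The following is a minimal sanity check you could run before Terraform; the function name and messages are illustrative.

```python
import os


def credentials_problem(environ=os.environ):
    """Describe what is wrong with the credentials variable, or return None if OK."""
    raw = environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    if not raw:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set or not exported"
    if raw.startswith("~"):
        # A quoted "~" reaches the program unexpanded, so the file is never found.
        return f"value starts with a literal '~' (was it quoted?): {raw}"
    if not os.path.isfile(raw):
        return f"no key file at {raw}"
    return None


print(credentials_problem() or "credentials variable looks fine")
```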
Terraform steps (referenced in atuf.app-deployment repository)
After the requirement steps have been completed, perform the following Terraform steps:
- Update terraform.tfvars with the values you want to use
terraform init
terraform plan (optional)
terraform apply -auto-approve
- Perform the steps in imports.tf to import the Cloud Run service resource originating from cloudbuild.yaml, so it can be managed by Terraform and e.g. destroyed along with the other resources.
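The polling mechanism mentioned in requirement 5 (waiting for the manually triggered build to finish) follows a generic pattern that can be sketched as below. This is not the implementation from the atuf.app-deployment repository, and it only covers the build status, not the check for a running Cloud Run service. `fetch_status` is a hypothetical callable that returns the current Cloud Build status string.

```python
import time

# Terminal states of a Cloud Build build; anything else means "keep waiting".
SUCCESS_STATES = {"SUCCESS"}
FAILURE_STATES = {"FAILURE", "INTERNAL_ERROR", "TIMEOUT", "CANCELLED", "EXPIRED"}


def wait_for_build(fetch_status, timeout_s=900, interval_s=10,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the build succeeds, fails, or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in SUCCESS_STATES:
            return status
        if status in FAILURE_STATES:
            raise RuntimeError(f"build ended with status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"build still {status} after {timeout_s}s")
        sleep(interval_s)


# Demo with canned statuses instead of a real build.
statuses = iter(["QUEUED", "WORKING", "WORKING", "SUCCESS"])
print(wait_for_build(lambda: next(statuses), interval_s=0))
```

In practice `fetch_status` could wrap a call such as `gcloud builds describe` with a format that returns only the status field. Injecting it as a callable keeps the loop itself easy to test without touching GCP.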
Happy hacking & if you found this useful, consider becoming my GitHub sponsor!