
How to automate HashiCorp Vault backup and restoration in AWS EKS with Terraform

· 10 min read
Abdulmalik

So you've moved your organization's secret management to HashiCorp Vault on Kubernetes. Everything is working well, but you're about to promote to production, and that raises a lot of questions about stability, recovery, and keeping a fully operational Vault serving your deployments.

So how do you achieve this? Since you already have a highly available (HA) Vault running in your cluster, the answer is Vault snapshots: periodically taking snapshots and storing them somewhere durable like AWS S3.
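For context, the automation built below just wraps what you would otherwise run by hand. A minimal sketch, assuming a reachable Vault (here via a port-forward) and a sufficiently privileged token exported as VAULT_TOKEN:

export VAULT_ADDR="https://localhost:8200"

# Take a Raft snapshot of the entire cluster state...
vault operator raft snapshot save vault-raft.snap

# ...and copy it to your bucket (replace the placeholder name).
aws s3 cp vault-raft.snap s3://YOUR-BUCKET-NAME/vault-raft.snap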

Prerequisites

  • A working Vault deployment in your cluster, provisioned with Terraform
  • An S3 bucket to store your snapshots

So let's jump into it.

Setting Up Authentication Between Vault and the S3 Bucket

If I had to guess, you're probably thinking you'll grab your AWS secret key and use that to authenticate Vault against S3.

Well, no, you won't be doing that. Since your goal is to eliminate secrets in config or plain-text form in the first place, it's time to set the auth process up properly.

👉 Create an S3 Policy

You need to create a new file named vault-backup.tf and add the following code.

vault-backup.tf
resource "aws_iam_policy" "vault_backup_access_policy" {
name = "VaultBackupPolicyS3"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
]
Effect = "Allow"
Resource = [
"arn:aws:s3:::YOUR BUCKET NAME",
"arn:aws:s3:::YOUR BUCKET NAME/*",
]
},
]
})
}
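After a terraform apply, a quick way to confirm the policy exists (this assumes your local AWS CLI is configured against the same account):

aws iam list-policies --scope Local \
  --query "Policies[?PolicyName=='VaultBackupPolicyS3'].Arn"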

👉 Provision an IAM Role for Service Accounts (IRSA) for Vault S3 Access in EKS

Before you create the IRSA, you need a ServiceAccount that the role you create next will trust.

vault-backup.tf
resource "kubernetes_service_account_v1" "this" {
metadata {
name = "vault-snapshotter"
namespace = "vault"
annotations = {
"eks.amazonaws.com/role-arn" = module.vault_irsa_role.iam_role_arn
}
}
# automount_service_account_token = "true"
}
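Once everything in this section is applied, a quick sanity check that the annotation points at the role ARN:

kubectl get serviceaccount vault-snapshotter -n vault \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'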

Now it's time to create the IRSA. As you can see, I'm using the IRSA-for-EKS module instead of going the route of creating roles and attaching policy documents by hand; the module makes creating an IRSA for EKS cleaner and faster.

vault-backup.tf
module "vault_irsa_role" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
version = "5.20.0"

role_name = "hashicorp-vault-snapshot"

role_policy_arns = {
policy = aws_iam_policy.vault_backup_access_policy.arn
}

oidc_providers = {
ex = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["vault:vault-snapshotter"]
}
}
}
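To confirm IRSA works end to end, you can run a throwaway pod under the ServiceAccount and ask AWS who it is; the returned ARN should be the assumed hashicorp-vault-snapshot role. A sketch (the pod name irsa-check is arbitrary):

kubectl run irsa-check -n vault --rm -i --restart=Never \
  --image=amazon/aws-cli:2.11.21 \
  --overrides='{"apiVersion": "v1", "spec": {"serviceAccountName": "vault-snapshotter"}}' \
  -- sts get-caller-identity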

That's all; the auth process for accessing the AWS S3 bucket from the Kubernetes cluster is configured.

Setting Up the Vault Kubernetes Auth Engine

There are several ways to authenticate with Vault: your root token, AppRole, GitHub, and more. You won't be using any of those here; you'll proceed with the Vault Kubernetes auth engine instead.

First, you need to create a new namespace called vault-client.

vault-backup.tf
resource "kubernetes_namespace" "vault-client" {
metadata {
name = "vault-client"
}
}

Then create a ServiceAccount in the vault-client namespace that Vault can use to authenticate within the cluster. This ServiceAccount is what you hand to Vault so it can obtain the token (JWT) it needs to carry out the actions you ask of it.

vault-backup.tf
resource "kubernetes_service_account_v1" "vault_auth" {
metadata {
name = "vault-auth"
namespace = kubernetes_namespace.vault-client
}
automount_service_account_token = "true"
}

You also need a ClusterRoleBinding that attaches the system:auth-delegator ClusterRole to this ServiceAccount, so Vault can use it to validate other ServiceAccount tokens within the cluster.

vault-backup.tf
resource "kubernetes_cluster_role_binding" "vault_auth_role_binding" {
metadata {
name = "role-tokenreview-binding"
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "ClusterRole"
name = "system:auth-delegator"
}
subject {
kind = "ServiceAccount"
name = kubernetes_service_account_v1.vault_auth.metadata[0].name
namespace = kubernetes_namespace.vault-client.id
}
}
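You can verify the binding took effect by asking the API server what the ServiceAccount is allowed to do (expect a "yes"):

kubectl auth can-i create tokenreviews \
  --as=system:serviceaccount:vault-client:vault-auth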

Next, create a Kubernetes Secret annotated with the ServiceAccount name so the cluster populates it with a service account token. You'll read it back through a data block and use it to configure Kubernetes authentication in Vault in the next step.

vault-backup.tf
resource "kubernetes_secret_v1" "vault_auth_sa" {
metadata {
name = kubernetes_service_account_v1.vault_auth.metadata[0].name
namespace = kubernetes_namespace.vault-client.id
annotations = {
"kubernetes.io/service-account.name" = kubernetes_service_account_v1.vault_auth.metadata[0].name
}
}
type = "kubernetes.io/service-account-token"

wait_for_service_account_token = true
}

The data block below makes the secrets created for the ServiceAccount accessible for the next step, configuring the Vault Kubernetes engine.

vault-backup.tf
data "kubernetes_secret_v1" "vault_auth_sa" {
metadata {
name = kubernetes_service_account_v1.vault_auth.metadata[0].name
namespace = kubernetes_namespace.vault-client.id
}
}
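If you want to double-check that the token was actually provisioned before Vault consumes it, a quick manual peek (a service-account-token secret carries ca.crt, namespace, and token keys):

kubectl describe secret vault-auth -n vault-client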

👉 Configure Vault Kubernetes Engine

So far you've created the ServiceAccount and, through the ClusterRoleBinding, given it permission to review other ServiceAccounts' tokens. Now it's time to configure authentication with Vault via the Kubernetes auth engine.

But before you proceed, you need to update your provider.tf file to include the Vault provider and your existing Vault server URL.

provider.tf
# inside your existing required_providers block
vault = {
  source  = "hashicorp/vault"
  version = "3.15.2"
}

provider "vault" {
  skip_tls_verify = true
  address         = "https://vault.YOURDOMAIN.com" # or a port-forwarded local server, e.g. https://localhost:8200
}
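Note that the provider block above doesn't set a token; the Vault provider will read one from the VAULT_TOKEN environment variable, so export a sufficiently privileged token before running Terraform:

export VAULT_TOKEN="<a privileged vault token>"
terraform init   # fetches the newly added hashicorp/vault provider
terraform plan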

Once you've added the Vault provider config, you can move on to the next step: enabling the Kubernetes auth engine and authorizing the cluster's access to Vault.

If the Kubernetes auth engine is already enabled in your Vault, terraform plan will show a modification rather than a new creation.

vault-backup.tf
resource "vault_auth_backend" "kubernetes" {
type = "kubernetes"
path = "kubernetes"
}

resource "vault_kubernetes_auth_backend_config" "config" {
backend = vault_auth_backend.kubernetes.path
kubernetes_host = module.eks.cluster_endpoint
kubernetes_ca_cert = data.kubernetes_secret_v1.vault_auth_sa.data["ca.crt"]
token_reviewer_jwt = data.kubernetes_secret_v1.vault_auth_sa.data["token"]
issuer = "api"
disable_iss_validation = "true"
}
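Once applied, you can read the config back with the Vault CLI to confirm the wiring (same address and token as above):

vault auth list                    # kubernetes/ should be listed
vault read auth/kubernetes/config  # kubernetes_host should be your EKS endpoint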

👉 Configure Vault Kubernetes Role

As you know, access to Vault resources is governed by Vault policies and Vault roles. Now you'll configure the policy that grants access to the Vault snapshot endpoints.

Create a file named vault-backup-restore.hcl and save the code below into it.

vault-backup-restore.hcl
path "sys/storage/raft/snapshot" {
capabilities = ["read", "create", "update"]
}

path "sys/storage/raft/snapshot-force" {
capabilities = ["read", "create", "update"]
}

Now you can create the Vault policy and the Vault role; the policy will be attached to the role.

vault-backup.tf
// Create a Vault policy

resource "vault_policy" "snapshot" {
  depends_on = [vault_kubernetes_auth_backend_config.config]
  name       = "vault-snapshot"
  policy     = file("./vault-backup-restore.hcl")
}

// Create a Vault Kubernetes auth role and attach the policy created above

resource "vault_kubernetes_auth_backend_role" "snapshot-role" {
  backend                          = "kubernetes"
  role_name                        = "vault-backup"
  bound_service_account_names      = [kubernetes_service_account_v1.this.metadata.0.name]
  bound_service_account_namespaces = ["vault"] # keep this bound to a specific namespace rather than all namespaces
  token_ttl                        = 43200 // 12 hours
  token_policies                   = [vault_policy.snapshot.name]
}
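After applying, a quick read-back confirms the role binds the right ServiceAccount, namespace, and policy:

vault policy read vault-snapshot
vault read auth/kubernetes/role/vault-backup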

Creating the CronJob and Job to Back Up and Restore Vault

It's time to test what you've been configuring so far. The reason for taking the long route through the configurations above is to avoid using a plain-text Vault token and AWS credentials just to back up and restore Vault snapshots.

👉 Creating and backing up the Vault snapshot

The manifest below is the CronJob that continuously creates your Vault snapshot and backs it up to your AWS S3 bucket.

You can adjust the schedule value to whatever timing works for you; the "* * * * *" below runs every minute, which is handy for testing.

You also need to replace S3 BUCKET NAME with your own bucket name, and you may not need the VAULT_CACERT, VAULT_TLSCERT, and VAULT_TLSKEY environment variables if TLS isn't enabled in your existing Vault setup.

vault-backup.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-snapshot-cronjob
  namespace: vault
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vault-snapshotter
          volumes:
            - name: tls
              secret:
                secretName: vault-tls
            - name: share
              emptyDir: {}
          containers:
            - name: backup
              image: vault:1.12.1
              imagePullPolicy: IfNotPresent
              env:
                - name: VAULT_ADDR
                  value: https://vault-active.vault.svc.cluster.local:8200
                - name: VAULT_CACERT
                  value: /vault/tls/vault.ca
                - name: VAULT_TLSCERT
                  value: /vault/tls/vault.crt
                - name: VAULT_TLSKEY
                  value: /vault/tls/vault.key
              command: ["/bin/sh", "-c"]
              args:
                - >
                  SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
                  export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login jwt=$SA_TOKEN role=vault-backup);
                  vault operator raft snapshot save /share/vault-raft.snap;
              volumeMounts:
                - name: tls
                  mountPath: "/vault/tls"
                  readOnly: true
                - name: share
                  mountPath: "/share"
            - name: snapshotupload
              image: amazon/aws-cli:2.11.21
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
              args:
                - -ec
                - |
                  until [ -f /share/vault-raft.snap ]; do sleep 5; done;
                  aws s3 cp /share/vault-raft.snap s3://S3 BUCKET NAME/vault_raft_$(date +"%Y%m%d_%H%M%S").snap;
              volumeMounts:
                - mountPath: /share
                  name: share
          restartPolicy: OnFailure

Then you can deploy it with kubectl apply -f vault-backup.yaml, or deploy it with Terraform by adding the following to your existing config.

vault-backup.tf
data "kubectl_file_documents" "cronjob-vault" {
content = file("./vault-backup.yaml")
}

resource "kubectl_manifest" "cronjob-vault" {
depends_on = [vault_kubernetes_auth_backend_role.snapshot-role]
for_each = data.kubectl_file_documents.cronjob-vault.manifests
yaml_body = each.value
}
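Rather than waiting on the schedule, you can trigger a one-off run from the CronJob and follow its logs (the job name manual-snap is arbitrary):

kubectl create job --from=cronjob/vault-snapshot-cronjob manual-snap -n vault
kubectl logs -n vault job/manual-snap -c backup
kubectl logs -n vault job/manual-snap -c snapshotupload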

If everything goes well, you should see something like this in your AWS S3 bucket.

[Screenshot: Vault snapshot files listed in the S3 bucket]

👉 Pulling the backup and restoring the Vault snapshot

If the situation arises and you need to restore your backup, due to a cluster migration or an unforeseen disaster, this Job is what you'll use for restoration.

Normally a finished Job would be left hanging around in the cluster. That's handled here with ttlSecondsAfterFinished: 1800, so the Job disappears 1800 seconds (30 minutes) after it completes; adjust it to however long you expect the restoration to take.

vault-restore.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vault-restore-job
  namespace: vault
spec:
  ttlSecondsAfterFinished: 1800
  template:
    spec:
      serviceAccountName: vault-snapshotter
      volumes:
        - name: tls
          secret:
            secretName: vault-tls
        - name: top
          emptyDir: {}
      containers:
        - name: pullvaultbackup
          image: amazon/aws-cli:2.11.21
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: top
              mountPath: /top
          command:
            - /bin/sh
          args:
            - -ec
            - |
              last_file=$(aws s3 ls s3://S3 BUCKET NAME/ | awk '{print $NF}' | tail -n1);
              aws s3 cp s3://S3 BUCKET NAME/${last_file} /top/vault-raft.snap;
        - name: restore
          image: vault:1.12.1
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: top
              mountPath: /top
            - name: tls
              mountPath: "/vault/tls"
              readOnly: true
          env:
            - name: VAULT_ADDR
              value: https://vault-active.vault.svc.cluster.local:8200
            - name: VAULT_CACERT
              value: /vault/tls/vault.ca
            - name: VAULT_TLSCERT
              value: /vault/tls/vault.crt
            - name: VAULT_TLSKEY
              value: /vault/tls/vault.key
          command:
            - /bin/sh
          args:
            - -ec
            - |
              SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
              export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login jwt=$SA_TOKEN role=vault-backup);
              until [ -f /top/vault-raft.snap ]; do sleep 5; done;
              cd /top/;
              vault operator raft snapshot restore -force vault-raft.snap;
      restartPolicy: Never

You can deploy the restore Job with kubectl apply -f vault-restore.yaml, or with the following Terraform code.

vault-backup.tf
data "kubectl_file_documents" "job-restore-vault" {
content = file("./vault-restore.yaml")
}

resource "kubectl_manifest" "job-restore-vault" {
depends_on = [vault_kubernetes_auth_backend_role.snapshot-role]
for_each = data.kubectl_file_documents.job-restore-vault.manifests
yaml_body = each.value
}
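As with the backup, you can follow the restore while it runs (container names as defined in the manifest above):

kubectl logs -n vault job/vault-restore-job -c pullvaultbackup
kubectl logs -n vault job/vault-restore-job -c restore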
note

If you already have TLS set up for your Vault, you'll also need to mount the TLS secret in both the CronJob and the Job; otherwise the deployments will fail with TLS-related errors.

I hope you've learned something useful from this post to take home for Kubernetes secret management with HashiCorp Vault OSS: reliability and damage control.

Till next time ✌️
