Eliminating Service Account keys in self-hosted Actions CI/CD environments

If you’ve been paying attention to federal guidance from the White House over the last couple of years, there’s a slice dedicated to securing the software supply chain. This shift isn’t a big shocker to anyone keen on cybersecurity or national security topics. Supply chains for physical goods are the lifeblood of a nation’s economy, and the same goes for intangible goods like software and the platforms it runs on. If you’ve been in the field for a while, I’m sure you’ve heard about the SolarWinds hack, and plenty of other attacks deserved far more publicity than they received, considering how nefarious they were.

Imagine disrupting the flow of physical goods such as eggs and grain - in cyber terms, that would be a denial-of-service attack. But it’s not the total disruption of goods and services that keeps me up at night. Other software supply chain security concerns accomplish that goal.

Think about it like someone slipping poison into a grain shipment somewhere in Iowa before it reaches shelves in Boston. The effects would be catastrophic and deadly for anyone consuming those goods. It is, in effect, a terroristic attack: not only could it maim and kill, but it also instills fear in people, restricting their freedom, their ability to innovate, and their ability to live healthy lives. Supply chain attacks on our software integration and deployment processes aim for similar outcomes. They can get even more dangerous depending on the skills and goals of the bad actors involved when you consider the targets (oil refineries [1] [2], weapons systems, and satellites).

One consistent theme I have seen in nearly every software team I’ve worked with is a lackadaisical attitude towards the Principle of Least Privilege (POLP) in CI/CD automation. I was guilty of such DevSecOps sins myself before I became aware of the sophistication and sheer volume of these attacks happening daily and how life-altering they can be for our businesses and projects.

How often have you seen a Service Account in a cloud provider given an ownership role tied to CI and CD purposes? Are the privileges to execute these jobs manually in your CI/CD platform managed through a UI or code with a change history and applied via something like GitOps? Do you have actual audit trails for who performed what and when - and how far does that trail go back? Are “bot” users sharing highly privileged accounts in these systems to save on licensing costs? Are the CI role(s) given the same permissions as the CD role(s)?

These are just a few things to get your noggin moving and bring up some important points - by no means an exhaustive list of no-nos, as my daughter would say.

To inspire you, I want to discuss a project that involved migrating from TeamCity to GitHub Actions for CI/CD. The primary focus was a complete CI migration to Actions, with some CD work for the team’s serverless deployments. Long-lived, JSON-formatted Service Account key files lived in each VM/runner performing these CI tasks, and they were shared with CD pipelines as well. This layout made permissions management easy on the GCP Service Account side - every pipeline inherited these SA permissions for numerous projects across the enterprise portfolio. The team had locked down who had access to each project as a countermeasure, but a simple fact remained: if you had permission to modify any project’s pipeline in TeamCity, you could quickly obtain ownership of nearly all GCP projects/environments used by the engineering teams. I understand and support trusting the engineering team with this kind of responsibility (DevOps!). Still, the one significant mistake was trusting the code they committed in PRs to be allowed to live in a VM containing highly privileged Service Account key files.

All it would have taken for complete and total ownage of our systems was a single bad actor slipping some dotfile malware into a library inherited through the upstream dependencies referenced in our source code (npm libraries, Go modules, Ruby gems, etc.). Plenty of open-source tools and libraries are available to detect secrets quickly. Many people reading this could write a one-liner to ship those keys off to some storage bucket somewhere, and the rest would be history.

Service Account keys should be going extinct. The rewards for attackers who steal them are just so fantastically juicy - everyone should be working to eliminate keys from their stacks and platforms whenever feasible.

This solution focuses on those using self-hosted Actions runners in GKE or a similar cloud environment that offers something like Google Cloud’s Workload Identity. Azure offers Managed Identities, and AWS has IAM roles for workloads. To boil this down: we can bind resources to a Service Account / Role and allow a workload running in a VM or Kubernetes Pod to inherit an identity and a set of permissions in our cloud provider, without using a token/key to authenticate into that role. If your team is multi-cloud, there are options to let these workloads inherit permissions across cloud providers using Workload Identity Federation in GCP, and the other significant players like Azure and AWS offer similar solutions, too.
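On GKE, for example, the binding between a Kubernetes Service Account (KSA) and a GCP Service Account looks roughly like the following sketch. The project ID, namespace, and account names here are illustrative placeholders, not values from any real environment:

```shell
# Allow the KSA "actions-runner" in namespace "ci" to impersonate a
# GCP Service Account via Workload Identity (placeholder names throughout).
gcloud iam service-accounts add-iam-policy-binding \
  ci-runner@my-project.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:my-project.svc.id.goog[ci/actions-runner]"

# Annotate the KSA so GKE knows which GCP identity Pods using it inherit.
kubectl annotate serviceaccount actions-runner \
  --namespace ci \
  iam.gke.io/gcp-service-account=ci-runner@my-project.iam.gserviceaccount.com
```

Once a runner Pod uses that KSA, its calls to GCP APIs are authorized as the bound Service Account - no key file ever touches the runner.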

A rather elegant way to do this within GitHub Actions is to leverage the runs-on label that all Workflows must utilize. This unique label tells our Actions platform “who” can execute these Actions Workflows. Creating a catalog of unique runner sets and identifying them by their labels opens up a potent paradigm for POLP and governance in self-hosted GitHub Actions. It allows us to create runner groups distributed by purpose (e.g., CI vs. CD) and by team if desired (Data vs. Engineering). These runner definitions typically live as YAML files in source control, applied via GitOps or PR flow once approved and merged.
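From a Workflow author’s perspective, picking a runner set is just a matter of the label. A minimal sketch, using a hypothetical label following a naming convention like the one described next:

```yaml
name: build
on: [pull_request]
jobs:
  test:
    # Route this job to a specific self-hosted runner set; the label is
    # hypothetical and would match one of your cataloged runner definitions.
    runs-on: ci-eng-ubuntu22-lite
    steps:
      - uses: actions/checkout@v4
      - run: make test
```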

Kustomize works well here; you can use it to create a base template that teams build on for their unique purposes and needs. If they need a new runner with eight cores and 32 GiB of RAM, they can open a PR in your Actions runner repo to enumerate this and get a +1 from your team before merging and auto-deploying the new runner set to your Actions platform. Utilizing a standard naming convention for the label, like ci-eng-ubuntu22-lite or similar, is easy to remember and quick to pick up. ARC (Actions Runner Controller) allows an array of labels per runner definition in Kubernetes, so you can also have defaults that inherit the latest and greatest: if users don’t care which distro of Ubuntu they’re handed and only about resource commitments, a simple ci-eng-lite might work, too.
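A runner definition of this shape might look like the sketch below, assuming the community ARC flavor with the actions.summerwind.dev CRDs; the org, namespace, names, and sizes are illustrative:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-eng-ubuntu22-lite
  namespace: ci
spec:
  replicas: 2
  template:
    spec:
      organization: my-org           # placeholder GitHub org
      labels:
        - ci-eng-ubuntu22-lite       # the runs-on label teams target
      serviceAccountName: actions-runner  # KSA bound to cloud IAM
      resources:
        requests:
          cpu: "8"
          memory: 32Gi
```

A Kustomize base would hold the common fields, with per-team overlays patching in labels, sizes, and the KSA.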

The wonderful part about this is that each label points to an Actions runner definition in Kubernetes, including a Kubernetes Service Account (KSA) mapping. If your cloud provider allows you to map that KSA to its identity directory (IAM) and inherit roles as a Pod: congratulations, you are now on the path to eliminating long-lived keys from your Actions Workflows and CI/CD pipelines! The next step is to create Markdown documentation in the repo holding your runner definitions, enumerating all the options available to your teams and the permissions mapped to each runner type/label you offer. There are automation opportunities in there, too: someone enterprising enough could scrape these permission grants and auto-document them!
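On GCP, that scraping step can be as small as one query per runner Service Account - a rough sketch with placeholder names, whose table output could be pasted into the runner catalog docs:

```shell
# List every IAM role granted to a runner's GCP Service Account
# (project ID and SA name are illustrative placeholders).
gcloud projects get-iam-policy my-project \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:ci-runner@my-project.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```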

The security hardening opportunities are fantastic at this point. You use the same Kubernetes cluster and underlying VMs/nodes to run both your CI and CD Actions Workflows, cutting down on overall maintenance/toil load. The label applied to each Workflow dictates what it can and cannot do. So not only do you eliminate Service Account keys/secrets, you are also removing other supply chain attack vectors along the way. If a nefarious library gets into a PR by one of your engineers and you’ve appropriately scoped the CI runner roles in IAM, the most it will likely be able to do is pull down your enterprise’s base container images from a registry and some non-production secrets, if those are being baked into images (which will need to be another article, I think, as this one is getting quite long!). No longer will that seemingly innocuous PR adding a new dependency (with a nefarious library now being pulled in) be able to delete a production Cloud DB or create 200 c5a.24xlarge instances and anonymously mine crypto on them while you foot the bill.

The next step is to set up governance over which GitHub repos can access which runner group(s), so you can ensure the right teams or projects get the runners and access they need. Generally this would be handled by something like Terraform and a GitOps flow for automation and ease of auditing - through GitHub Enterprise Actions policies and individual repo- or Org-level runner group access enumeration, depending on how you’ve laid out your GitHub Enterprise structure.
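As a sketch of that governance layer, the integrations/github Terraform provider exposes an Org-level runner group resource; the group name and repo below are illustrative:

```hcl
# Restrict a runner group to explicitly selected repositories,
# so only approved projects can target its labels.
resource "github_actions_runner_group" "ci_eng" {
  name                    = "ci-eng"
  visibility              = "selected"
  selected_repository_ids = [data.github_repository.app.repo_id]
}

data "github_repository" "app" {
  full_name = "my-org/app" # placeholder repository
}
```

PRs against this Terraform become your audit trail for who granted which project access to which runners, and when.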

There are other security-centric choices we can make to further harden the Actions runner environment as our CI/CD pipeline provider. These include moving to ephemeral runners that are deleted after each workflow completes (which has the side benefit of making our pipelines more deterministic), and hardening our workflow runtime environment with a project like Sysbox, which gives us a container runtime replacement that emulates a root-like experience for better compatibility with GitHub.com’s Azure VM runners (and public Actions like those supplied by Docker), which are handed root access at runtime. I will cover some of these topics in future articles and link back to this post when that time comes.
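In ARC terms (again assuming the actions.summerwind.dev CRDs, and that your ARC version exposes a runtime class on the runner spec - check yours), both hardening steps are a couple of lines on the runner definition; names are illustrative:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-eng-hardened
spec:
  template:
    spec:
      organization: my-org           # placeholder GitHub org
      ephemeral: true                # fresh runner per job, deleted afterward
      runtimeClassName: sysbox-runc  # requires Sysbox installed on the nodes
      labels:
        - ci-eng-hardened
```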

Please feel free to contact me if you’d like more details or if you have found any inconsistencies in this article. Hopefully, you have finished this article with better insight into how it’s possible to eliminate Service Account secrets within your self-hosted GitHub Actions CI/CD environment and the many benefits of doing so at a foundational level using automation and DevSecOps best practices.

- Kevin