Implementing Ephemeral Jenkins Masters with Kubernetes:...

In a recent blog post, we discussed the benefits of using ephemeral Jenkins instances as opposed to using a single Jenkins master with static build agents. Ephemeral Jenkins instances eliminate a single point of failure, relieve the DevOps team from having to handle every permutation of a pipeline that development teams might need, give autonomy to development teams, and keep agent scalability dynamic to reduce the chance that build queues will grow.

To stay true to the Star Wars Stormtrooper theme from the previous blog, we will focus on how to produce your own army of Jenkins build nodes on demand, but unlike the Empire, we can send them away once they serve their purpose. Specifically, I’ll talk about how ephemeral Jenkins can be implemented from a technical perspective. I’ll highlight the tools used and the code structure, as well as include some info on how to get some pieces running locally.

Ephemeral Jenkins Technical Implementation

The first question to answer is where the cluster is going to run. For this implementation, we assume that you have access to a Kubernetes (k8s) cluster. If you do not, we believe that building ephemeral Jenkins is a great gateway for introducing Kubernetes to your organization. Depending on whether you’re running your k8s cluster on a public cloud provider or self-hosted VMs, your options could include the following:

Public cloud options such as EKS from Amazon, AKS from Azure, or GKE from Google
Private cloud options. If your enterprise is not yet ready to explore public cloud, do not let that impede your use of Kubernetes:
  Self-managed platforms such as OpenShift on premises or Tanzu on vSphere
  Self-hosted VMs for bootstrapping and running upstream Kubernetes clusters using tools such as Kubespray

The choice of cloud provider does not matter to the overall solution. We have successfully implemented ephemeral Jenkins on AKS, EKS, OpenShift, and native upstream Kubernetes. The code is designed to be modular so that different providers can be easily swapped in or out. Ultimately, the choice will come down to the provider you are most familiar with and/or what technology your organization currently has or is willing to support. What’s important is that you have a robust Kubernetes cluster somewhere to begin building and testing the code. If this is an opportunity for you to begin working with public cloud providers and you would like some insight on what’s out there, be sure to check out our blog on the different public cloud offerings and let us know your thoughts.

For cluster orchestration, we’ll be using Terraform and Terragrunt, which provide several benefits. First, Terraform’s module structure makes it easy to turn certain components of a cluster on or off. Second, state is tracked in Terraform, giving you a reference point for what is happening in the cluster. Third, Terraform has both Kubernetes and Helm as native providers, so the tooling integrates nicely. Fourth, Terraform is agnostic towards cloud providers, so less code needs to change when switching from, say, EKS to GKE, or from on-prem to a public cloud. Finally, Terragrunt helps keep Terraform code DRY and makes it easy to separate modules across repos. In general, Liatrio’s approach is to stay cloud native (vendor and cloud agnostic) and closely aligned to Cloud Native Computing Foundation (CNCF) supported tooling. This gives the enterprise the ability to move workloads and change providers with minimal toil and effort.

Besides Kubernetes, Terraform, and Terragrunt, we will also be using Helm to help orchestrate everything in the cluster. Helm lets us keep all our Kubernetes manifest files as templates, so any configuration changes we would like can be quickly implemented by passing in different values. This will be critical when it comes to giving each team their own unique Jenkins instance.

Other tools will be expected to be integrated and deployed to the cluster as well. Being able to deploy Jenkins really is only the beginning. Some questions that may come up immediately are: How do we implement standardized logging or standardized tracing? How do we monitor the cluster and gather metrics? How do users access the cluster? To make the system secure, reliable, and observable, there is an additional layer of shared services that needs to be included with each cluster. This could include services like Istio or Consul for providing a service mesh, Jaeger as a tracing tool, Open Policy Agent (OPA) for policy enforcement, Keycloak for authentication and authorization, etc. Which tools you pick will depend on your situation. What’s important is that you understand that there will be more questions and more problems to solve once the initial setup is done. The Cloud Native Computing Foundation is a great source for finding some of the latest and greatest tools that the community is homing in on.

Furthermore, each tool can exist as its own Terraform module. Putting each tool in its own module ensures its reusability. Depending on your preference, you can host all of the modules in one repo or each in its own separate repo. For the purposes of this blog, we’ll keep the modules all in one repo to provide a clearer picture of the code structure.

Let's Dig In…

Even if you haven’t decided on a cloud provider, that shouldn’t stop you from getting started. For this blog, we will be using our local machines to run Kubernetes through docker-for-desktop. Running locally is an essential piece of the solution and really is part of the beauty of using Kubernetes. Part of this solution is being able to deploy the platform the same exact way locally that you would when going to a cloud provider. Assuming you can gain access to the Kubernetes API of a cluster, you ought to be able to provision the platform using the exact same Terraform code.

You’ll recall that in the previous blog, we discussed implementing an ephemeral master per product line or business unit. To demonstrate what that would look like, we’ve set up a repository on GitHub, which we’ll be using for this blog. Here we will go through deploying a couple of ephemeral Jenkins instances for a couple of products, each with its own unique configuration.

Here are the versions (at the time of this blog publication) that we are using for this case:

Terraform - v0.13.3
Terragrunt - v0.23.29
K8s client - 1.16.6
K8s server - 1.16.6
Helm - v3.2.4

The Repo Structure

All our Terraform code will live inside the same repository. This will include the resources for the cluster creation as well as the module for Jenkins, along with future modules. In the enterprise, the resources would most likely be separated across repos to allow for more access control of the resources and logical separation of duties.

Here we have broken the code down into a couple of different directories:

Modules, which will hold the resources for each tool. These are meant to represent reusable resources grouped together to allow you to define your architecture in more abstract terms, rather than having to define each individual resource each time you want to deploy.
A local folder, which will define what modules to use for our local setup. Inside the local folder is where you will find the entry point for where we define what products we are creating, as well as what other tools to include as a part of our platform.

We’ll examine the modules directory first. Inside the modules directory, you’ll find the module for Jenkins as well as some other tools. We’ve included kube-janitor and downscaler as a couple of tools to be deployed as well to show how other tools can be included in our deployment stack.

Namespacing

Each Jenkins instance will be deployed to its own namespace, providing a logical separation of resources and letting permissions be scoped per team. How you allocate namespaces, whether by line of business initially or straight to a Jenkins instance per product, should be driven by the fluency of the organization and team. The namespace resource will be created through a Terraform module separate from the Jenkins module and the cluster Terraform resources. Because we would like to set up other pieces inside each namespace and reuse the module, it's important to keep namespace creation in its own module.

main.tf file

The key thing to note here is that we are passing in our default resource limits for each pod in the namespace. The resource requests and limits are set up as variables, so for each namespace, these can be unique. Check out the variables.tf to see the defaults.
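A minimal sketch of what such a namespace module could look like, assuming the defaults are applied through a LimitRange. The resource names, variable names, and default values below are illustrative assumptions, not the contents of the demo repository.

```hcl
# Hypothetical namespace module: one namespace plus default requests/limits.
variable "name" {
  type        = string
  description = "Name of the namespace to create"
}

variable "default_cpu_request" {
  type    = string
  default = "100m" # illustrative default
}

variable "default_memory_request" {
  type    = string
  default = "128Mi" # illustrative default
}

variable "default_cpu_limit" {
  type    = string
  default = "500m" # illustrative default
}

variable "default_memory_limit" {
  type    = string
  default = "512Mi" # illustrative default
}

resource "kubernetes_namespace" "this" {
  metadata {
    name = var.name
  }
}

# Containers that do not declare their own resources inherit these values.
resource "kubernetes_limit_range" "defaults" {
  metadata {
    name      = "${var.name}-defaults"
    namespace = kubernetes_namespace.this.metadata[0].name
  }

  spec {
    limit {
      type = "Container"

      default = {
        cpu    = var.default_cpu_limit
        memory = var.default_memory_limit
      }

      default_request = {
        cpu    = var.default_cpu_request
        memory = var.default_memory_request
      }
    }
  }
}
```

Because every value is a variable, a team with heavier builds can be given larger defaults without touching the module itself.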

The Jenkins Terraform Module

Inside the main.tf file in the Jenkins module is where you’ll find the Helm release, namespace creation, and configuration resources for a Jenkins instance. We create a service account and a few different roles for each Jenkins instance. Since Jenkins will leverage the same namespace it’s in to spin up the pod agents, we need to give the ability to create and manage pods in the namespace to the service account. We also create a role for the ability to read secrets from the namespace, as reading secrets from the cluster is a way to populate the credentials in the Jenkins instance.
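The service account and roles described above might be expressed roughly as follows. This is a sketch under assumptions: the resource names, the `namespace` variable, and the exact verbs are illustrative and may differ from the module in the demo repository.

```hcl
variable "namespace" {
  type        = string
  description = "Namespace the Jenkins instance (and its agents) run in"
}

resource "kubernetes_service_account" "jenkins" {
  metadata {
    name      = "jenkins"
    namespace = var.namespace
  }
}

# Lets Jenkins create and manage agent pods in its own namespace.
resource "kubernetes_role" "agent_pods" {
  metadata {
    name      = "jenkins-agent-pods"
    namespace = var.namespace
  }

  rule {
    api_groups = [""]
    resources  = ["pods", "pods/exec", "pods/log"]
    verbs      = ["create", "delete", "get", "list", "watch"]
  }
}

# Lets Jenkins read secrets in the namespace to populate credentials.
resource "kubernetes_role" "read_secrets" {
  metadata {
    name      = "jenkins-read-secrets"
    namespace = var.namespace
  }

  rule {
    api_groups = [""]
    resources  = ["secrets"]
    verbs      = ["get", "list", "watch"]
  }
}

resource "kubernetes_role_binding" "agent_pods" {
  metadata {
    name      = "jenkins-agent-pods"
    namespace = var.namespace
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.agent_pods.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.jenkins.metadata[0].name
    namespace = var.namespace
  }
}

resource "kubernetes_role_binding" "read_secrets" {
  metadata {
    name      = "jenkins-read-secrets"
    namespace = var.namespace
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.read_secrets.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.jenkins.metadata[0].name
    namespace = var.namespace
  }
}
```

Using namespaced Roles rather than ClusterRoles keeps each Jenkins instance confined to its own namespace.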

In the Jenkins module directory, you will also see some tpl files. These are template files that Terraform uses. In our case, we are using these to create the configmaps for the Jenkins configuration. Jenkins will read the configmaps in the workspace and automatically load them into the configuration for the Jenkins instance.

Shared Libraries Configmap in main.tf for Jenkins module
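A configmap rendered from a tpl file could be declared along these lines. The sketch assumes Terraform's built-in `templatefile` function is used for rendering; the configmap name, the data key, and the `shared_library_*` variables are hypothetical, and how the chart discovers the configmap (for example through labels) depends on the chart version and is not shown.

```hcl
variable "namespace" {
  type = string
}

variable "shared_library_name" {
  type        = string
  description = "Hypothetical: name under which the shared library is registered"
}

variable "shared_library_repo" {
  type        = string
  description = "Hypothetical: Git URL of the shared library"
}

resource "kubernetes_config_map" "shared_libraries" {
  metadata {
    name      = "jenkins-shared-libraries"
    namespace = var.namespace
  }

  data = {
    # Render the template with per-instance values; the placeholders used
    # inside shared-libraries.tpl must match the keys passed here.
    "shared-libraries.yaml" = templatefile("${path.module}/shared-libraries.tpl", {
      library_name = var.shared_library_name
      library_repo = var.shared_library_repo
    })
  }
}
```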

shared-libraries.tpl for Jenkins module

The Jenkins Terraform module will be used to stand up each of the Jenkins instances. It will leverage the stable Helm chart. In addition, all the default config is already provided and controllable. The Configuration-as-Code plugin is leveraged as well, so each team can control its own configuration with easy traceability through version control.

The Jenkins Helm Chart

For Jenkins, as well as any other tool, we recommend leveraging the public charts as much as possible for a couple of reasons. First, using the public charts ensures that you’re following the same standards as the rest of the community. Trying to build your own chart has the potential of drifting from best practices and also makes it more difficult for people to approach and understand. Second, there’s no need to reinvent the wheel. Most of the stable charts already support the wide range of necessities that come with each tool, such as the deployment, service, volumes, RBAC, etc.

The Helm chart for the Jenkins master is on GitHub. The image used by this chart is the Jenkins LTS image on Docker Hub; however, you are free to leverage your own internal image. The chart comes with a values.yaml file that will use a set of default values if there are no overrides. In our case, we will be passing in overrides for the following:

Plugins
Jobs
Pod templates
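As a rough illustration of passing overrides into the chart, the Helm release could look something like the sketch below. It assumes a Helm repository has been added locally under the name `stable` and that the chart accepts a plugin list under `master.installPlugins`; value keys vary between chart versions, so check the values.yaml of the chart you actually use. The variable names are hypothetical.

```hcl
variable "instance_name" {
  type = string
}

variable "namespace" {
  type = string
}

variable "plugins" {
  type        = list(string)
  description = "Plugin IDs to install in this instance"
}

resource "helm_release" "jenkins" {
  name      = var.instance_name
  namespace = var.namespace

  # Assumes `helm repo add stable <url>` was run on this machine.
  chart = "stable/jenkins"

  # Overrides are merged on top of the chart's default values.yaml.
  # Jobs and pod templates would follow the same pattern under the keys
  # defined by the chart version in use.
  values = [
    yamlencode({
      master = {
        installPlugins = var.plugins
      }
    })
  ]
}
```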

In variables.tf in the Jenkins module, you’ll see that these have all been set up as variables to be passed in along with some default values.

variables.tf for Jenkins module
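The variables might be declared roughly as follows. The names, types, and defaults here are assumptions made for illustration and are not copied from the repository.

```hcl
variable "instance_name" {
  type        = string
  description = "Name of the Jenkins instance, typically the product name"
}

variable "plugins" {
  type        = list(string)
  description = "Plugin IDs to install"
  default     = ["kubernetes", "workflow-aggregator", "git", "configuration-as-code"]
}

variable "jobs" {
  type        = list(string)
  description = "Repositories or job definitions to seed into the instance"
  default     = []
}

variable "pod_template" {
  type        = string
  description = "Name of the agent pod template this instance should use"
  default     = "default"
}
```

Giving every variable a sensible default means a new product only has to override what makes it different.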

Next, we’ll discuss how these modules get orchestrated from inside the local directory.

The first thing to point out is the Terragrunt file in the local folder and the one at the root of the repo. The terragrunt.hcl file in the local folder takes advantage of the terragrunt.hcl at the root directory by inheriting part of its configuration. This way, we don’t have to worry about defining our Terraform state files for each module, and it lets us keep Terraform state configuration DRY. While it may not seem like a huge advantage right now, being able to let Terragrunt handle Terraform state dynamically becomes huge when implementing at a larger scale. Terraform state management itself is a huge topic with lots to consider. We won’t get too deep into it right now, but at the very least, leveraging remote state is a must. We highly recommend going through the official documentation on remote state and having a strategy for how you will store it.

local/terragrunt.hcl
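A child configuration of this kind typically looks something like the sketch below. The block label `common_vars` is an arbitrary name chosen for illustration; the file name inputs.tfvars is the one used in this post.

```hcl
# Inherit shared settings (such as remote state) from the terragrunt.hcl
# found in a parent folder.
include {
  path = find_in_parent_folders()
}

terraform {
  # Append the tfvars file to every Terraform command that accepts
  # variables, so it does not have to be typed on each run.
  extra_arguments "common_vars" {
    commands  = get_terraform_commands_that_need_vars()
    arguments = ["-var-file=${get_terragrunt_dir()}/inputs.tfvars"]
  }
}
```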

terragrunt.hcl
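The root file usually carries the state configuration that child folders inherit. The sketch below assumes a local backend for the demo; for real use you would swap in a remote backend such as an object store. It also assumes the Terraform code in each child folder declares an empty `backend "local" {}` block for Terragrunt to fill in.

```hcl
# Shared by every child terragrunt.hcl that includes this file.
remote_state {
  backend = "local"

  config = {
    # One state file per child folder, derived from its path relative to
    # this root file, so state locations never have to be written by hand.
    path = "${get_parent_terragrunt_dir()}/${path_relative_to_include()}/terraform.tfstate"
  }
}
```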

You’ll also notice that the local/terragrunt.hcl file has a block for attaching extra arguments. Terragrunt works by passing along the command you ran to Terraform. You can always write out the extra arguments each time you run the Terragrunt commands, but in this case, this saves us from having to pass in our tfvars file each time.

The inputs.tfvars file holds a list of variable values to pass along to the modules. Specifically, these variable values are meant to be for any other modules in the toolchain that are used. The ability to toggle certain tools on or off and decide which charts to use is represented in this file. They all currently accept their Chart override values from other tpl files in the toolchain directory.
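One common way to implement such a toggle is a boolean variable combined with `count`. The sketch below is an assumption about how this could be wired, with hypothetical variable names, chart reference, and values file name.

```hcl
# In inputs.tfvars the toggle would then be a single line, for example:
#   kube_janitor_enabled = false

variable "kube_janitor_enabled" {
  type    = bool
  default = false
}

variable "kube_janitor_chart" {
  type        = string
  description = "Hypothetical: chart reference or path for kube-janitor"
}

variable "toolchain_namespace" {
  type    = string
  default = "default"
}

resource "helm_release" "kube_janitor" {
  # Zero instances when disabled, one when enabled.
  count = var.kube_janitor_enabled ? 1 : 0

  name      = "kube-janitor"
  namespace = var.toolchain_namespace
  chart     = var.kube_janitor_chart

  # Chart overrides live in a tpl file next to the module.
  values = [
    templatefile("${path.module}/kube-janitor-values.tpl", {})
  ]
}
```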

The products.tf file is where each Jenkins instance gets defined. Each product is an instance of the Jenkins module and references the same module source; however, we can now pass in override values for the instance name, the job list, the plugins, and the pod template we would like to use. This way, each instance is created in exactly the same way, just with its own unique configuration. In this case, the two instances use different plugins and different pod templates.

local/products.tf
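A products file in this style might look like the following sketch. The module path, input names, plugin lists, job entries, and pod template names are all illustrative assumptions; only the product names product-a and product-b come from this walkthrough.

```hcl
module "product_a" {
  source = "../modules/jenkins" # hypothetical path to the Jenkins module

  instance_name = "product-a"
  plugins       = ["kubernetes", "workflow-aggregator", "git", "configuration-as-code"]
  jobs          = ["product-a-service"] # hypothetical job entry
  pod_template  = "maven"               # hypothetical template name
}

module "product_b" {
  source = "../modules/jenkins" # same module, different inputs

  instance_name = "product-b"
  plugins       = ["kubernetes", "workflow-aggregator", "git", "configuration-as-code", "blueocean"]
  jobs          = ["product-b-frontend"] # hypothetical job entry
  pod_template  = "nodejs"               # hypothetical template name
}
```

Onboarding another product is then a matter of adding one more module block with its own inputs.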

Though the configuration options in this blog are scoped to plugins, jobs, and pod templates, the same principles can be extended to the rest of the instance configuration. There really is no limit to what kind of configuration control you can provide. It depends on what you feel comfortable letting teams control and what you would like to keep centralized through the module, such as the shared libraries.

Validating the Jenkins Terraform Module

The first thing we will need to do is add the Helm repositories to our local cache.

To run through the demo repo locally, the entrypoint will be from the local directory. Navigate to the local directory and run terragrunt apply.

Terragrunt will pass the command and arguments to Terraform, and Terraform will then initialize the modules that will be used and pull down any plugins needed for the providers. Terraform, by default, asks for approval before applying any changes to a stack. Answering ‘yes’ will apply the modules.

Enter a value: yes

Once the Terraform run has finished successfully, you can check the running pods in the cluster.

You should see two new pods: one for product-a, the other for product-b. Each should be running in its own namespace. A basic admin user is set up on both instances with a random password. The password is stored as a secret in the respective namespace of each instance, so you can retrieve it by reading that secret with kubectl.

You can then use kubectl port-forward to access the Jenkins UI from the browser.

Adding more Tools

Now we can try and add a couple of helper tools to the cluster. We’ve included kube-janitor and downscaler as a couple of examples to demonstrate how more resources can be added to the cluster following the same module pattern.

In the inputs.tfvars file, update the enabled values for kube-janitor and downscaler from false to true. Then rerun terragrunt apply. At this moment, Terraform reads the state file, determines that most of the resources already exist and that there are now two more resources to be added, these being the Helm releases.

Enter a value: yes

What’s Next?

A natural next question is how to take this and run it on real infrastructure. A good understanding of Terraform providers will be essential for provisioning multiple clouds using the same resources. Terraform has a massive list of providers to work with.
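To make the provider idea concrete, here is a sketch of provider configuration for the local setup. It assumes the cluster is reached through a kubeconfig context; the variable names and the default context name are assumptions, so use whatever `kubectl config get-contexts` reports on your machine.

```hcl
variable "kube_config_path" {
  type    = string
  default = "~/.kube/config"
}

variable "kube_context" {
  type    = string
  default = "docker-desktop" # assumption: context name for the local cluster
}

# The modules stay the same; only this block changes per environment.
# Against a managed cluster you would typically supply host, token, and
# cluster_ca_certificate from that cloud's own Terraform provider instead.
provider "kubernetes" {
  config_path    = var.kube_config_path
  config_context = var.kube_context
}

provider "helm" {
  kubernetes {
    config_path    = var.kube_config_path
    config_context = var.kube_context
  }
}
```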

What about authentication?

In this example, we are only using basic Jenkins security. Usually, we would want to pull the user list from something like an LDAP server. This could also be configured per Jenkins instance via the Configuration-as-Code plugin, and so can be set up inside the Jenkins module as a configmap just like the plugins, jobs, and pod templates.

What about secrets?

The dilemma we have with Jenkins Configuration-as-Code is that secrets written directly into the YAML are stored as plain text. The way to get around this is to leverage the plugin’s built-in support for pulling secrets from different sources. If you aren’t already using one of the secret sources listed in the plugin's documentation, we would recommend checking out HashiCorp’s Vault. Vault is a simple yet powerful encrypted key-value store. We can store our secrets inside the Vault server, and then, by defining a few environment variables inside the Helm chart, Configuration-as-Code can tell Jenkins to reach out to Vault and substitute the secret values in. This is only one instance where you will need to manage secrets; with the move to microservices and the cloud native paradigm, there will be more and more secrets to deal with. We recommend using a tool such as Vault to help manage secrets; otherwise, this becomes a nightmare for both the platform and delivery teams.

The three environment variables needed to be set up inside the container for Vault to work are:

CASC_VAULT_PATHS, the path to the secrets on Vault
CASC_VAULT_TOKEN, the token to access Vault
CASC_VAULT_URL, the URL for Vault
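One way these variables could be handed to the chart from Terraform is sketched below. It assumes the chart exposes container environment variables under `master.containerEnv` (verify against your chart version) and that the Vault token already exists in a Kubernetes secret; the variable names and the secret key `token` are hypothetical.

```hcl
variable "vault_url" {
  type = string
}

variable "vault_paths" {
  type        = string
  description = "Path(s) to the secrets in Vault"
}

variable "vault_token_secret" {
  type        = string
  description = "Hypothetical: name of a Kubernetes secret holding the Vault token"
}

locals {
  vault_env = [
    { name = "CASC_VAULT_URL", value = var.vault_url },
    { name = "CASC_VAULT_PATHS", value = var.vault_paths },
    {
      # Read the token from a secret rather than putting it in plain text.
      name = "CASC_VAULT_TOKEN"
      valueFrom = {
        secretKeyRef = {
          name = var.vault_token_secret
          key  = "token"
        }
      }
    },
  ]

  # YAML fragment to append to the `values` list of the Jenkins helm_release.
  vault_values_yaml = yamlencode({
    master = {
      containerEnv = local.vault_env
    }
  })
}
```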

For more information on how Vault and the Configuration-as-Code plugin work together, there is great documentation on the GitHub page for the Jenkins Vault plugin.

CI for creating new Jenkins Instances

Part of being able to scale here is not having to manually run terragrunt apply each time you need to add a new instance. Having a process in place and a pipeline to handle the work for you will be essential for letting the Terraform code work for itself. A minimal pipeline would have stages for some of the following:

Running terragrunt validate to validate the configuration files
Running terragrunt plan to dry run the configuration changes
Running terragrunt apply to execute (yes, Terraform has the ability to auto-approve)

How this pipeline can then be leveraged is through a simple webhook and PR process back to the Terraform code.

Users commit code to a master repo defining the team/set name and joblist information.
A webhook or scheduled repository scan triggers Create/Update pipeline for a new Jenkins instance.
Team/set specific Jenkins instances are launched from a template using a new name and joblist information.
The Jenkins instance is available for team/set dedicated work, such as container, application, or archetype/generator building.

In a future blog, we’ll dive deeper into the onboarding process as well as address how to successfully roll out the ephemeral Jenkins to the enterprise at scale. We have successfully implemented ephemeral Jenkins at several different customers, so we have plenty of lessons learned to share and pitfalls to avoid.

Implementing Ephemeral Jenkins Masters with Kubernetes: Part 2