Feature Flags: Speed and Safety

In software development, the balance between speed and safety remains a pivotal challenge. This post explores that dynamic, highlighting how the strategic use of feature flags can revolutionize software deployment processes. By carefully integrating feature flags into the development workflow, teams can significantly increase their deployment frequency while also improving the stability and safety of their systems.

This blog not only demystifies the concept of feature flags but also provides a practical blueprint for implementing them successfully, so that speed and safety together drive innovation and efficiency in software development.

The Debate on Deployment Frequency: Risk vs. Resiliency

There is an old and recurring debate which I have run into frequently throughout my career. The debate centers on whether increasing the frequency of code deployments into a production environment is less safe or more safe. Will an organization have more issues or fewer issues as it increases the rate of change in the system? The answer is nuanced and depends on many factors, but what I have found is that, when done correctly, increasing the rate of change in the system leads to an increase in resiliency and a decrease in the frequency and severity of outages. “When done correctly” is, of course, the important part. One of the most effective techniques I have seen for ensuring that increasing deployment frequency adds safety instead of adding risk is the use of feature flags.

A feature flag, or feature toggle, is essentially a conditional which allows code paths to be enabled or disabled at runtime through a configuration tool, API, or something similar.
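As a minimal sketch of that definition in Python, the example below shows a flag as nothing more than a conditional around two code paths. The flag name `new_checkout`, the in-memory `FLAGS` dictionary, and the `checkout` function are hypothetical stand-ins for whatever configuration tool or API actually holds the flag state.

```python
# Hypothetical in-memory flag store; a real system would read this
# from a config file, database, or flag service at runtime.
FLAGS = {"new_checkout": False}


def is_enabled(flag_name: str) -> bool:
    """Return the current state of a flag, defaulting to off if unknown."""
    return FLAGS.get(flag_name, False)


def checkout(cart_total: float) -> str:
    # The feature flag is just this conditional: one branch per code path.
    if is_enabled("new_checkout"):
        return f"new checkout flow: {cart_total:.2f}"
    return f"legacy checkout flow: {cart_total:.2f}"
```

Changing `FLAGS["new_checkout"]` to `True` switches the code path without a new deployment, which is the whole point of the technique.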

The idea is far from new; people have been talking about this approach since at least the 1970s. But as the rate of change in production systems has increased over the last decade or so, the use of feature flags has proliferated substantially. When used intelligently, they can allow software teams to move significantly faster and, at the same time, provide a striking improvement in system stability and resiliency.

Visibility to users of a new feature is controlled by a feature flag.

How to Implement Feature Flags

Step 1 — Identify a small change which needs to be made and deployed rapidly.

This change could be a small bug fix, some refactored code, a small feature, or even an incremental step towards a larger feature. One type of change which is generally resistant to the use of feature flags is updating packages, such as npm packages, to a new version. I have yet to find a good approach to using feature flags for this type of change (although I haven’t given up on it yet). Typically a strategy like blue/green or canary deployments will be a better fit for this type of scenario. Almost any other type of change to the code will benefit from the use of feature flags.

Step 2 — Add a conditional in the code

This conditional should check a configuration value and then enable or disable code paths accordingly. The configuration might be stored in a simple file, in an environment variable, in a database, or in a third-party tool. Many homegrown or third-party feature flag tools have advanced functionality which allows ramping flags by percentage or toggling flags using metadata like the geography or language sent along with the request. These advanced features are extremely helpful, but not required. Storing feature flag values as booleans in a database, or in a file which can be updated at runtime, is already a great start. There are many excellent third-party feature flag configuration tools available, so I typically recommend that organizations adopt one of these instead of building one from scratch.
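A minimal sketch of the simplest storage options mentioned above: boolean flags in a JSON file that is re-read on each check, with an environment variable as an override. The `FLAG_` prefix and the function names are assumptions made for illustration, not the conventions of any particular tool.

```python
import json
import os
from pathlib import Path


def load_flags(path: Path) -> dict:
    """Read flag values from a JSON file on every call, so edits take effect at runtime."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # A missing or corrupt file should fail safe: treat every flag as off.
        return {}


def is_enabled(flag_name: str, path: Path) -> bool:
    # An environment variable such as FLAG_NEW_SEARCH=true overrides the file.
    override = os.environ.get(f"FLAG_{flag_name.upper()}")
    if override is not None:
        return override.strip().lower() in ("1", "true", "on")
    return bool(load_flags(path).get(flag_name, False))
```

Re-reading the file on every check keeps the sketch short. A busier service would more likely cache the values and refresh them periodically.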

The conditional in the code should be as close to the ingress of the request as possible. The farther down in the call stack the conditional goes, the more likely it is that the flag will require multiple conditionals spread throughout the codebase to manage it effectively. Often a single conditional farther up in the call stack, combined with better abstraction of the code being changed, is sufficient.
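One way to picture this, as a sketch with hypothetical names (`handle_quote_request`, `legacy_total_cents`, `new_total_cents`, and the `new_pricing` flag): the flag is consulted once at the request entry point, and each code path sits behind a single function boundary.

```python
from typing import Callable, Dict

FLAGS: Dict[str, bool] = {"new_pricing": False}  # hypothetical flag store


def legacy_total_cents(quantity: int) -> int:
    return quantity * 1000


def new_total_cents(quantity: int) -> int:
    # New behavior lives behind one function boundary, not scattered branches.
    unit_price = 900 if quantity >= 10 else 1000
    return quantity * unit_price


def handle_quote_request(quantity: int) -> dict:
    """Request entry point: the only place the flag is consulted."""
    total_fn: Callable[[int], int] = (
        new_total_cents if FLAGS.get("new_pricing", False) else legacy_total_cents
    )
    return {"quantity": quantity, "total_cents": total_fn(quantity)}
```

Because nothing below the entry point knows the flag exists, removing it later means deleting one conditional and one function.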

Step 3 — Test and deploy the feature flag alone!

Once the flag configuration and the conditional in the code are in place, I like to deploy immediately, before I have even begun working on the feature. Call it FFDD, Feature Flag Driven Development: write the flag before you write the code. The reason I do this is that I want to test my flag with no risk that anything else might be broken. Once I can prove the flag is working, all subsequent steps are extremely safe, because everything I do from this point on is deployed dark, behind a disabled feature flag. In addition, I have found that accidentally reversing the logic of the flag is unexpectedly common, so testing it before any code has changed is a good idea.
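A sketch of what testing the flag alone could look like. The new branch contains only a placeholder, so the checks exercise the wiring and nothing else, including the accidentally inverted conditional mentioned above. `FlagClient`, `render_banner`, and the `new_banner` flag are hypothetical names.

```python
class FlagClient:
    """Hypothetical minimal client; unknown flags default to disabled."""

    def __init__(self, flags: dict):
        self._flags = dict(flags)

    def is_enabled(self, name: str) -> bool:
        return bool(self._flags.get(name, False))

    def set(self, name: str, value: bool) -> None:
        self._flags[name] = value


def render_banner(flags: FlagClient) -> str:
    if flags.is_enabled("new_banner"):
        # Placeholder for the not-yet-written feature. Returning a distinct
        # marker makes it obvious which branch ran.
        return "new-banner-placeholder"
    return "legacy-banner"


def check_flag_wiring() -> None:
    """Checks that can run before any feature code exists."""
    flags = FlagClient({"new_banner": False})
    assert render_banner(flags) == "legacy-banner"  # off means old path
    flags.set("new_banner", True)
    assert render_banner(flags) == "new-banner-placeholder"  # on means new path
    # A flag missing from configuration must fall back to the old path.
    assert render_banner(FlagClient({})) == "legacy-banner"
```

If the conditional were written backwards, the first assertion in `check_flag_wiring` would fail before any real feature code was at risk.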

I also like to bake simple monitoring into my feature flags. Something like counters reported to an APM system or simple logging can help debug issues with the flag itself. I have even built alerts into systems to notify developers if a flag changes state or if unexpected traffic is hitting a code path in spite of a flag which should have disabled it. These aren’t necessary for simple flags, but as teams start using more and more flags and the flags begin to interact, some basic monitoring solves many of the issues which can arise. I personally like to have an abstracted feature flag layer, a code package or service which handles these shared feature-flag related concerns. It is a fairly small investment that really pays off in the long term.
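As an illustration of that shared layer, here is a small sketch that counts flag evaluations and logs state changes. The `MonitoredFlags` class is hypothetical; the in-process `Counter` stands in for counters reported to an APM system, and the log line stands in for an alert.

```python
import logging
from collections import Counter

logger = logging.getLogger("feature_flags")


class MonitoredFlags:
    """Hypothetical shared flag layer that counts evaluations and logs state changes."""

    def __init__(self, flags: dict):
        self._flags = dict(flags)
        self.evaluations = Counter()  # stand-in for counters sent to an APM system

    def is_enabled(self, name: str) -> bool:
        enabled = bool(self._flags.get(name, False))
        self.evaluations[(name, enabled)] += 1
        return enabled

    def set(self, name: str, value: bool) -> None:
        previous = self._flags.get(name, False)
        self._flags[name] = value
        if previous != value:
            # A real system might raise an alert here instead of only logging.
            logger.warning("flag %s changed: %s -> %s", name, previous, value)
```

Counting evaluations per flag and per state is also what makes it possible to notice traffic reaching a code path that a flag should have disabled.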

Step 4 — Start working on the change, deploying the change dark as often as possible

Once I have the feature flag configuration and the conditional in place and they have been deployed to production, I am free to move as quickly as I want with my code change. As long as the flag is disabling my new code path (and I tested that thoroughly in step 3), I can deploy to any environment whenever I want. This greatly helps with safe continuous integration, because my change will deploy dark in each environment, and stay dark until I enable the flag. Developers are more likely to integrate frequently and more likely to deploy their changes all the way to production if they know their change is safe, and these are behaviors that should be encouraged. I personally like to deploy all the way to production every time I merge my code into main, and if I don’t merge my code into main multiple times a day I start to get nervous. On a typical working day, I’d expect to personally do several dark deployments to production.

Step 5 — When the change is ready, route traffic to it by enabling the flag

Typically, as soon as the change is complete enough to be testable, it will be enabled on a small scale in a test environment. During the development process I generally like to see the flag enabled in any integration environments (even if it isn’t totally working yet), disabled in any staging environments, and disabled in production. As soon as the change is stable enough to be usable, I like to enable it in the staging environment so people can start using it. In more advanced feature flag systems, it is possible to route traffic based on user IDs or groups, so it is possible to have testers and product owners seeing the feature without breaking anyone else in the environment, which can be extremely helpful.
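The per-environment pattern and user targeting described above could be sketched like this. The environment names, the user IDs, and the `ALLOWED_USERS` allow-list are assumptions for illustration only.

```python
# Hypothetical per-environment defaults, following the pattern described above.
ENVIRONMENT_DEFAULTS = {
    "integration": True,
    "staging": False,
    "production": False,
}

# Hypothetical allow-list of users who see the feature even where it is off.
ALLOWED_USERS = {"staging": {"tester-1", "product-owner-7"}}


def is_enabled(environment: str, user_id: str) -> bool:
    if user_id in ALLOWED_USERS.get(environment, set()):
        return True
    # Unknown environments fail safe to disabled.
    return ENVIRONMENT_DEFAULTS.get(environment, False)
```

With this shape, a tester in staging sees the new code path while every other staging user stays on the old one.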

When the change is complete and passing all the tests (which could take hours, days, even months if it is a large feature), it becomes a product team decision to enable the new code in production. In some cases, such as a small bug or some refactored code, this might be as simple as the developers deciding that everything looks good and enabling the flag. In more complicated cases it might be up to the business to schedule enabling the flag to coincide with a marketing campaign, press release, promotion, conference, etc. I like to analyze what the process should be case by case when we initially set up the flag. It is a sort of “acceptance criteria” for the flag.

Who needs to sign off on this?
Who decides when the flag is enabled?
Who needs to be present when we enable the flag?
Is the flag a simple on/off boolean or can we ramp traffic to it by percentage, and if we can do it by percentage, what is the plan?
How will we know if something went wrong, and what is the plan for reverting the state of the flag?

These are all good questions to ask, and I usually do it when the flag is initially created, before I’ve even started working on the code.
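One possible way to keep those answers next to the flag is a small record created along with it. This is only a sketch: `FlagAcceptanceCriteria` and its field names are hypothetical, chosen to mirror the questions above.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class FlagAcceptanceCriteria:
    """Hypothetical record of the answers agreed on when a flag is created."""

    flag_name: str
    # Who needs to sign off on this?
    sign_off: List[str] = field(default_factory=list)
    # Who decides when the flag is enabled?
    decision_owner: str = ""
    # Who needs to be present when the flag is enabled?
    present_at_enable: List[str] = field(default_factory=list)
    # Ramp steps in percent; [100] would mean a plain on/off flag.
    ramp_plan_percent: List[int] = field(default_factory=list)
    # How will we know if something went wrong?
    failure_signals: List[str] = field(default_factory=list)
    # What is the plan for reverting the state of the flag?
    revert_plan: str = ""

    def unanswered(self) -> List[str]:
        """Names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A team could call `unanswered()` in a review checklist to see which questions are still open before the flag is enabled in production.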

More advanced feature flag systems allow the flags to be gradually enabled, either routing a percentage of traffic to the new code, or else using metadata like the user ID, the user’s geography or language, or some other value (or combination of values) to decide which code path the request follows. These advanced solutions are extremely valuable. As an example, a scenario I’ve seen commonly is that a developer refactors some code behind a feature flag, and then slowly begins to route traffic in production to the new code. At some point the code starts to exhibit performance problems, failures due to race conditions, or some other load-related problem. Since only a subset of requests is routed to the new code, most users do not see the issue, and it is simple to ramp the flag back down until the new code can handle the load, or else disable the flag completely. What would have been a disastrous scenario without the flag, affecting all users and requiring a potentially time-consuming rollback of the deployment, is instead a small problem which affects only some users and is resolved with a quick and simple change to a configuration setting.
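A common way to implement percentage routing is to hash a stable identifier into a fixed number of buckets. The sketch below assumes a user ID is available on each request; the function names are hypothetical and do not describe any specific flag product.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    # Users whose bucket is below the rollout percentage get the new code path.
    return bucket(flag_name, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and user ID, a given user stays on the same side of the flag between requests. Ramping from 10 to 50 percent only adds users, and ramping back down removes only the most recently added ones.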

Step 6 — Clean up

Once all requests are using the new code, i.e. the feature flag is turned on or ramped up to 100%, the conditional and the config settings can be safely removed. Feature flags are a form of intentional technical debt. The flags involve temporary code and configuration. They add complexity and need to be removed once they are no longer in use. I like to approach this by making two user stories each time I need to add a feature flag: one to make and deploy the flag, and one to remove it. The story for flag removal goes into the backlog to be brought in at a future date. Some teams might pull these cleanup stories in whenever a flag is no longer needed. Others might pull in a specific number of cleanup stories each sprint. I have even seen some teams designate a “tech debt” sprint periodically and use some of it to clean up unused feature flags.

However it is managed, when the flag has been enabled for long enough that everyone feels confident it won’t need to be toggled again, the conditional in the code and the configuration for the flag can be removed. It is usually also possible to remove the entire leftover code path, the old code which was switched off when the flag was enabled, and it is generally easy and safe to do so at the same time as removing the flag. I have found this to be a great way to encourage developers to clean up unused code. Since they are in the code removing the flag anyway, it is a very easy lift, and a natural inclination, for them to remove the unused code at the same time. If they designed their change with good enough abstraction to enable a simple feature flag, it is also probably abstracted well enough to easily remove.
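A small before-and-after sketch of that cleanup, using a hypothetical `new_greeting` flag. The two functions are shown side by side only for comparison. In a real codebase the second would replace the first.

```python
FLAGS = {"new_greeting": True}  # hypothetical flag, now enabled for everyone


def greet_with_flag(name: str) -> str:
    """Before cleanup: both code paths and the conditional still exist."""
    if FLAGS.get("new_greeting", False):
        return f"Welcome back, {name}!"
    return f"Hello {name}"  # old path, dead once the flag stays on


def greet(name: str) -> str:
    """After cleanup: the conditional, the flag lookup, and the old path are gone."""
    return f"Welcome back, {name}!"
```

Once `greet` is in place, the `new_greeting` entry can be deleted from the flag configuration as well.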

Implementation of a Feature Flag System

Here is a simple implementation of a feature flag system. It includes flag configuration (hardcoded for simplicity), a feature flag client, and the code which is to be changed behind the flag.

View GitHub Gist
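The gist itself is not reproduced in this text. As a rough stand-in, and not the author's code, the sketch below has the same three parts described above: hardcoded flag configuration, a feature flag client, and the code being changed behind the flag. Every name in it (`FLAG_CONFIG`, `FeatureFlagClient`, `calculate_tax`, and the `use_new_tax_calculation` flag) is hypothetical.

```python
# 1. Flag configuration (hardcoded for simplicity).
FLAG_CONFIG = {
    "use_new_tax_calculation": {"enabled": False},
}


# 2. Feature flag client.
class FeatureFlagClient:
    def __init__(self, config: dict):
        self._config = config

    def is_enabled(self, flag_name: str) -> bool:
        # Unknown flags are treated as disabled.
        entry = self._config.get(flag_name)
        return bool(entry and entry.get("enabled", False))


# 3. The code being changed behind the flag.
def legacy_tax(amount_cents: int) -> int:
    # Old behavior: 5% tax, truncated to whole cents.
    return amount_cents * 5 // 100


def new_tax(amount_cents: int) -> int:
    # New behavior: 5% tax, rounded half up to the nearest cent.
    return (amount_cents * 5 + 50) // 100


def calculate_tax(amount_cents: int, flags: FeatureFlagClient) -> int:
    if flags.is_enabled("use_new_tax_calculation"):
        return new_tax(amount_cents)
    return legacy_tax(amount_cents)
```

With the flag disabled, `calculate_tax` keeps the old truncating behavior. Setting `"enabled"` to `True` in `FLAG_CONFIG` routes the same call to `new_tax` without touching the calling code.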

Conclusion

Feature flags can be an extremely effective tool for accelerating delivery for development teams, while at the same time dramatically improving the stability and resiliency of the system. At a previous company, I saw one small team, which had been doing a single large production deployment every two weeks, begin deploying 10-15 times per day on average. Their record while I was with the team was 64 deployments to production in a single day, with no outages or negative impact on performance. In fact, by the time I left the team, they had not caused a single production outage with a deployment in many years. The key difference which enabled this massive acceleration, and the accompanying increase in stability and resiliency, was effective use of feature flags.

At Liatrio, we don't just advocate for the implementation of these strategies; we actively collaborate with your teams to integrate them seamlessly into your workflows. Our experts guide you through each step, from identifying the right opportunities for feature flags to ensuring their effective deployment and management, all while prioritizing the safety and speed of software delivery.

About The Author

Geoff started writing software professionally in 2001 and has seen a little bit of everything during that time. His passion is for enabling software teams to move as quickly, safely, and sustainably as possible. Ideas like agile software development, CI/CD, DevOps, and platform engineering have been his primary focus throughout his career. Anything that gets software teams working smarter, more efficiently and effectively, and more sustainably excites him.

https://www.linkedin.com/in/geoff-rayback/