|Authors||Nimisha Asthagiri <firstname.lastname@example.org>|
|Arbiter||Eric Fischer <email@example.com>|
|Review Period||2018-02-14 - 2018-03-19|
A feature toggle is a software development technique that decouples deployment of code from release (enablement) of the code. When used with care, it can be a powerful tool in a continuous deployment environment for deploying changes incrementally to the main codebase while they are still under development. Additionally, it allows a team to reduce the risk of breaking production systems by providing levers to progressively test changes and quickly disable changes if needed. For a large platform, it is also used to selectively enable a change/feature for certain users or certain deployments.
However, feature toggles inherently add complexity to the code since they introduce multiple paths and configuration operations, with the potential to create an explosion in testing permutations and to create forgotten latent unused code paths. A thorough understanding of their implications and best practices is warranted to use them in a large-scale system.
Given the many benefits of feature toggles, edX development teams have used them from the very start. Additionally, since deployment latency for edx.org has dropped from monthly to weekly to daily, and since teams have been bitten too many times by long-term feature branches, teams now have a greater incentive to develop incrementally within the main branch. Updating legacy edX user experiences and features also drives the need for a framework in which usability and behavior changes are released gradually to the edX user base.
Aligning on a common best practice for feature toggling is critical for both the short-term and long-term health of the system. Using different standards, strategies, testing procedures, etc., leads to confusion, production failures, and long-term maintenance issues.
At their core, feature toggles allow teams to deploy alternative code paths and to choose between them at runtime. There are multiple scenarios where this capability comes in handy. The following section enumerates the use cases that are typically relevant for edX features.
Note: A given feature/change may fall under multiple use cases during its lifetime. To support this, a corresponding feature toggle may transition through different use cases.
When introducing new features in the platform, teams may want to submit incremental changes without exposing unfinished work. They can do so with a temporary toggle that gates a single (or very few) high-level entry point(s) to the feature changes.
The desire for a release toggle may come from either the engineering team (as a way to incrementally implement the changes) or through the product team (as a way to temporarily hide user-facing changes of a large feature).
Note: Consider alternatives to using release toggles. Specifically, think iteratively and not just incrementally. That is, instead of having a grand toggle to unlock many changes at once, consider breaking up your feature into iterative verticals that can be released (enabled) a bit at a time. See Release Toggles Are The Last Thing You Should Do.
There may be a business case, advocated by the product team, to use a toggle to expose a new feature on a specific grand opening date. However, for a confident unveiling, this use case should be combined with the Ops and/or Beta Testing scenarios.
Dynamically controlling feature toggles, without needing to re-deploy an application, comes in very handy when considering the operational requirements for uptime metrics.
This use case is usually driven by the engineering team.
As teams balance the needs for rapid agile development while continuously deploying to a large-scale system with 99.99% uptime requirements, they need the ability to test new changes in production while having the ability to revert quickly. That is, moving rapidly and taking risks can decrease Mean Time to Failure (MTTF), which needs to be counterbalanced with the ability to reduce Mean Time to Recovery (MTTR).
When a team is concerned about potential performance or scalability issues with an upcoming change, gating the change behind a toggle allows the team to:
Once the team is confident about their change and the change is released to all users, they would safely remove the gating toggle.
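A monitored rollout of this kind can be sketched as a percentage gate in front of the new code path. The sketch below is illustrative only: `in_rollout`, `render_dashboard`, and the toggle name `dashboard.new_renderer` are hypothetical, and the deterministic hashing shown here is a stand-in technique, not a description of how Waffle selects users.

```python
import hashlib


def in_rollout(toggle_name: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets."""
    # Hash the toggle name together with the user id so that each toggle
    # selects a different, but stable, subset of users.
    key = f"{toggle_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # 0..9999, i.e. 0.01% granularity
    return bucket < percent * 100


def render_dashboard(user_id: str, percent: float) -> str:
    # The toggle gates a single high-level entry point to the new code path.
    if in_rollout("dashboard.new_renderer", user_id, percent):
        return "new"
    return "old"
```

Because each user maps to a fixed bucket, raising the percentage only adds users to the enabled set, and setting it to 0 reverts everyone to the old path without a redeploy.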
In certain cases, the development team (in consultation with the operations team) may choose to extend the lifetime of an Ops toggle in the codebase even after releasing its gated feature. A small number of such long-lived Ops “kill switches” give operators dynamic controls to gracefully degrade the system under high load. Operators can use these circuit-breaker capabilities either preemptively, in anticipation of a high-demand event, or reactively, in response to an unanticipated high load or attack.
Typically, long-lived Ops toggles are useful for gating non-critical features that are very expensive on system resources. However, the long-term costs of maintaining the added complexity in the code should be measured against the benefits of operationally degrading the service when needed.
For user-facing changes, the engineering and product teams may choose to release them to a specific subset of the population before releasing to the rest. This is in contrast to the Ops - Monitored Rollout case where changes are rolled out to a random subset of users.
In the edX case, the Beta testing program may include the following types of population subsets:
The feature toggle is useful for the duration of the Beta testing period and is removed afterward.
There are sometimes business requirements for keeping long-term feature toggles in order to expose or limit certain features to certain groups.
The business may choose to modify the product experience for different classes of users. For example, the state of a feature toggle may depend on whether the user is a paying customer or is accessing the platform through a white label site.
In an extreme case, the business may choose to keep a feature disabled for a certain group (e.g., for a course or for an organization) in order to appease concerns about the change. However, as this introduces a roadblock to removing a toggle and its corresponding complexity, further effort should be made to tweak the feature to accommodate the group’s concerns and/or to make the group more comfortable with the change.
When a team implements a feature that they do not expect to be adopted by all Open edX instances, they may introduce a toggle to gate the feature. However, since there is a large cost to supporting long-term toggles, the following alternatives should be considered:
Note: Remember that feature toggles are not a substitute for clean architecture and SOLID design principles. Any long-term feature toggle should be carefully considered along with architectural patterns such as plugins, dependency injections, separable services and libraries with clear interfaces. Sometimes the need for a toggle can be completely eliminated. Other times the toggle may still need to exist but with much less complexity.
Note that we are excluding experiment toggles from this list of use cases. Experiment toggles are used to perform multivariate (A/B) testing in order to generate statistically significant results to make data-driven optimizations and feature changes. Users are placed in different experimentation groups that are associated with different code paths. The effectiveness of each code path is then evaluated by measuring its impact on users’ aggregate behavior.
This is a deeper topic that is worth exploring in a separate OEP (see Optimizely Tips and Tricks). For now, suffice it to say that edX uses an external A/B testing platform (Optimizely) to serve this purpose. Among other things, Optimizely supports user segmentation and targeting, data aggregation capabilities, statistical tools, and toggled code customizations. At this time, Optimizely is used for edX experimentation by customizing edX code without merging any changes into the edX codebase. This also contrasts with the use cases that are in scope of this OEP.
Note: Although experiment toggles are a specific toggle type that is out of scope, other use cases in this OEP may still be useful when implementing an experiment.
The following diagram summarizes the various use cases along 2 axes: feature maturity and longevity. Feature maturity corresponds to the level of certainty that the team has about the feature, including unexpected side-effects such as performance and user-behavior regressions. Longevity depicts the lifetime of the feature toggle and how long-lived it is expected to be.
The diagram also labels which use cases are primarily driven by engineering teams (E) and/or business product teams (B).
A feature toggle may transition through use cases as its corresponding feature matures. As illustrated in the following example, a toggle may start in an Incremental Release phase as the feature/change is being developed. Once it is ready for Beta Testing, it may be gradually released to individual users before exposing it to a few courses in the Beta program. Once the feature is further matured, it can be fully enabled, but may require select courses to Opt-out temporarily. Lastly, the feature toggle may be used to provide an Open edX Option for a single Open edX release before it is finally retired and removed.
edX teams should use a common framework to implement feature toggles and should follow best practices to test them and remove them. Before deciding to use a feature toggle, the engineering team, in collaboration with the product team, needs to decide on the release and development paths that the feature will take so they can choose the right toggle type(s).
The following set of questions can help you determine the set of use cases required for a feature, as well as the required toggle type and its required duration. Answer each of the following questions and make a list of all use cases associated with an affirmative response, taking the “maximum” toggle type and “maximum” toggle durations.
The ranges of toggle types and toggle durations are:
|#||Question to ask||Team to ask||Use Case, Toggle Type, Duration|
|1||Is this a hypothesis-driven change that needs to be validated via an A/B testing framework?||Business and Engineering||Read Optimizely Tips and Tricks instead of this OEP.|
|2||Is the feature being developed incrementally and needs to be hidden while it is unfinished? If so, are you sure the development of the feature cannot be redesigned so it can be released in a more optimal iterative fashion instead?||Business and Engineering||
|3||Are there any operational concerns, such as unanticipated performance, scalability, or functional regressions, which must be confirmed in the production environment?||Engineering||
|4||Are there any user-facing changes for which you would like to receive feedback from select users or groups before releasing to everyone? Or are there any groups that want early access to the changes before they are officially rolled out?||Business and Engineering||
|5||Is there a specific big grand opening date for this feature? If so, is it really necessary for it to remain hidden until that time?||Business||
|6||Are there any specific groups that are adamant about opting out of the feature? If so, are we unable to convince them to adopt the feature by the time it is rolled out to the rest of the users?||Business||
|7||Will other Open edX instances want to control the availability of this feature? If so, are you sure other implementation alternatives, such as pluggability, are not possible for this feature?||Business and Engineering||
|8||Is there a long-term business requirement to expose or limit the availability of this feature to select groups, such as paid users or users accessing through a white-label site?||Business||
|9||Is this an expensive but non-vital functionality that would be useful to disable gracefully in a future event of high load or attack? If so, does the availability of the control outweigh the costs of maintaining the toggle?||Engineering||
The recommendation is to create a common edX framework on top of Django Waffle. Waffle provides a simple and intuitive API to dynamically configure toggles in a continuously deployed system, with toggles stored in a generic relational table. Waffle’s built-in capabilities satisfy some, but not all, of our Requirements.
For long-term sustainability and operational success, a feature toggle framework should have the capabilities listed in the following table. For each requirement that is not supported by Waffle, further information is provided in the subsequent Details section.
|#||Requirement||Description||Supported by Waffle|
|1||Dynamic||It should be easy to enable or disable a toggle without deploying new code.||
|2||Self-serve||Individual teams should be able to control the values of their own feature toggles.||
|3||Removability||It should be relatively easy to remove a toggle from the system to encourage teams to do so.||
|4||Testability||It should be possible to test the different toggle states in the code even when they are not enabled.|
|5||Auditability||Operators and teams should be able to tell the who, what, and when of toggle changes.||
|6||Performance||The value of a toggle should be cached so it is not repeatedly retrieved from storage.||
|7||Toggle Types||The 3 necessary toggle types are supported and easy to use by edX developers.|
|8||Non-collision||Feature toggles created by independent teams should not collide with each other. See Financial disaster caused by repurposing a feature flag for a scary anecdote.||
|9||Multi-tenancy||As edX uses Django Sites for multi-tenancy, there should be a way for any site to override the value of any feature toggle.||
|10||Least Privilege||As different toggles may have varying impact on the business, operators may want to limit who can edit certain toggles.||
|11||Discoverability||There should be a way for an operator to discover all available feature toggles in the system.||
|12||Report||There should be an administrative interface to retrieve information and status of existing toggles (e.g., description, type, dates).||
|13||Distributed||There should be administrative supporting tools to manage feature toggles across distributed service boundaries.||
The framework, currently started in the waffle_utils app in edx-platform, is a viable starting point for addressing the Requirements. It already has basic support for Requirements 1-8. Details below describe what would be needed for the remaining requirements.
The framework provides the following classes for the required toggle types:
Eventually, the following classes should be added if/when needed:
The waffle_utils classes require namespaces. The namespace should be unique to each Django app so it doesn’t collide with other installed apps in the system.
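The effect of namespacing can be illustrated with a minimal, self-contained sketch. `ToggleNamespace` below is a hypothetical stand-in written for this example and is not the waffle_utils API; the app names and the toggle name `enable_async` are likewise assumptions.

```python
class ToggleNamespace:
    """Hypothetical stand-in: prefixes every toggle name with its app's namespace."""

    _claimed = set()  # namespaces already claimed in this process

    def __init__(self, name: str):
        # Refuse to hand the same namespace to two different apps.
        if name in ToggleNamespace._claimed:
            raise ValueError(f"namespace {name!r} is already in use")
        ToggleNamespace._claimed.add(name)
        self.name = name

    def qualified(self, toggle_name: str) -> str:
        # The stored key carries the namespace, so identical short names
        # chosen by independent teams map to different keys.
        return f"{self.name}.{toggle_name}"


grades = ToggleNamespace("grades")
certificates = ToggleNamespace("certificates")

grades_key = grades.qualified("enable_async")
certificates_key = certificates.qualified("enable_async")
```

Here both apps pick the short name `enable_async`, yet the stored keys (`grades.enable_async` and `certificates.enable_async`) remain distinct.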
In order to allow White Label sites to override feature toggles, the framework needs to integrate with the edX Site Configuration feature. When a caller requests the value of a feature toggle, the framework should first check if there’s an override for the current site and return it instead.
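The lookup order described above might look like the following sketch. It assumes, purely for illustration, that stored toggle values and per-site overrides are available as plain mappings; `toggle_value`, its parameters, and the site and toggle names are hypothetical and do not reflect the Site Configuration API.

```python
from typing import Mapping, Optional


def toggle_value(
    name: str,
    stored_values: Mapping[str, bool],
    site_overrides: Mapping[str, Mapping[str, bool]],
    current_site: Optional[str],
    default: bool = False,
) -> bool:
    """Resolve a toggle, letting the current site's override win if present."""
    overrides = site_overrides.get(current_site, {}) if current_site else {}
    if name in overrides:
        return overrides[name]
    # No override for this site: fall back to the platform-wide value.
    return stored_values.get(name, default)


stored = {"grades.enable_async": True}
overrides = {"whitelabel.example.com": {"grades.enable_async": False}}

on_main_site = toggle_value("grades.enable_async", stored, overrides, "edx.example.com")
on_white_label = toggle_value("grades.enable_async", stored, overrides, "whitelabel.example.com")
```

Checking the override first keeps the rule simple for callers: they ask for a toggle by name and never need to know whether a site-specific value exists.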
If business-sensitive toggles are used that need to have limited access, the framework should be extended to support fine-grained write access to feature toggles. One possibility is to add a new “group access” field with each toggle and update the Django admin interface to enforce access.
The framework needs to be able to discover all waffle_utils classes declared in all installed Django apps in the system. Initially, the discoverability can be scoped to within each microservice, but ultimately accessed via a centralized tool across all distributed services.
To support this, the framework can make use of the Django App Plugin design pattern and search for waffle_utils classes declared in all installed apps. This requires that every app that uses waffle_utils either declares its usages in a standard module (e.g., config.py) or configures the module's location (in its apps.py module).
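As a rough sketch of the discovery step, the framework could scan each app's standard module for module-level toggle declarations. Everything named below is hypothetical: `Toggle` stands in for the waffle_utils classes, and the two modules are simulated in memory rather than imported from real installed apps.

```python
import types
from dataclasses import dataclass


@dataclass(frozen=True)
class Toggle:
    """Hypothetical toggle declaration; stands in for a waffle_utils class."""
    namespace: str
    name: str


def discover_toggles(modules):
    """Collect every Toggle instance declared at module level."""
    found = []
    for module in modules:
        for value in vars(module).values():
            if isinstance(value, Toggle):
                found.append(value)
    return sorted(found, key=lambda t: (t.namespace, t.name))


# Simulate the standard `config.py` modules of two installed apps.
grades_config = types.ModuleType("grades.config")
grades_config.ENABLE_ASYNC = Toggle("grades", "enable_async")

discussions_config = types.ModuleType("discussions.config")
discussions_config.NEW_INLINE_VIEW = Toggle("discussions", "new_inline_view")
discussions_config.UNRELATED_SETTING = 42  # ignored: not a Toggle

toggles = discover_toggles([grades_config, discussions_config])
```

In a real deployment the module list would presumably come from importing the agreed-upon module of each installed app; that step is omitted here so the sketch stays self-contained.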
In order to provide a useful and informative administrative report of the existing feature toggles in the system, the framework needs to be able to present the following information for each toggle.
|Report data||Purpose||Data source|
|Description||Brief human-readable information about its usage and context.||In code, by developer|
|Feature Category||Optional field to group interdependent toggles.||In code, by developer|
|All Use Cases||Lists one or more Use Cases to specify all expected usages of this toggle.||In code, by developer|
|Current Use Case(s)||A subset of “All Use Cases” to specify the current Use Cases of this toggle.||In code, by developer; optionally editable via admin interface.|
|Toggle Type||One of Switch, Rollout, or Group to further clarify the toggle’s usage.||In code, by developer|
|Created in Code Date||Required field to specify the date the toggle was added to the codebase; to easily find all stale toggles.||In code, by developer|
|Expiration Date||Optional field to specify target date of removal; to easily find all expired toggles.||In code, by developer|
|Current Setting(s)||Summary of the current configuration and value of the feature toggle; to easily evaluate its readiness to transition or retire.||Derived from relational tables|
|First Modified Time||Date the toggle was first set in the system; to get the starting date of its use.||Derived from relational tables|
|Last Modified Time||Date the toggle was last set/unset in the system; to easily find all unused toggles.||Derived from relational tables|
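The developer-supplied fields in the table above could be captured as structured metadata next to each toggle declaration. The following is a minimal sketch under that assumption; `ToggleMetadata`, `expired_toggles`, `stale_toggles`, and the sample toggles and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToggleMetadata:
    name: str
    description: str
    toggle_type: str                      # "Switch", "Rollout", or "Group"
    all_use_cases: Tuple[str, ...]
    created_in_code: date                 # required, to find stale toggles
    expiration: Optional[date] = None     # optional target removal date
    feature_category: Optional[str] = None


def expired_toggles(toggles, today):
    """Names of toggles whose target removal date has passed."""
    return [t.name for t in toggles if t.expiration is not None and t.expiration < today]


def stale_toggles(toggles, today, max_age_days):
    """Names of toggles that have lived in the codebase longer than max_age_days."""
    return [t.name for t in toggles if (today - t.created_in_code).days > max_age_days]


inventory = [
    ToggleMetadata(
        name="grades.enable_async",
        description="Gates asynchronous grade computation.",
        toggle_type="Switch",
        all_use_cases=("Incremental Release",),
        created_in_code=date(2018, 1, 10),
        expiration=date(2018, 3, 1),
    ),
    ToggleMetadata(
        name="discussions.new_inline_view",
        description="Gates the redesigned inline discussion view.",
        toggle_type="Group",
        all_use_cases=("Beta Testing", "Opt-out"),
        created_in_code=date(2017, 6, 1),
    ),
]

today = date(2018, 4, 1)
expired = expired_toggles(inventory, today)
stale = stale_toggles(inventory, today, max_age_days=180)
```

The fields derived from relational tables (current settings and modification times) are left out, since they would be read from storage rather than declared in code.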
“Visible or not, you are still deploying code into production that you know for a fact to be buggy, untested, incomplete and quite possibly incompatible with your live data. Your if statements and configuration settings are themselves code which is subject to bugs – and furthermore can only be tested in production. ... Your features may not be as isolated from each other as you thought they were, and you may end up deploying bugs to your production environment.”
Given that, here are best practices for testing a Feature Toggle:
The following table summarizes test plans for the various toggle use cases while taking best practices into consideration.
|Use Case||Test Plan|
As mentioned previously, feature toggles inherently bring along code complexity. In order to manage the “toggle debt”, we need to keep their inventory at a minimum. The framework’s Removability and Report features make it possible to do so. But it must be accompanied by a proactive process of actually removing the toggles and their branches within the code.
In addition to using the Report as a central tool for overseeing the toggles, individual teams should create tickets in their backlogs for removing toggles according to their intended expiration dates.
Although feature toggles have been in use from the very early stages of development on the platform, the Feature Flags and Settings on edx-platform wiki was one of the first documents to capture our thoughts on the subject. It includes preliminary discussions on best practices as well.
Additionally, there have been recent episodes with end-to-end test failures resulting from ad-hoc changes to waffle settings on a central Staging environment.
In order to support the Report and Discoverability requirements, existing feature toggles that use waffle_utils will need to migrate to the new framework. This migration should be done in a short, focused effort as soon as the framework is ready.
Existing feature toggles that don’t use waffle_utils will need to migrate over gradually, as feasible.
edX applications that are not written in Django (for example, Ruby on Rails or Drupal applications) are currently considered technical debt. The expectation is that they will eventually be rewritten or migrated. If in the meantime they need to use feature toggles, they cannot use the Django-based waffle_utils and should therefore maintain their own application-specific document of feature toggle best practices.
Here are a few examples of usages of the waffle_utils classes:
Here are a few alternatives to using feature toggles.
As an alternative to using a Switch toggle for an Incremental Release, a team can make all their changes within a branch separate from the master branch. However, there are many pitfalls to using long-term feature branches, including drifting away from the main branch, which results in a painful conflict resolution experience upon merging back. Even if the team rebases often onto the main branch, they must resolve merge conflicts repeatedly, and their code remains hidden from and untested by the rest of the organization.
Specifying toggle configuration in environment variables or command-line arguments is difficult to coordinate across multiple nodes in a large deployment and requires redeployment and/or restarting each process.
Storing toggle configuration in separate files allows the configuration to be decoupled from the code and allows different deployments to override values. However, any change to the configuration requires a redeploy of the application.
Many features in the edX platform use JSON Configuration files to store their settings, including toggle configuration. It is recommended that features instead use a more dynamically configurable alternative such as Configuration Models or Feature Toggles, unless the setting (1) is security-sensitive or (2) is guaranteed not to need to change for a given Open edX deployment.
Examples of security-sensitive data are secret credentials (API keys, private keys, etc.) and private network identifiers (AWS S3 bucket names, external service hostnames, etc.).
A viable alternative to Feature Toggles is edX’s Django Configuration Model. Built on top of Django Models, it stores configuration in a relational table, provides an audit trail of changes, and supports granular permissions. Each feature creates its own Config Model, which allows the feature to include whatever additional Django Fields it requires. In fact, Config Models are the recommended framework for storing all non-boolean edX feature settings that need to be dynamically manipulated via Django Admin.
For light-weight boolean Feature Toggles, however, Config Models have proven to be difficult to clean up after use. The primary reason for this is that teams must manage a multi-phase rollout to remove columns or tables in a blue-green deployment since the previous version of the code continues to access the deleted column/table even after the database has been migrated.
On the other hand, the Waffle API is attractively simple and does not require database migrations since it uses a centralized generic table to store all Feature Toggles.
Since the well-maintained Waffle library already has extensive built-in capabilities for Rollout Toggles (controlling percentage of population) and Group Toggles (controlling users, roles, etc. via its Flag attributes), it provides a more comprehensive framework for Feature Toggles than Config Models do out of the box.
One thing to note, however, is the tradeoff made between (a) supporting Least Privilege (via Config Model) and (b) developer ease-of-use and code maintainability (via Waffle). Since each Config Model is stored in its own table, operators can easily place fine-grained control over who has access to which tables. This will be much harder to implement using Waffle, which stores all toggles in a shared table. With Waffle, we can easily detect, but not prevent, access to feature toggles.
There are various open-source service discovery and distributed configuration libraries that provide a flexible key-value storage to manage Feature Toggles amongst other dynamic configuration settings. For example, Zookeeper, Consul, and etcd are viable options.
Unlike Waffle and Config Models, these services provide out-of-the-box support for centrally managing and synchronizing configuration changes across all microservices in a distributed system. This is where we ultimately want to be.
However, since we expect that migrating our platform to use such a service will be a large undertaking, we are postponing that effort to a later date. In the meantime, this OEP focuses on enabling teams to align on a common strategy for dynamically configuring and managing application-specific Feature Toggles.