Recommended Practices for Continuous Delivery
1 - Agentic Continuous Delivery
With the advent of coding agents, large portions of source code can be generated by AI in a short period of time. However, unless additional artifacts are introduced, agentic drift, code quality issues, and technical debt accumulate quickly. By the time this becomes visible, it is often unmanageable and irreversible. Introducing additional first-class artifacts (“documentation artifacts”) addresses this challenge. These artifacts need to become part of the repository itself. Continuously maintaining and delivering these additional artifacts becomes the key control variable in the sustained use of agentic coding.
Agentic Extensions to MinimumCD
Agentic Continuous Delivery (ACD) is the application of Continuous Delivery in environments where software changes are proposed by agents. ACD exists to reliably constrain agent autonomy without slowing delivery.
It extends MinimumCD by the following constraints:
- Explicit, human-owned intent exists for every change
- Intent and architecture are represented as first-class artifacts
- All first-class artifacts are versioned and delivered together with the change
- Intended behavior is represented independently of implementation
- Consistency between intent, tests, implementation, and architecture is enforced
- Agent-generated changes must comply with all documented constraints
- Agents implementing changes must not be able to promote those changes to production
- While the pipeline is red, agents may only generate changes restoring pipeline health
These constraints are not mandatory practices. They describe the minimum conditions required to sustain delivery pace once agents are making changes to the system.
First-Class Artifacts
In ACD, every first-class artifact:
- Is required to exist
- Has a clearly defined authority
- Must always be consistent with other first-class artifacts
- Cannot be silently bypassed by agents or humans
It is part of the delivery contract, not a convenience.
- Agents may read any or all artifacts.
- Agents may generate some artifacts.
- Agents may not redefine the authority of any artifact.
- Humans own the accountability for the artifacts.
Artifact Overview
| Artifact | Role (Why it exists) | Authority | What it Constrains | Purpose in ACD | Example |
|---|---|---|---|---|---|
| Intent Description (Demand / Requirement) | Defines why the change exists | Human-owned intent | Scope, direction, outcome acceptability | Trust anchor for all other artifacts | “User activity data export required for compliance.” |
| User-Facing Behavior (Feature User Guide) | Defines what users experience | Externally observable semantics | Tests, behavior, backward compatibility | Prevent unexplained behavioral drift | “Export available under Profile: Activity Export.” |
| Feature Description (Implementation Manual) | Preserves implementation trade-offs | Engineering constraints | Technical decision boundaries, agent freedom | Prevent agentic trade-off drift | “Timestamped PDF retained for non-repudiability.” |
| Executable Truth (Test Scenarios) | Makes intent falsifiable | Pipeline-enforced correctness | Code, refactoring, optimization | Enforce consistency | “Tests validating report completeness.” |
| Implementation (Code) | Implements behavior | Runtime mechanics only | Fully constrained by other artifacts | Deliver the solution | Backend + frontend export logic |
| System Constraints (Architecture Overview) | Defines global invariants | System-level rules | All features, implementations, agent proposals | Maintain global integrity | “Always use MVC. No business logic in APIs.” |
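As a purely illustrative sketch, these artifacts might live in the repository alongside the code they constrain (the directory and file names below are assumptions, not a prescribed layout):

```text
repo/
├── intent/                  # Intent descriptions (demand / requirement), human-owned
├── docs/
│   ├── user-guide/          # User-facing behavior (feature user guide)
│   └── implementation/      # Feature descriptions (implementation manual)
├── architecture/            # System constraints (architecture overview)
├── tests/acceptance/        # Executable truth (test scenarios)
└── src/                     # Implementation (code)
```

Because the artifacts are versioned and delivered together, a single change can update intent, behavior, tests, and code in one commit, and the pipeline can check that none of them has drifted.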
Why These Artifacts Must Be Separate
They are intentionally overlapping in content but non-overlapping in authority. The content overlap is necessary to control drift.
This prevents a critical agentic failure mode:
Resolving inconsistency by rewriting the wrong thing, for example making a failing test pass by weakening the test instead of fixing the implementation.
Why This Is Not Documentation Overhead
ACD treats semantic artifacts as first-class to preserve consistent meaning over time.
In ACD:
- Artifacts are inputs to enforcement
- Not outputs for humans to read
They exist so that:
- tools can reference them
- agents can be constrained by them
- humans can steer exceptions and conflicts
Removing even a single first-class artifact reduces the reliability of the ACD reference frame.
Closing remarks
None of these artifacts is exclusive to working with coding agents; they should exist for any long-term development project. However, creating and maintaining them as part of the delivery process becomes crucial to minimize the risk of agent-induced failure.
2 - Feature Flag Guidance
Feature flags are a useful tool. However, they are often misused because people fail to consider other options for hiding incomplete features while integrating code frequently. Below is a decision tree that covers common reasons people reach for feature flags and why some of those reasons are wrong. Also, you don’t need a complicated tool for feature flags… until you do. See the section below the decision tree for examples of feature flag implementation based on use case.
```mermaid
graph TD
    Start[New Code Change] --> Q1{Is this a large or<br/>high-risk change?}
    Q1 -->|Yes| Q2{Do you need gradual<br/>rollout or testing<br/>in production?}
    Q1 -->|No| Q3{Is the feature<br/>incomplete or spans<br/>multiple releases?}
    Q2 -->|Yes| UseFF1[YES - USE FEATURE FLAG<br/>Enables safe rollout<br/>and quick rollback]
    Q2 -->|No| Q4{Do you need to<br/>test in production<br/>before full release?}
    Q3 -->|Yes| Q3A{Can you use an<br/>alternative pattern?}
    Q3 -->|No| Q5{Do different users/<br/>customers need<br/>different behavior?}
    Q3A -->|New Feature| NoFF_NewFeature[NO - NO FEATURE FLAG<br/>Connect to tests only,<br/>integrate in final commit]
    Q3A -->|Behavior Change| NoFF_Abstraction[NO - NO FEATURE FLAG<br/>Use branch by<br/>abstraction pattern]
    Q3A -->|New API Route| NoFF_API[NO - NO FEATURE FLAG<br/>Build route, expose<br/>as last change]
    Q3A -->|Not Applicable| UseFF2[YES - USE FEATURE FLAG<br/>Enables trunk-based<br/>development]
    Q4 -->|Yes| UseFF3[YES - USE FEATURE FLAG<br/>Dark launch or<br/>beta testing]
    Q4 -->|No| Q6{Is this an<br/>experiment or<br/>A/B test?}
    Q5 -->|Yes| UseFF4[YES - USE FEATURE FLAG<br/>Customer-specific<br/>toggles needed]
    Q5 -->|No| Q7{Does change require<br/>coordination with<br/>other teams/services?}
    Q6 -->|Yes| UseFF5[YES - USE FEATURE FLAG<br/>Required for<br/>experimentation]
    Q6 -->|No| NoFF1[NO - NO FEATURE FLAG<br/>Simple change,<br/>deploy directly]
    Q7 -->|Yes| UseFF6[YES - USE FEATURE FLAG<br/>Enables independent<br/>deployment]
    Q7 -->|No| Q8{Is this a bug fix<br/>or hotfix?}
    Q8 -->|Yes| NoFF2[NO - NO FEATURE FLAG<br/>Deploy immediately]
    Q8 -->|No| NoFF3[NO - NO FEATURE FLAG<br/>Standard deployment<br/>sufficient]
    style UseFF1 fill:#90EE90
    style UseFF2 fill:#90EE90
    style UseFF3 fill:#90EE90
    style UseFF4 fill:#90EE90
    style UseFF5 fill:#90EE90
    style UseFF6 fill:#90EE90
    style NoFF1 fill:#FFB6C6
    style NoFF2 fill:#FFB6C6
    style NoFF3 fill:#FFB6C6
    style NoFF_NewFeature fill:#FFB6C6
    style NoFF_Abstraction fill:#FFB6C6
    style NoFF_API fill:#FFB6C6
    style Start fill:#87CEEB
```
Feature Flag Implementation Approaches
Static Code-Based
- Hardcoded constants, configuration files, environment variables
- Changes require deployment or restart
- Best for: Stable flags, environment-specific behavior
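A minimal sketch of a static, environment-variable-driven flag (the flag and helper names are hypothetical):

```typescript
// Read once at startup: changing the flag requires a redeploy or restart.
const FLAGS = {
  activityExport: process.env.FLAG_ACTIVITY_EXPORT === "true",
};

export function isActivityExportEnabled(): boolean {
  return FLAGS.activityExport;
}
```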
Dynamic In-Process
- Database queries, cache lookups, file watching
- Changes take effect without restart
- Best for: Simple dynamic flags within a single application
Centralized Service
- Dedicated flag service (self-hosted or SaaS)
- HTTP/RPC calls to fetch flag state
- Best for: Multiple applications, complex targeting, team collaboration
Infrastructure Routing
- Load balancer rules, reverse proxy logic, service mesh routing
- Traffic directed based on headers, cookies, or user attributes
- Best for: Routing to entirely different services/versions
Edge/Gateway Level
- API gateway, CDN, edge computing platforms
- Flag evaluation at the network edge before reaching the application
- Best for: Global scale, minimal latency impact, frontend routing
Hybrid/Multi-Layer
- Combination of application logic + infrastructure routing
- Different layers for different concerns (kill switch vs. granular logic)
- Best for: Complex systems requiring defense in depth
Key Decision Factors
- Dynamism: How quickly must flags change? (deployment vs. runtime)
- Scope: Single service vs. multiple services vs. entire infrastructure
- Targeting complexity: Boolean vs. user segments vs. percentage rollouts
- Performance: Acceptable latency for flag evaluation
- Operational burden: What infrastructure can your team maintain?
- Cost: Build vs. buy tradeoffs
Temporary Feature Flag Lifecycle
Most feature flags should be temporary. Leaving flags in place indefinitely creates technical debt, increases complexity, and makes the codebase harder to maintain. Follow this lifecycle for temporary feature flags:
1. Create Flag with Removal Plan
Before writing any code, create the flag with a clear removal strategy:
Create backlog items:
- Story: Implement feature behind flag
- Cleanup story: Remove feature flag (add this immediately to backlog)
Document the flag:
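One possible shape for that documentation, kept next to the flag definition (the fields shown are illustrative, not a required schema):

```typescript
// Flag metadata tracked alongside the flag itself.
export const activityExportFlag = {
  name: "activity-export",
  owner: "reporting-team",        // accountable for removal
  createdOn: "2024-03-01",
  expiresOn: "2024-04-01",        // target removal date after full rollout
  removalTicket: "PROJ-1234",     // cleanup story created up front
  purpose: "Gradual rollout of the activity export feature",
};
```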
Set an expiration date - Most flags should be removed within 2-4 weeks after full rollout.
2. Deploy Flag in OFF State
Initial deployment validates the flag mechanism:
Commit and deploy with flag OFF:
- Validates flag can be toggled without code changes
- Confirms flag infrastructure is working
- No user-facing changes yet
3. Build Feature Incrementally
Integrate code to trunk daily while flag is OFF:
Each commit:
- Has passing tests for the new feature
- Doesn’t affect production users (flag is OFF)
- Integrates to trunk daily
4. Test in Production (Dark Launch)
Turn flag ON for internal users only:
Validation checklist:
- Feature works as expected in production environment
- No performance degradation
- Error rates remain normal
- Monitoring and alerts functioning
5. Gradual Rollout
Progressively increase user exposure:
Monitor at each stage:
- Error rates
- Performance metrics
- User feedback
- Business metrics
Rollback immediately if issues detected - This is the primary value of the flag.
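A common way to implement the percentage stages is to bucket users with a stable hash and compare it to the rollout percentage; a minimal sketch (the hash and function names are assumptions):

```typescript
// Deterministic bucket per user, so a user stays in or out of the rollout between requests.
function bucketFor(userId: string): number {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit string hash
  }
  return hash % 100;
}

export function isInRollout(userId: string, rolloutPercent: number): boolean {
  return bucketFor(userId) < rolloutPercent;
}

// Example: isInRollout("user-42", 5) is true for roughly 5% of users.
```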
6. Complete Rollout
Once 100% of users have the new feature:
Wait for stability period:
- Run at 100% for 3-7 days
- Confirm no issues emerge
- Verify rollback is no longer needed
7. Remove the Flag (CRITICAL)
This step must not be skipped:
Week 1-2 after 100% rollout:
- Prioritize the cleanup story
- Remove flag checks from code
- Delete flag configuration
- Remove flag from flag management system
Before and after the removal:
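A minimal sketch, assuming a hypothetical `flags` helper and an `activity-export` flag; the flag check and the old code path disappear together:

```typescript
type User = { id: string };

// Stand-ins for the real feature code (hypothetical).
declare const flags: { isEnabled(name: string, user: User): Promise<boolean> };
declare function renderActivityPage(user: User): Promise<string>;
declare function renderActivityPageWithExport(user: User): Promise<string>;

// Before removal: behavior is gated by the flag and the old path is still present.
async function getActivityPageBefore(user: User): Promise<string> {
  if (await flags.isEnabled("activity-export", user)) {
    return renderActivityPageWithExport(user);
  }
  return renderActivityPage(user);
}

// After removal: only the new code path remains.
async function getActivityPageAfter(user: User): Promise<string> {
  return renderActivityPageWithExport(user);
}
```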
Complete cleanup:
- Remove old implementation code
- Remove flag-related tests
- Remove flag documentation
- Update monitoring/alerts if needed
Lifecycle Timeline Example
| Day | Action | Flag State |
|---|---|---|
| 1 | Deploy flag infrastructure + create removal ticket | OFF |
| 2-5 | Build feature behind flag, integrate daily | OFF |
| 6 | Enable for internal users (dark launch) | ON for 0.1% |
| 7 | Enable for 1% of users | ON for 1% |
| 8 | Enable for 5% of users | ON for 5% |
| 9 | Enable for 25% of users | ON for 25% |
| 10 | Enable for 50% of users | ON for 50% |
| 11 | Enable for 100% of users | ON for 100% |
| 12-18 | Stability period (monitor) | ON for 100% |
| 19-21 | Remove flag from code | DELETED |
Total lifecycle: ~3 weeks from creation to removal
Long-Lived Feature Flags
Some flags are intentionally permanent and should be managed differently:
Operational Flags (Kill Switches)
- Purpose: Disable expensive features under load
- Lifecycle: Permanent
- Management: Treat as system configuration, document clearly
Customer-Specific Toggles
- Purpose: Different customers get different features
- Lifecycle: Permanent (tied to customer contracts)
- Management: Part of customer configuration system
Experimentation Flags
- Purpose: A/B testing and experimentation
- Lifecycle: Permanent flag, temporary experiments
- Management: Experiment metadata has expiration, flag infrastructure is permanent
Mark permanent flags clearly:
- Document why they’re permanent
- Different naming convention (e.g., `KILL_SWITCH_*`, `ENTITLEMENT_*`)
- Separate from temporary flags in management system
- Regular review to confirm still needed
Anti-Patterns to Avoid
Don’t skip the removal ticket:
- WRONG: “We’ll remove it later when we have time”
- RIGHT: Create removal ticket when creating the flag
Don’t leave flags indefinitely:
- WRONG: Flag still in code 6 months after 100% rollout
- RIGHT: Remove within 2-4 weeks of full rollout
Don’t create nested flags:
- WRONG: `if (featureA && featureB && featureC)`
- RIGHT: Each feature has an independent flag, removed promptly
Don’t forget to remove the old code:
- WRONG: Flag removed but old implementation still in codebase
- RIGHT: Remove flag AND old implementation together
Don’t make all flags permanent “just in case”:
- WRONG: “Let’s keep it in case we need to rollback in the future”
- RIGHT: After stability period, rollback is via deployment, not flag
3 - Work in Small Batches
We need to reduce batch size because smaller batches of work are easier to verify, tend to fail small, make us less likely to fall for the sunk-cost fallacy, and amplify feedback loops. How small should they be? As small as we can make them while still getting production feedback on what we are trying to learn. Working to reduce batch size acts as a forcing function for exposing and removing hidden waste in upstream processes. There are several batch sizes we are trying to reduce.
Deploy More Often
How often are we delivering changes to the end user? A common mistake is to only deploy completed features. It is far better to deploy something as soon as the pipeline certifies a change will not break the end-user. This could be as small as some tested code that won’t be used until several other small changes are delivered.
There are two common arguments against increasing deploy frequency. The first is a misunderstanding of “valuable”. “We don’t want to deliver incomplete features because the customer can’t use them so we aren’t delivering any value.”
There are more stakeholders requiring value than the end-user. One of those is the product team. We are reducing the level of inventory waste in our flow and getting rapid feedback that we haven’t broken existing behaviors with the new change. This gives us feedback on our quality gates and also lowers the risks of delivering a production fix.
The second objection is “our customers don’t want changes that frequently.”
This comes from a misunderstanding of what CD is for. Yes, we can deliver features with continuous delivery. However, a primary purpose of CD is production support. When production has an incident or a new zero-day vulnerability appears, customers do want changes that frequently to resolve those problems. Can we deliver them? By improving delivery frequency, we continuously verify that we can still deliver those fixes safely.
Commit Smaller Changes
“Following our principle of working in small batches and building quality in, high-performing teams keep branches short-lived (less than one day’s work) and integrate them into trunk/master frequently. Each change triggers a build process that includes running unit tests. If any part of this process fails, developers fix it immediately.”
– Accelerate by Nicole Forsgren Ph.D., Jez Humble & Gene Kim
How small is small? One change a day is big. Smaller than that. These are not feature complete changes. They are small, tested changes that can be delivered to production if certified by the pipeline.
Solving the problems required to meet the definition of CI is foundational for the efforts to improve the organization. It is very effective at uncovering that we need to improve testing, learn how to use evolutionary coding practices, understand trunk-based development, learn to decompose work better, and learn how to work as a team better. It’s also effective at shining a light on upstream issues.
Refine Smaller Stories
How small is small? It’s typical for teams who have only been taught Scrum to refine work until it can “fit in the sprint.” Therefore, 5 to 10 day stories are very common. It’s also very common for those to take 10 to 15 days to actually be delivered due to the lack of clarity in the stories. To resolve this, we shrink the time box for a story and then fix everything that prevents us from staying within that time box.
In 2012, Paul Hammant, author of “Trunk-Based Development and Branch by Abstraction” made the following suggestion:
“Story sizes should average as close to one day as possible. If they don’t, your Agile project is going to be harder for nearly everyone involved. If your average is significantly greater than that one day, then change something until you get there.”
This may sound unachievable, but we have seen how effective this is in the enterprise Dojos. A primary workflow for Dojos is the “hyper-sprint”. A hyper-sprint lasts for 2.5 days and includes refining work, doing the work, delivering the work, and retrospecting on how to do it better next time. Teams fail for a few weeks but then learn the skills and teamwork required to slice stories into much thinner value increments with fully testable acceptance criteria and deliver them as a team. Coding moves from exploration to implementation and quality feedback and throughput accelerate. It’s very common for a team’s throughput to double in 6-8 weeks with the right guidance. Again, this acts as a forcing function for uncovering and removing upstream impediments with missing product information, external hard dependencies with other teams, Change Advisory Board compliance theater, or other organizational issues.
How to Decompose Work
Start with Behavior-Driven Development (BDD)
Behavior-Driven Development is the collaborative process where we discuss the intent and behaviors of a feature and document the understanding in a declarative, testable way. BDD is the foundation for both decomposing stories AND breaking work into very small changes for daily integration.
BDD is not a technology or automated tool - it is the process of defining behavior. Teams can then automate tests for those behaviors.
Writing Testable Acceptance Criteria
Use the Given-When-Then format (Gherkin) to express behaviors as “Arrange, Act, Assert” steps that all stakeholders understand:
Example - User Story: “As a user, I want to clock out”
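One way those criteria might read as Gherkin scenarios (the specific rules, such as the 5-minute minimum, are taken from the example stories below):

Scenario: Clock out after minimum time
Given I am clocked in for more than 5 minutes
When I enter my associate number
Then I will be clocked out

Scenario: Prevent early clock out
Given I am clocked in for less than 5 minutes
When I enter my associate number
Then I will see an error message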
These testable acceptance criteria should be the Definition of Done for a user story.
Using BDD to Decompose Stories
If a story has more than 6 acceptance criteria, it can probably be split. Each scenario represents a potential smaller story:
Large story: “User can clock in and out”
Split into smaller stories using scenarios:
Story 1: Clock out after minimum time
Given I am clocked in for more than 5 minutes
When I enter my associate number
Then I will be clocked out
Story 2: Prevent early clock out
Given I am clocked in for less than 5 minutes
When I enter my associate number
Then I will see an error message
Story 3: Clock in with validation
Given I am clocked out
When I enter a valid associate number
Then I will be clocked in
Each story has clear, testable acceptance criteria that can be completed in 1-2 days.
Using BDD for Technical Decomposition
BDD helps achieve daily integration by breaking implementation into testable increments. Each acceptance criterion becomes a series of small, testable changes that can be implemented using acceptance-test driven development:
Scenario: Clock out after minimum time
Day 1, Commit 1:
Day 1, Commit 2:
Day 2, Commit 1:
Day 2, Commit 2:
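As a hedged illustration of the first of those commits (all names are hypothetical), each commit pairs the new behavior with the test that proves it; later commits add the remaining layers and expose the behavior last:

```typescript
import { describe, it, expect } from "vitest"; // test framework is an assumption

// Day 1, Commit 1: the minimum-time rule plus the acceptance test that drives it.
export function canClockOut(minutesClockedIn: number): boolean {
  return minutesClockedIn > 5; // rule from the "Clock out after minimum time" scenario
}

describe("Clock out after minimum time", () => {
  it("allows clock out after more than 5 minutes", () => {
    expect(canClockOut(6)).toBe(true);
  });

  it("rejects clock out at 5 minutes or less", () => {
    expect(canClockOut(5)).toBe(false);
  });
});
```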
Each commit:
- Has a passing test
- Doesn’t break existing functionality
- Can be integrated to trunk
- Moves incrementally toward the complete feature
Key principle: No acceptance test should contain more than 10 conditions. Scenarios should be focused on a specific function and should not attempt to describe multiple behaviors.
Vertical Slicing
After using BDD to define clear behaviors, each task should be sliced vertically, not horizontally. A vertical slice delivers complete functionality across all of the technical layers needed to respond to a request.
Technical Decomposition
The same BDD process used for story slicing applies to technical decomposition at the service, module, and function level. This enables daily integration while building complex features.
Acceptance Test-Driven Development (ATDD)
ATDD is the practice of writing executable acceptance tests before writing implementation code. Each small change follows this workflow:
ATDD Workflow (Red-Green-Refactor):
- Write the acceptance test (in Gherkin or your test framework)
- Run the test - it should fail (RED) because the behavior doesn’t exist yet
- Write minimal code to make the test pass
- Run the test - it should now pass (GREEN)
- Refactor if needed while keeping the test green
- Commit to trunk with passing tests
Key Benefits:
- Forces clarity about what “done” means before coding
- Provides immediate feedback when implementation is complete
- Creates living documentation of system behavior
- Prevents scope creep and over-engineering
- Ensures every commit has test coverage
Service-Level Decomposition Example
When decomposing work for a service or API, use the same BDD approach to define testable increments:
Scenario: Add order history endpoint
Day 1, Commit 1: Database query capability
ATDD Implementation:
- Write integration test (fails - RED)
- Create database query function
- Test passes (GREEN)
- Commit to trunk
Day 1, Commit 2: Service layer mapping
ATDD Implementation:
- Write service layer test (fails - RED)
- Create mapping function
- Test passes (GREEN)
- Commit to trunk
Day 2, Commit 1: API endpoint (not connected to routes)
ATDD Implementation:
- Write API contract test (fails - RED)
- Create endpoint handler (not yet exposed in routes)
- Test passes (GREEN)
- Commit to trunk
Day 2, Commit 2: Connect to routing (feature goes live)
ATDD Implementation:
- Write routing integration test (fails - RED)
- Add route configuration and middleware
- Test passes (GREEN)
- Commit to trunk - Feature now live
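A hedged sketch of those last two commits using Express (route path, handler, and service names are assumptions): the handler exists and is tested before anything can reach it, and wiring the route is the final, smallest change.

```typescript
import express, { Request, Response } from "express";

// Hypothetical service built in the earlier commits.
declare const orderService: { historyFor(userId: string): Promise<unknown[]> };

// Day 2, Commit 1: handler exists and is contract-tested, but no route points at it yet.
export async function getOrderHistory(req: Request, res: Response): Promise<void> {
  const orders = await orderService.historyFor(req.params.userId);
  res.json({ userId: req.params.userId, orders });
}

// Day 2, Commit 2: the only change is exposing the route; the feature goes live.
export function registerRoutes(app: express.Express): void {
  app.get("/api/users/:userId/orders", getOrderHistory);
}
```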
Module-Level Decomposition
For complex modules, break down into testable behaviors at the function level:
Example: Payment processing module
Instead of committing an entire payment module at once, decompose into daily commits:
Day 1:
- Payment validation (amount > 0, currency valid, card not expired)
- Test: `validatePaymentRequest()` rejects invalid inputs (sketched below)
Day 2:
- Payment gateway adapter interface
- Test: Mock adapter responds to payment requests
Day 3:
- Retry logic for failed payments
- Test: Payment retries with exponential backoff
Day 4:
- Payment event logging
- Test: All payment attempts are logged with transaction ID
Day 5:
- Connect payment module to checkout flow
- Test: End-to-end checkout with payment succeeds
Each commit has passing tests, doesn’t break existing functionality, and can be integrated to trunk.
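As a hedged sketch of the Day 1 increment, `validatePaymentRequest()` might look like this (field names and supported currencies are assumptions):

```typescript
type PaymentRequest = {
  amount: number;
  currency: string;
  cardExpiry: Date;
};

const SUPPORTED_CURRENCIES = new Set(["USD", "EUR", "GBP"]);

// Day 1 increment: pure validation, with no gateway, retry, or logging concerns yet.
export function validatePaymentRequest(request: PaymentRequest, now: Date = new Date()): string[] {
  const errors: string[] = [];
  if (request.amount <= 0) errors.push("amount must be greater than zero");
  if (!SUPPORTED_CURRENCIES.has(request.currency)) errors.push("unsupported currency");
  if (request.cardExpiry <= now) errors.push("card is expired");
  return errors; // an empty array means the request is valid
}
```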
Contract Testing for API Changes
When changes affect API contracts (interfaces between services), define the contract as executable tests:
ATDD Implementation:
- Write contract test defining the API behavior (fails - RED)
- Implement the API endpoint to satisfy the contract
- Contract test passes (GREEN)
- Commit to trunk
Any changes to the API contract should be reflected as modified contract tests, integrated to trunk with passing tests.
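A hedged sketch of such a contract test, assuming an Express app exercised with supertest and a vitest/jest-style runner (the endpoint and response shape are illustrative):

```typescript
import { describe, it, expect } from "vitest";
import request from "supertest";
import { app } from "./app"; // hypothetical Express app under test

describe("GET /api/users/:userId/orders contract", () => {
  it("returns the agreed response shape", async () => {
    const response = await request(app)
      .get("/api/users/user-42/orders")
      .expect(200)
      .expect("Content-Type", /json/);

    expect(response.body).toEqual(
      expect.objectContaining({
        userId: "user-42",
        orders: expect.any(Array),
      })
    );
  });
});
```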
Anti-Patterns to Avoid
Don’t slice by technical layer:
- WRONG: “Frontend story” + “Backend story” + “Database story”
- RIGHT: One story delivering complete functionality
Don’t slice by activity:
- WRONG: “Design story” + “Implementation story” + “Testing story”
- RIGHT: One story including all activities
Don’t create dependent stories:
- WRONG: “Story B can’t start until Story A is deployed”
- RIGHT: Each story is independently deployable
Don’t lose testability:
- WRONG: “Refactor database layer” (no testable user behavior)
- RIGHT: “Improve search performance to < 2 seconds” (measurable outcome)
Working Within a Vertical Slice
While stories should be sliced vertically, multiple developers can work on the same story, with each developer picking up a task that represents a layer of the slice. The key is that the story isn’t considered “done” until all layers are complete and the feature is deployable.