Rollback On-demand
Definition
Rollback on-demand means the ability to quickly and safely revert to a previous working version of your application at any time, without requiring special approval, manual intervention, or complex procedures. It should be as simple and reliable as deploying forward.
Key principles:
- Fast: Rollback completes in minutes, not hours
- Automated: No manual steps or special procedures
- Safe: Rollback is validated just like forward deployment
- Simple: Single command or button click initiates rollback
- Tested: Rollback mechanism is regularly tested, not just used in emergencies
Why This Matters
Without reliable rollback capability:
- Fear of deployment: Teams avoid deploying because failures are hard to recover from
- Long incident resolution: Hours wasted debugging instead of immediately reverting
- Customer impact: Users suffer while teams scramble to fix issues
- Pressure to “fix forward”: Teams rush incomplete fixes instead of safely rolling back
- Deployment delays: Risk aversion slows down release cycles
With reliable rollback:
- Deployment confidence: Knowing you can roll back reduces fear
- Fast recovery: Minutes to restore service instead of hours
- Reduced risk: Bad deployments have minimal customer impact
- Better decisions: Teams can safely experiment and learn
- Higher deployment frequency: Confidence enables more frequent releases
What “Rollback On-demand” Means
Rollback is a Deployment
Rolling back means deploying a previous artifact version through your standard pipeline:
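As a minimal sketch of this idea, the following assumes a hypothetical `Pipeline` class standing in for a real deployment pipeline. The class and its method names are illustrative only. The point is that `rollback()` has no logic of its own beyond choosing a version: it calls the same `deploy()` path used for forward releases.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    """Hypothetical stand-in for a real deployment pipeline."""

    # Versions deployed so far, oldest first.
    history: list = field(default_factory=list)

    def deploy(self, version: str) -> None:
        # The single code path used for every release, forward or backward.
        self.history.append(version)

    @property
    def current(self) -> str:
        return self.history[-1]

    def rollback(self) -> str:
        # Pick the most recent version that differs from the running one
        # and redeploy it through the same deploy() path.
        for version in reversed(self.history):
            if version != self.current:
                self.deploy(version)
                return version
        raise RuntimeError("no previous version to roll back to")
```

Because the rollback is recorded as one more deployment, the history shows what ran and when, including the rollback itself.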
Not this: a separate manual procedure that bypasses the pipeline.
Rollback is Tested
Rollback mechanisms should be tested regularly, not just during incidents:
- Practice rollbacks during non-critical times
- Include rollback tests in your pipeline
- Time your rollback to ensure it meets SLAs
- Verify that the service is healthy after rollback (smoke tests pass, no new errors)
Rollback is Fast
Rollback should be faster than forward deployment:
- Skip build stage (artifact already exists)
- Skip the full test stage (artifact was already tested); post-deployment smoke tests still run
- Go straight to deployment with previous artifact
Target: < 5 minutes from rollback decision to service restored.
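The stage-skipping logic above can be sketched as a small function. The stage names and the `stages_for` helper are assumptions made for illustration, not part of any particular pipeline tool.

```python
# Hypothetical stage names for a forward deployment.
FORWARD_STAGES = ["build", "test", "deploy", "smoke_test"]


def stages_for(artifact_exists: bool, already_tested: bool) -> list:
    """Return the stages a run must execute, skipping work already done."""
    skipped = set()
    if artifact_exists:
        skipped.add("build")  # the artifact is already in the registry
    if already_tested:
        skipped.add("test")   # the artifact passed the test stage when first released
    return [stage for stage in FORWARD_STAGES if stage not in skipped]
```

For a rollback both conditions hold, so only the deploy and smoke-test stages remain, which is where the time saving comes from.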
Rollback is Safe
Rollback should:
- Deploy through the same pipeline (not a manual process)
- Run smoke tests to verify the rollback worked
- Update monitoring and alerts to reflect the version now running
- Maintain audit trail
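One way to picture these safety properties together is the sketch below. It assumes the pipeline's deploy step and a smoke test are available as callables, and it uses an in-memory list as the audit log; all of these names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Callable


def safe_rollback(version: str, actor: str,
                  deploy: Callable[[str], None],
                  smoke_test: Callable[[], bool],
                  audit_log: list) -> bool:
    """Redeploy `version`, verify it, and record what happened."""
    deploy(version)         # same pipeline step a forward release uses
    healthy = smoke_test()  # confirm the rollback actually restored service
    # Record the rollback whether or not it succeeded.
    audit_log.append(json.dumps({
        "event": "rollback",
        "version": version,
        "actor": actor,
        "healthy": healthy,
        "at": datetime.now(timezone.utc).isoformat(),
    }))
    return healthy
```

A real implementation would write the audit record to durable storage and would also notify monitoring of the version change.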
Example Implementations
Anti-Pattern: Manual Rollback Process
Problem: Slow, manual, error-prone, no validation.
Good Pattern: Automated Rollback
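A minimal sketch of what a single-command rollback could look like follows. The command name `rollback`, its `--env` and `--to` options, and the shape of the returned request are all assumptions for illustration; a real tool would hand the request to the pipeline's API instead of returning it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="rollback",  # hypothetical command name
        description="Redeploy a previous artifact through the standard pipeline.",
    )
    parser.add_argument("--env", default="production", help="target environment")
    parser.add_argument("--to", dest="version", default="previous",
                        help="artifact version to restore (default: previous release)")
    return parser


def plan_rollback(argv: list) -> dict:
    """Turn command-line arguments into the request handed to the pipeline."""
    args = build_parser().parse_args(argv)
    # This sketch only returns the request so the shape of the command is visible.
    return {
        "action": "deploy",
        "environment": args.env,
        "version": args.version,
        "skip_stages": ["build", "test"],
        "run_smoke_tests": True,
    }
```

Wired to an entry point, something like `rollback --env production --to 1.4.0` would then be the entire operator-facing procedure.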
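Usage: A single command or button click with the target version as input; everything after that is automated.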
Benefit: Fast, automated, validated, audited.
What is Improved
- Mean Time To Recovery (MTTR): Drops from hours to minutes
- Deployment frequency: Increases due to reduced risk
- Team confidence: Higher willingness to deploy
- Customer satisfaction: Faster incident resolution
- Learning: Teams can safely experiment
- On-call burden: Reduced stress for on-call engineers
Common Patterns
Blue-Green Deployment
Maintain two identical environments:
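The sketch below models blue-green switching with a hypothetical `BlueGreenRouter` class. It assumes the previous version keeps running in the idle environment after a release, so rolling back is only a traffic switch.

```python
class BlueGreenRouter:
    """Hypothetical router that sends all traffic to one of two environments."""

    def __init__(self, blue_version: str, green_version: str, live: str = "blue"):
        self.versions = {"blue": blue_version, "green": green_version}
        self.live = live

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    @property
    def live_version(self) -> str:
        return self.versions[self.live]

    def release(self, version: str) -> None:
        # Deploy to the idle environment, then move traffic to it.
        self.versions[self.idle] = version
        self.live = self.idle

    def rollback(self) -> None:
        # The previous version is still running in the idle environment,
        # so rolling back only moves traffic back.
        self.live = self.idle
```

In this model no artifact is redeployed during rollback, which is why blue-green rollbacks can complete in seconds.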
Canary Rollback
Roll back gradually:
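A gradual rollback might be sketched as follows. The function assumes two hypothetical callables: one that sets the percentage of traffic sent to the previous version, and one that reports whether the service looks healthy. The step sizes are illustrative defaults.

```python
from typing import Callable, Iterable


def canary_rollback(set_previous_share: Callable[[int], None],
                    is_healthy: Callable[[], bool],
                    steps: Iterable[int] = (10, 50, 100)) -> int:
    """Shift traffic back to the previous version step by step.

    Returns the percentage of traffic on the previous version when the
    function stopped.
    """
    reached = 0
    for share in steps:
        set_previous_share(share)
        reached = share
        if not is_healthy():
            break  # hold at this share so someone can investigate
    return reached
```

Stopping at the first unhealthy step is one possible policy; a team might instead choose to jump straight to 100% when user impact is severe.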
Feature Flag Rollback
Disable problematic features without redeploying:
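As an illustration, the sketch below uses a hypothetical in-memory `FeatureFlags` store and a made-up `new_checkout` flag. A real system would typically read flags from a flag service or configuration store.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store."""

    def __init__(self, flags: dict):
        self._flags = dict(flags)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, the safe choice for new features.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False


def checkout(flags: FeatureFlags) -> str:
    # Both code paths ship in the same artifact; the flag picks one at runtime.
    if flags.is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"
```

Calling `flags.disable("new_checkout")` returns users to the legacy path without building or deploying anything.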
Database-Safe Rollback
Design schema changes to support rollback:
Use expand-contract pattern:
- Expand: Add new column (both versions work)
- Migrate: Start using new column
- Contract: Remove old column (later, when safe)
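The steps above can be sketched with an in-memory SQLite database. The `users` table and its `fullname` and `display_name` columns are invented for the example. The check at the end is that code written for the old schema keeps working after the expand step, which is what makes a code rollback safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column as nullable so old INSERTs keep working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def old_insert(name: str) -> None:
    # Old application version: knows nothing about display_name.
    conn.execute("INSERT INTO users (fullname) VALUES (?)", (name,))


def new_insert(name: str) -> None:
    # Migrate: the new version writes both columns while the old one may return.
    conn.execute("INSERT INTO users (fullname, display_name) VALUES (?, ?)",
                 (name, name))


def old_read() -> list:
    return [row[0] for row in conn.execute("SELECT fullname FROM users ORDER BY id")]


def new_read() -> list:
    # Falls back to fullname for rows written by the old version.
    query = "SELECT COALESCE(display_name, fullname) FROM users ORDER BY id"
    return [row[0] for row in conn.execute(query)]


new_insert("Grace Hopper")  # written by the new version
old_insert("Alan Turing")   # written after rolling back to the old version

# Contract (dropping fullname) would happen later, once rollback to the
# old version is no longer needed; it is deliberately not run here.
```

After both inserts, `old_read()` and `new_read()` return the same three names, so either application version can serve traffic against the expanded schema.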
Artifact Registry Retention
Keep previous artifacts available:
Ensures you can always roll back to recent versions.
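A retention rule could be sketched as below. The function name, its parameters, and the default thresholds are assumptions; the idea is to keep an artifact if it is among the most recent releases or is newer than an age cutoff, whichever keeps more.

```python
from datetime import datetime, timedelta


def artifacts_to_keep(releases: list, now: datetime,
                      min_releases: int = 5, max_age_days: int = 90) -> set:
    """Return the versions to retain.

    `releases` is a list of (version, released_at) tuples.
    """
    newest_first = sorted(releases, key=lambda release: release[1], reverse=True)
    cutoff = now - timedelta(days=max_age_days)
    keep = set()
    for index, (version, released_at) in enumerate(newest_first):
        # Keep the newest N regardless of age, plus anything recent enough.
        if index < min_releases or released_at >= cutoff:
            keep.add(version)
    return keep
```

Anything not in the returned set is a candidate for deletion from the registry.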
FAQ
How far back should we be able to roll back?
Minimum: Last 3-5 production releases. Ideally: Any production release from the past 30-90 days. Balance storage costs with rollback flexibility.
What if the database schema changed?
Design schema changes to be backward-compatible:
- Use expand-contract pattern
- Make schema changes in a separate deployment from code changes
- Test that old code works with new schema
What if we need to roll back the database too?
Database rollbacks are risky. Instead:
- Design schema changes to support rollback (backward compatibility)
- Use feature flags to disable code using new schema
- If absolutely necessary, have tested database rollback scripts
Should rollback require approval?
For production: On-call engineer should be empowered to roll back immediately without approval. Speed of recovery is critical. Post-rollback review is appropriate, but don’t delay the rollback.
How do we test rollback?
- Practice regularly: Perform rollback drills during low-traffic periods
- Automate testing: Include rollback in your pipeline tests
- Use staging: Test rollback in staging before production deployments
- Chaos engineering: Randomly trigger rollbacks to ensure they work
What if rollback fails?
Have a rollback-of-rollback plan:
- Roll forward to the next known-good version
- Use feature flags to disable problematic features
- Have an out-of-band deployment method (last resort)
But if rollback is regularly tested, failures should be rare.
How long should rollback take?
Target: < 5 minutes from decision to service restored.
Breakdown (aim for the low end of each range; the upper bounds add up to 5.5 minutes):
- Trigger: < 30 seconds
- Deploy: 2-3 minutes
- Verify: 1-2 minutes
What about configuration changes?
Configuration should be versioned with the artifact. Rolling back the artifact rolls back the configuration. See Application Configuration.
Health Metrics
- Rollback success rate: Should be > 99%
- Mean time to rollback: Should be < 5 minutes (distinct from Mean Time To Recovery, which this document abbreviates as MTTR)
- Rollback test frequency: At least monthly
- Rollback usage: Track how often rollback is used (helps justify investment)
- Failed rollback incidents: Should be nearly zero
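As a rough sketch, the first two metrics and the failure count could be computed from a log of rollback events. The record shape used here (`succeeded`, `minutes`) is an assumption for illustration.

```python
from statistics import mean


def rollback_health(records: list) -> dict:
    """Summarize rollback records shaped like {"succeeded": bool, "minutes": float}."""
    if not records:
        return {"success_rate": None, "mean_minutes": None, "failed": 0}
    successes = [record for record in records if record["succeeded"]]
    return {
        "success_rate": len(successes) / len(records),
        # Time to roll back is only meaningful for rollbacks that completed.
        "mean_minutes": mean(r["minutes"] for r in successes) if successes else None,
        "failed": len(records) - len(successes),
    }
```

Comparing the result against the thresholds above (success rate above 0.99, mean under 5 minutes) gives a simple pass or fail signal for a dashboard.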
Additional Resources
- Site Reliability Engineering: Release Engineering
- Martin Fowler: Blue-Green Deployment
- Martin Fowler: Canary Release
- Dave Farley: Rollback and Roll Forward
- Refactoring Databases: Evolutionary Database Design - Scott Ambler, Pramod Sadalage