Production Ready Checklist
This document considers the production readiness of a solution from an infrastructure and applicaton perspective.
- Backups and Restoring
- Code Efficiency and Performance
- Code Readability
- Continuous Delivery
- Continuous Integration
- Cost
- Documentation
- High Availability
- Logging
- Maintenance
- Monitoring
- Networking
- Performance
- Provisioning and Deployment
- Scalability
- Security
- Testing
Backups and Restoring
- All code is in shared repository
- Configure backup
- Configure cross-account backup
- Event Storage can be backed up
- Event Storage is routinely backed up
- Event Storage restore has been tested
- Solution can be easily deployed via pipeline
- Test your backups
Code Efficiency and Performance
- Code performance has been measured and tested
Code Readability
- Code follows standard naming style conventions
- Code has sufficient documentation for increased readability
- Code is being formatted automatically
- Code is being linted automatically
Continuous Delivery
- Automate your deployments
- Create deployment environments
- Define your infrastructure as code
- Do zero-downtime deployments
- Promote artifacts
- Roll back in case of failure
- Set up immutable infrastructure
- Set up per-environment configuration
- Test your infrastructure code
- Use canary deployments
- Use feature toggles
Continuous Integration
- Configure a build system
- Configure static analysis
- Do code reviews
- Pick a Version Control System
- Publish versioned artifacts
- Set up a build server
- Set up automated tests
- Set up automatic code formatting
- Use dependency management
Cost
- Clean up old data with S3 Lifecycle settings
- Clean up unused resources
- Cost analysis has been done against all resources used by Solution
- Create billing alarms
- Learn to analyze your AWS bill
- Pick proper EC2 Instance types and sizes
- Resources are on a pay per use model
- Shut down EC2 Instances and RDS DBs when not using them
- Use Auto Scaling
- Use Docker when possible
- Use Lambda when possible
- Use Reserved EC2 Instances for dedicated work
- Use Spot EC2 Instances for background jobs
Documentation
- An architecture diagram is in place
- Code has sufficient comments and docstrings
- Documentation is kept within the repository
- Operational Runbooks are included
- Updated and accurate README is included in repository
High Availability
- Configure Auto Recovery
- Configure graceful degradation
- Perform load tests and use chaos engineering
- Solution has load balancing for all requests
- Solution is run concurrently in multiple regions
- Solution requires no downtime for maintenance
Logging
- Solution has logging in place through Cloudwatch Logs
Maintenance
- Configure disk space
- Set up cron jobs
- Solution has sufficient documentation for code maintenance to take place
- Solution has sufficient high availability for maintenance to take place
- Solution has sufficient monitoring for maintenance to take place
Monitoring
- Configure services for observability
- Set up alerts
- Solution has alerting in place
- Solution has desired monitoring defined
- Solution has monitoring in place
- Store logs
- Track application metrics
- Track availability metrics
- Track business metrics
- Track server metrics
Networking
- Configure DNS using Route 53
- Configure Network ACLs
- Configure Security Groups
- Configure Static IPs
- Scalability in place for sudden increase in incoming network traffic
- Set up subnets
- Set up VPCs
- Solution has load balancing in front of exposed HTTP endpoints
- Stress test has been done on volume of webhook ingestion
Performance
- Degraded performance is handled gracefully
- Expected performance in ideal conditions is defined
- Stress test has been done on solution performance
Provisioning and Deployment
- Build AMIs
- Build Docker images
- Configure CPU, memory and GC settings
- Configure hard drives
- Deploy AMIs using Auto Scaling Groups
- Deploy Docker images using ECS, EKS or Fargate
- Deploy serverless apps using Lambda and API Gateway
- Solution can be deployed automatically via a pipeline
- Solution has release tags
- Solution is defined in Infrastructure as code
- Solution pipeline has multiple environments for dev and prod
Scalability
- Set up load balancing
- Solution resources can auto scale to handle workload
- Stress test has been done on autoscaling
- Use multiple Availability Zones
- Use multiple Instances
Security
- API usage requires credentials or authentication
- AWS IAM Permissions are minimal, users only have permissions they need
- Configure encryption at rest
- Create a password policy and enforce MFA
- Create cross-account IAM Roles
- Create IAM Groups
- Create IAM Roles
- Create IAM Users
- Data is encrypted at REST
- Deploy a Bastion Host
- Deploy a VPN Server
- Go through a security audit
- Go through the OWASP Top 10
- Network traffic is over HTTPS
- No secrets exposed in repository
- Record audit Logs
- Set up a secrets management solution
- Set up SSH access
- Sign up for security advisories
- Use server hardening practices
Testing
- Automated coverage report is being carried out
- Automated testing is being utilised in the pipeline
- Testing safety net is in place