Reliable Software Systems
Writing the application is only the tip of the iceberg. Keeping it robust and reliable, with few incidents, takes a set of policies, procedures, and best practices that your team follows consistently.
This section aligns with the principles of the Twelve-Factor App methodology and DevOps. Following the practices listed here should make your system more available and more reliable, and should reduce the number of incidents.
Reliability Checklist
- Version Control: Code and configuration should be tracked in a version control system.
- Code Quality: Code should be automatically checked against defined quality metrics, such as linting, static analysis, and test coverage.
- Separation of Duties: Peer review should be required before code is merged.
- Configuration Management: It should be possible to reliably re-create your infrastructure from code.
- CI/CD: You should have an automated way to build, test, and deploy your applications.
- Testing: There should be ample automated tests of your system that run routinely.
- Environment Parity: Your environments should be as similar as possible.
- Application States and Signals: Your applications should expose health information and be able to gracefully shut down.
- Boot and Shutdown Times: Applications should boot and shut down promptly so that they can be scaled up and down quickly.
- Stateless: Your applications should ideally be stateless.
- Deployment Patterns: You should be able to update your applications safely with zero downtime.
- High Availability: Your critical applications should be replicated across availability zones and regions. They should automatically scale based on demand.
- Load Balancing: Traffic to your services should be balanced among your replicas.
- Monitoring: You should have ample means to investigate, troubleshoot, and observe your system. You should be alerted if it is unhealthy.
- Security: Your system should follow industry best practices for security and implement defense in depth.
- Documentation: Thoroughly document your code and services for fellow engineers and end users.
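
To make the "Application States and Signals" item more concrete, here is a minimal sketch of a service that exposes a health endpoint and shuts down gracefully when it receives SIGTERM or SIGINT. It uses only the Python standard library. The `/healthz` path, port 8080, and the names `HealthHandler`, `GracefulServer`, `make_server`, `begin_shutdown`, and `install_signal_handlers` are assumptions chosen for illustration, not part of any particular framework or platform.

```python
import json
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once shutdown begins, so health checks start failing and a load
# balancer can stop routing new traffic to this instance.
shutting_down = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":  # hypothetical health-check path
            self.send_error(404)
            return
        draining = shutting_down.is_set()
        status = 503 if draining else 200
        body = json.dumps({"status": "draining" if draining else "ok"}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


class GracefulServer(ThreadingHTTPServer):
    # Non-daemon handler threads, so server_close() waits for requests
    # that are still in flight.
    daemon_threads = False


def make_server(port=8080):
    # Port 8080 is an arbitrary default; pass 0 to let the OS pick one.
    return GracefulServer(("127.0.0.1", port), HealthHandler)


def begin_shutdown(server):
    """Mark the instance as draining and stop the accept loop."""
    shutting_down.set()
    # shutdown() blocks until serve_forever() returns, so it must not be
    # called from the thread that is running serve_forever().
    threading.Thread(target=server.shutdown, daemon=True).start()


def install_signal_handlers(server):
    # Must be called from the main thread.
    def handle(signum, frame):
        begin_shutdown(server)

    signal.signal(signal.SIGTERM, handle)
    signal.signal(signal.SIGINT, handle)


if __name__ == "__main__":
    server = make_server()
    install_signal_handlers(server)
    server.serve_forever()   # returns after begin_shutdown() is triggered
    server.server_close()    # release the socket and wait for handlers
```

In this sketch the health endpoint returns 200 while the service is running and 503 once shutdown has started. A real deployment would typically also check dependencies such as databases in the health handler, and might wait briefly between failing the health check and closing the listener so that load balancers have time to notice.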