Reliable Software Systems

Writing applications is only the tip of the iceberg. If you want them to be robust and reliable, with few incidents, your team needs to follow many more policies, procedures, and best practices.

This section aligns with the principles of the Twelve-Factor App methodology and DevOps. If you follow all of the best practices listed here, your system should be highly available and reliable, with few incidents.

Reliability Checklist

Version Control: Code and configuration should be tracked in a version control system.
Code Quality: Code should be automatically evaluated against defined quality metrics.
Separation of Duties: Code must be peer reviewed before it is accepted.
Configuration Management: It should be possible to reliably re-create your infrastructure from code.
CI/CD: You should have an automated way to build, test, and deploy your applications.
Testing: There should be ample automated tests of your system that run routinely.
Environment Parity: Your environments should be as similar as possible.
Application States and Signals: Your applications should expose health information and be able to gracefully shut down.
Boot and Shutdown Times: Applications should boot and shut down promptly so they can scale quickly.
Stateless: Your applications should ideally be stateless.
Deployment Patterns: You should be able to update your applications safely with zero downtime.
High Availability: Your critical applications should be replicated across availability zones and regions. They should automatically scale based on demand.
Load Balancing: Traffic to your services should be balanced among your replicas.
Monitoring: You should have ample means to investigate, troubleshoot, and observe your system. You should be alerted if it is unhealthy.
Security: Your system should follow industry best practices for security and implement defense in depth.
Documentation: Thoroughly document your code and services for fellow engineers and end users.
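
As one illustration of the Application States and Signals item in the checklist, here is a minimal sketch of a service that exposes health information and shuts down gracefully. It uses only the Python standard library. The `/healthz` path, the port 8080, the JSON bodies, and the helper names (`health_response`, `make_server`, `request_shutdown`) are assumptions made for this example, not conventions taken from this page. A real service would likely also check its dependencies and finish in-flight work before exiting.

```python
import json
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set once the process has been asked to stop. While it is set, the health
# endpoint reports 503 so a load balancer can stop sending new traffic.
shutting_down = threading.Event()


def health_response():
    """Return (status_code, body) describing the current application state."""
    if shutting_down.is_set():
        return 503, json.dumps({"status": "draining"}).encode()
    return 200, json.dumps({"status": "ok"}).encode()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":  # hypothetical path chosen for this sketch
            self.send_error(404)
            return
        status, body = health_response()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Create the server; port 0 asks the OS for any free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)


def request_shutdown(server):
    """Mark the app as draining, then stop the accept loop."""
    shutting_down.set()
    # shutdown() blocks until serve_forever() returns, so it must run in a
    # different thread from the one that is running serve_forever().
    threading.Thread(target=server.shutdown, daemon=True).start()


def main():
    server = make_server(8080)  # arbitrary port for the example
    # SIGTERM is what most orchestrators send before killing a process.
    signal.signal(signal.SIGTERM, lambda signum, frame: request_shutdown(server))
    signal.signal(signal.SIGINT, lambda signum, frame: request_shutdown(server))
    try:
        server.serve_forever()
    finally:
        server.server_close()  # release the listening socket
```

Calling `main()` starts the server. Sending the process SIGTERM (or pressing Ctrl+C) flips the health status to "draining" and then lets `serve_forever()` return, so the process exits cleanly instead of being killed mid-request.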