Consider a disaster scenario in which a system loses its in-transit data (i.e. data not yet persisted) at a certain point in time. Some time after this point, a recovery process kicks in and restores the system to normal functioning.
Recovery Point Objective (RPO) refers to the amount of tolerable data loss, measured in time. It can be measured in time because in-transit data arrives at some bounded maximum rate, so bounding the time also bounds the amount of data that can be lost. The RPO determines how frequently the data must be persisted and replicated, i.e. the frequency of backups, snapshots or transaction log saves. An RPO of 10 minutes implies the data must be backed up at least every 10 minutes, so that after a crash the system can be restored to a point no more than 10 minutes before the crash.
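The arithmetic behind the 10-minute example can be sketched as follows. This is a minimal illustration, not a sizing tool; the function names `data_loss_window` and `meets_rpo` and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, crash_time):
    """Time between the last backup taken at or before the crash and the crash itself."""
    earlier = [t for t in backup_times if t <= crash_time]
    if not earlier:
        raise ValueError("no backup exists before the crash")
    return crash_time - max(earlier)


def meets_rpo(backup_interval, rpo):
    """A fixed backup interval can honor the RPO only if it is no longer than the RPO."""
    return backup_interval <= rpo


# Hypothetical schedule: a backup every 10 minutes from 12:00 to 12:50.
start = datetime(2024, 1, 1, 12, 0)
backups = [start + timedelta(minutes=10 * i) for i in range(6)]
crash = datetime(2024, 1, 1, 12, 37)

loss = data_loss_window(backups, crash)
print(loss)  # 0:07:00
print(meets_rpo(timedelta(minutes=10), rpo=timedelta(minutes=10)))  # True
```

The loss here is 7 minutes because the last backup before the 12:37 crash ran at 12:30. In the worst case (a crash just before the next backup) the loss approaches the full backup interval, which is why the interval must not exceed the RPO.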
Recovery Time Objective (RTO) refers to the maximum acceptable amount of time to restore a system to normal behavior after a disaster has happened. This includes restoration of all infrastructure components that provide a service, not just the restoration of data.
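Lower RPO/RTO means higher cost: tighter objectives need more frequent backups, more replication and more standby infrastructure.
A matrix of RPO (high/low) vs RTO (high/low) can be used to categorize applications.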
Low RPO, Low RTO. Critical online application like a storefront.
Low RPO, High RTO. Data sensitive application but not online, like analytics.
High RPO, Low RTO. Redundantly available data or no data. Compute clusters that are highly available.
High RPO, High RTO. Non-prod systems – dev/test/qa ?
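The four quadrants above can be expressed as a small classification function. This is only a sketch: the cut-offs between "low" and "high" (15 minutes for RPO, 1 hour for RTO) are arbitrary assumptions chosen for illustration, and `categorize` is a hypothetical name.

```python
from datetime import timedelta


def categorize(rpo, rto,
               rpo_threshold=timedelta(minutes=15),  # assumed cut-off, not a standard
               rto_threshold=timedelta(hours=1)):    # assumed cut-off, not a standard
    """Place an application in one quadrant of the RPO/RTO matrix."""
    rpo_level = "Low" if rpo <= rpo_threshold else "High"
    rto_level = "Low" if rto <= rto_threshold else "High"
    return f"{rpo_level} RPO, {rto_level} RTO"


# Illustrative objectives only; real values come from the business.
print(categorize(timedelta(minutes=1), timedelta(minutes=5)))   # Low RPO, Low RTO
print(categorize(timedelta(minutes=5), timedelta(hours=24)))    # Low RPO, High RTO
print(categorize(timedelta(hours=24), timedelta(minutes=5)))    # High RPO, Low RTO
print(categorize(timedelta(hours=24), timedelta(hours=48)))     # High RPO, High RTO
```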
The amount of acceptable data loss is bounded by the criticality of the app (or of its data?): the more critical it is, the less loss is tolerable.
One can expect a pyramid of apps: a large number with low criticality and a small number with high criticality.
Repeatability. Backup and recovery procedures must be written down, tested, and automated.
HA/DR spectrum of solutions:
- Backups, save transaction logs
- Snapshots
- Replication – synchronous, asynchronous
- Storage only vs in-memory as well. Application level crash consistency of backups.
- Multiple AZs
- Hybrid
Tech: S3 versioning, DynamoDB (DDB) Streams, DynamoDB global tables.
Rules of thumb:
- test full recovery regularly, at least once a year.
- backup, backup, backup
- 3-2-1 Rule. https://us-cert.cisa.gov/sites/default/files/publications/data_backup_options.pdf
- Keep at least 3 copies of data in at least 2 media types, and at least one off-site backup.
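The 3-2-1 rule is easy to check mechanically against an inventory of copies. The sketch below assumes a hypothetical `Copy` record with a media type and an off-site flag; the inventory entries are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str      # label for the copy (hypothetical)
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies):
    """At least 3 copies, on at least 2 media types, with at least 1 off-site."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


inventory = [
    Copy("primary", "disk", offsite=False),
    Copy("local-backup", "tape", offsite=False),
    Copy("remote-backup", "object-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False: only 2 copies, none off-site
```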
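Related terms: RPA (Recovery Point Actual) and RTA (Recovery Time Actual), i.e. the values actually achieved in a recovery test or a real incident, as opposed to the objectives.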
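Taking RPA and RTA to mean the recovery point and recovery time actually achieved (an assumption about the abbreviations above), they can be computed from three timestamps recorded during a recovery drill and compared with the objectives. The function names and timestamps below are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_actuals(last_recoverable_point, disaster_time, service_restored_time):
    """Return (RPA, RTA) as timedeltas measured from the moment of the disaster."""
    rpa = disaster_time - last_recoverable_point  # how much data, in time, was lost
    rta = service_restored_time - disaster_time   # how long the service was down
    return rpa, rta


def objectives_met(rpa, rta, rpo, rto):
    """Both actuals must be within their objectives."""
    return rpa <= rpo and rta <= rto


rpa, rta = recovery_actuals(
    last_recoverable_point=datetime(2024, 1, 1, 11, 52),
    disaster_time=datetime(2024, 1, 1, 12, 0),
    service_restored_time=datetime(2024, 1, 1, 13, 30),
)
print(rpa, rta)  # 0:08:00 1:30:00
print(objectives_met(rpa, rta, rpo=timedelta(minutes=10), rto=timedelta(hours=1)))  # False
```

In this made-up drill the recovery point is within a 10-minute RPO, but the 90-minute restoration misses a 1-hour RTO, which is the kind of gap a regular recovery test is meant to surface.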
3 types of disasters:
- Natural disaster – e.g. floods, earthquakes, fire
- Technical failure – e.g. loss of power, cable pulled
- Human error – e.g. delete all files as admin
Replication works for the first two. Continuous snapshots, backups or versioning are needed for the last one: replication will simply propagate the deletion, so the data is gone on both sides. One needs the ability to go back in time and restore data.
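The difference can be shown with a toy in-memory model. This is not how any particular storage service is implemented; `ReplicatedStore` and `VersionedStore` are hypothetical classes that only illustrate why a replicated delete is unrecoverable while a versioned one is not.

```python
_DELETED = object()  # marker recorded in history when a key is deleted


class ReplicatedStore:
    """Every write, including a delete, is applied to both sides."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value

    def delete(self, key):
        self.primary.pop(key, None)
        self.replica.pop(key, None)  # the mistake is replicated too


class VersionedStore:
    """Keeps every version of a key; a delete only appends a marker."""

    def __init__(self):
        self._history = {}

    def put(self, key, value):
        self._history.setdefault(key, []).append(value)

    def delete(self, key):
        self._history.setdefault(key, []).append(_DELETED)

    def get(self, key):
        versions = self._history.get(key, [])
        if not versions or versions[-1] is _DELETED:
            return None
        return versions[-1]

    def restore_previous(self, key):
        """Make the most recent non-deleted version current again."""
        for value in reversed(self._history.get(key, [])):
            if value is not _DELETED:
                self._history[key].append(value)
                return value
        raise KeyError(key)


replicated = ReplicatedStore()
replicated.put("report.csv", "v1")
replicated.delete("report.csv")           # human error
print("report.csv" in replicated.replica)  # False: gone on both sides

versioned = VersionedStore()
versioned.put("report.csv", "v1")
versioned.delete("report.csv")            # same human error
print(versioned.get("report.csv"))        # None
versioned.restore_previous("report.csv")
print(versioned.get("report.csv"))        # v1
```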
Cost – how to optimize cost of infrastructure and its maintenance.
Which region to choose? Key considerations: which types of disasters are the concern (Risk), how much proximity is needed to end customers and to the primary region (Performance), and what the region costs (Cost).