January 2021 – Secure Machinery

Consider a disaster scenario in which a system loses its in-transit data (i.e. data not yet persisted) at a certain point in time. Some time after this point, a recovery process kicks in and restores the system to normal functioning.

Recovery Point Objective (RPO) refers to the amount of tolerable data loss, measured in time. It can be measured in time because in-transit data arrives at some maximum velocity, so bounding the time bounds the amount of data that can be lost. The RPO is used to determine how frequently the data must be persisted and replicated. An RPO of 10 minutes implies the data must be backed up at least every 10 minutes, so that after a crash the system can be restored to a point no more than 10 minutes before the crash. RPO determines the frequency of backups, snapshots or transaction logs.
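The arithmetic above can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the function names and the 2,000 bytes/second velocity are made up for the example, and the check ignores the time a backup itself takes to complete.

```python
from datetime import timedelta


def max_data_loss_bytes(rpo: timedelta, max_velocity_bytes_per_s: float) -> float:
    """Upper bound on lost in-transit data when the newest persisted copy is at most `rpo` old."""
    return rpo.total_seconds() * max_velocity_bytes_per_s


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A periodic backup can only satisfy the RPO if it runs at least once per RPO window."""
    return backup_interval <= rpo


rpo = timedelta(minutes=10)
# Assumed peak ingest rate of 2,000 bytes/second (hypothetical figure).
print(max_data_loss_bytes(rpo, 2_000))
print(meets_rpo(timedelta(minutes=5), rpo))
print(meets_rpo(timedelta(minutes=15), rpo))
```

In practice the backup interval should sit comfortably below the RPO, since a backup that starts on time may still finish late.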

Recovery Time Objective (RTO) refers to the maximum acceptable amount of time to restore a system to normal behavior after a disaster has happened. This includes restoration of all infrastructure components that provide a service, not just the restoration of data.

A lower RPO or RTO costs more to achieve.

A matrix of RPO (high/low) vs RTO (high/low) can be used to categorize applications.

Low RPO, Low RTO. Critical online application like a storefront.

Low RPO, High RTO. Data sensitive application but not online, like analytics.

High RPO, Low RTO. Redundantly available data or no data. Compute clusters that are highly available.

High RPO, High RTO. Non-prod systems – dev/test/qa ?
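One way to apply the matrix is to bucket each application by comparing its objectives against two cut-offs. The sketch below is illustrative only: the one-hour and four-hour thresholds and the `categorize` helper are assumptions for the example, not values from any standard.

```python
from datetime import timedelta

# Hypothetical cut-offs separating "low" from "high"; choose values that suit the organisation.
RPO_THRESHOLD = timedelta(hours=1)
RTO_THRESHOLD = timedelta(hours=4)

# Example application types for each quadrant, taken from the matrix above.
QUADRANTS = {
    ("low", "low"): "critical online application, e.g. a storefront",
    ("low", "high"): "data-sensitive but offline application, e.g. analytics",
    ("high", "low"): "redundant or no data, e.g. highly available compute clusters",
    ("high", "high"): "non-prod systems, e.g. dev/test/qa",
}


def categorize(rpo: timedelta, rto: timedelta) -> tuple:
    """Return the (RPO level, RTO level) quadrant for an application."""
    rpo_level = "low" if rpo <= RPO_THRESHOLD else "high"
    rto_level = "low" if rto <= RTO_THRESHOLD else "high"
    return rpo_level, rto_level


quadrant = categorize(rpo=timedelta(minutes=5), rto=timedelta(minutes=30))
print(quadrant, "->", QUADRANTS[quadrant])
```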

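The amount of acceptable data loss is inversely related to the criticality of the app (or of its data?): the more critical the app, the less loss is tolerable.
One can expect a pyramid of apps: a large number with low criticality and a small number with high criticality.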

Repeatability. Backup and recovery procedures must be written down, must be tested and, ideally, automated.

HA/DR spectrum of solutions:

Backups, save transaction logs
Snapshots
Replication – synchronous, asynchronous
Storage only vs in-memory as well. Application level crash consistency of backups.
Multiple AZs
Hybrid

Tech: S3 versioning, DynamoDB (DDB) Streams, DynamoDB global tables.

Rules of thumb:

test full recovery regularly, at least once a year.
backup, backup, backup
3-2-1 Rule. https://us-cert.cisa.gov/sites/default/files/publications/data_backup_options.pdf
Keep at least 3 copies of data in at least 2 media types, and at least one off-site backup.
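The 3-2-1 rule is easy to check mechanically against a backup inventory. The sketch below assumes a hypothetical `BackupCopy` record; the field names and the sample inventory are invented for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BackupCopy:
    name: str      # label for this copy (hypothetical field)
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies: Sequence[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


inventory = [
    BackupCopy("primary", "disk", offsite=False),
    BackupCopy("local-backup", "tape", offsite=False),
    BackupCopy("remote-backup", "object-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))
```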

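Related terms: RPA (Recovery Point Actual) and RTA (Recovery Time Actual), the values actually achieved in a real incident or a recovery test, as opposed to the objectives.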
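Taking RPA and RTA to mean Recovery Point Actual and Recovery Time Actual, they can be derived from three timestamps recorded during an incident or a recovery drill and then compared with the objectives. The function name and the sample timestamps below are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_actuals(last_good_backup: datetime,
                     failure_time: datetime,
                     service_restored: datetime) -> tuple:
    """Return (RPA, RTA): the data-loss window and the downtime actually observed."""
    rpa = failure_time - last_good_backup   # data written in this window is lost
    rta = service_restored - failure_time   # how long the service was down
    return rpa, rta


# Sample timestamps from an imagined recovery drill.
rpa, rta = recovery_actuals(
    last_good_backup=datetime(2021, 1, 15, 9, 52),
    failure_time=datetime(2021, 1, 15, 10, 0),
    service_restored=datetime(2021, 1, 15, 11, 30),
)

rpo, rto = timedelta(minutes=10), timedelta(hours=1)
print("RPA", rpa, "within RPO:", rpa <= rpo)
print("RTA", rta, "within RTO:", rta <= rto)
```

A drill where the RPA fits the RPO but the RTA overshoots the RTO, as in this sample, points at the restore procedure rather than the backup schedule.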

3 types of disasters.

Natural disaster – e.g. floods, earthquakes, fire
Technical failure – e.g. loss of power, cable pulled
Human error – e.g. delete all files as admin

Replication works for the first two. Continuous snapshots, backups or versioning are needed for the last one, because replication will simply propagate the deletion to both sides. What is needed is the ability to go back in time and restore data.
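The difference can be shown with a toy model. The two classes below are in-memory stand-ins invented for this example (they do not model any real storage product): one mirrors every operation to a replica, the other keeps a version history.

```python
class ReplicatedStore:
    """Every write and delete is mirrored to a replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value

    def delete(self, key):
        # The mistaken delete is faithfully replicated too.
        self.primary.pop(key, None)
        self.replica.pop(key, None)


class VersionedStore:
    """Keeps every version of a key; None in the history marks a deletion."""

    def __init__(self):
        self._history = {}

    def put(self, key, value):
        self._history.setdefault(key, []).append(value)

    def delete(self, key):
        self._history.setdefault(key, []).append(None)

    def get(self, key):
        versions = self._history.get(key, [])
        return versions[-1] if versions else None

    def restore_previous(self, key):
        """Re-append the most recent non-deleted version and return it."""
        for value in reversed(self._history.get(key, [])):
            if value is not None:
                self._history[key].append(value)
                return value
        return None


replicated = ReplicatedStore()
replicated.put("report", "v1")
replicated.delete("report")            # human error
print("report" in replicated.replica)  # the replica lost it as well

versioned = VersionedStore()
versioned.put("report", "v1")
versioned.delete("report")             # same human error
print(versioned.restore_previous("report"))
```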

Cost – how to optimize cost of infrastructure and its maintenance.

Which region to choose? Key considerations: which types of disasters are a concern (risk), how much proximity is needed to end customers and to the primary region (performance), and what the region costs (cost).

SolarWinds makes software for managing networks and infrastructure. Its Orion software was the target of an advanced cyberattack in 2020. Attackers gained administrative access to the certificates used to sign SAML tokens. These certificates were then used to forge new tokens, giving the attackers highly privileged access to networks.

Attackers may have compromised internal build or distribution systems of SolarWinds, embedding backdoor code into a legitimate SolarWinds library with the file name SolarWinds.Orion.Core.BusinessLayer.dll. This backdoor could then be distributed via automatic updates in target networks.

The malicious DLL called out to remote network infrastructure using the domain avsvmcloud.com to prepare possible second-stage payloads, move laterally in the organization, and compromise or exfiltrate data.

The Cybersecurity and Infrastructure Security Agency (CISA) issued Emergency Directive 21-01 in response to the incident, directing all federal civilian agencies to disconnect or power down Orion.

Figure: SolarWinds Sunburst attack network paths.

Ref. https://web.archive.org/web/20201220053318/https://msrc-blog.microsoft.com/2020/12/13/customer-guidance-on-recent-nation-state-cyber-attacks/

Disaster Recovery: Understanding and designing for RPO and RTO