Collection of interesting talks on AWS security at re:Invent and re:Inforce 2019.
I wanted to get a better understanding of Firecracker microVM security, from the bottom up. A few questions:
a) How does the Firecracker design achieve a smaller threat surface than a typical VM or container?
b) What mechanisms are available to secure code running in a microVM?
c) Lastly, how can microVMs change security considerations when deploying code for web services?
The following design elements contribute to a smaller threat surface:
- minimal design, written in Rust, a memory-safe language, with a compact and readable codebase
- minimal guest virtual device model: a network device, a block I/O device, a timer, a KVM clock, a serial console, and a partial keyboard
- minimal networking; from docs/vsock.md : “The Firecracker vsock device aims to provide full virtio-vsock support to software running inside the guest VM, while bypassing vhost kernel code on the host. To that end, Firecracker implements the virtio-vsock device model, and mediates communication between AF_UNIX sockets (on the host end) and AF_VSOCK sockets (on the guest end).”
- static linking of the Firecracker process limits dependencies
- seccomp BPF limits the system calls to 35 allowed calls, 30 with simple filtering, 5 with advanced filtering that limits the call based on parameters (SeccompFilter::new call in vmm/src/default_syscalls/filters.rs, seccomp/src/lib.rs)
The production security setup recommends using the jailer to apply isolation based on cgroups, namespaces and seccomp. These techniques are typical of container isolation and act in addition to KVM-based isolation.
The Host Security Configuration recommends a series of checks to mitigate side-channel issues for a multi-tenant system:
- Disable Simultaneous Multithreading (SMT)
- Check Kernel Page-Table Isolation (KPTI) support
- Disable Kernel Same-page Merging (KSM)
- Check for speculative branch prediction issue mitigation
- Apply L1 Terminal Fault (L1TF) mitigation
- Apply Speculative Store Bypass (SSBD) mitigation
- Use memory with Rowhammer mitigation support
- Disable swapping to disk or enable secure swap
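Several of these checks can be read directly from sysfs and procfs on a Linux host. The following is a minimal sketch, not Firecracker's own tooling: the function names `host_report` and `findings` are made up for illustration, the `root` parameter exists only so the code can be pointed at a test directory, and the script covers just the SMT, KSM, CPU vulnerability and swap items from the list above.

```python
from pathlib import Path


def _read(path: Path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def host_report(root: str = "/") -> dict:
    """Collect side-channel related settings from sysfs/procfs under `root`."""
    base = Path(root)
    report = {
        # "off", "forceoff" or "notsupported" mean SMT is not active.
        "smt_control": _read(base / "sys/devices/system/cpu/smt/control"),
        # "1" means KSM is actively merging pages.
        "ksm_run": _read(base / "sys/kernel/mm/ksm/run"),
        "vulnerabilities": {},
        "swap_devices": [],
    }
    vuln_dir = base / "sys/devices/system/cpu/vulnerabilities"
    if vuln_dir.is_dir():
        # One file per issue (l1tf, meltdown, spec_store_bypass, ...).
        for entry in sorted(vuln_dir.iterdir()):
            report["vulnerabilities"][entry.name] = _read(entry)
    swaps = _read(base / "proc/swaps")
    if swaps:
        # The first line of /proc/swaps is a column header.
        report["swap_devices"] = [
            line.split()[0] for line in swaps.splitlines()[1:] if line.strip()
        ]
    return report


def findings(report: dict) -> list:
    """Turn a report into a list of human-readable warnings."""
    issues = []
    if report["smt_control"] not in ("off", "forceoff", "notsupported", "notimplemented"):
        issues.append("SMT is enabled or its state could not be read")
    if report["ksm_run"] == "1":
        issues.append("KSM is running")
    for name, status in report["vulnerabilities"].items():
        if status and status.startswith("Vulnerable"):
            issues.append(f"{name}: {status}")
    if report["swap_devices"]:
        issues.append("swap is enabled on: " + ", ".join(report["swap_devices"]))
    return issues


if __name__ == "__main__":
    for issue in findings(host_report()):
        print("WARNING:", issue)
```

Rowhammer-resistant memory and encrypted swap cannot be verified this way, so those items still need a manual check.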
How is the Firecracker process organized? docs/design.md has the following descriptions:
Internal Design: Each Firecracker process encapsulates one and only one microVM. The process runs the following threads: API, VMM and vCPU(s). The API thread is responsible for Firecracker’s API server and associated control plane. It’s never in the fast path of the virtual machine. The VMM thread exposes the machine model, minimal legacy device model, microVM metadata service (MMDS) and VirtIO device emulated Net and Block devices, complete with I/O rate limiting. In addition to them, there are one or more vCPU threads (one per guest CPU core). They are created via KVM and run the `KVM_RUN` main loop. They execute synchronous I/O and memory-mapped I/O operations on devices models.
Threat Containment: From a security perspective, all vCPU threads are considered to be running malicious code as soon as they have been started; these malicious threads need to be contained. Containment is achieved by nesting several trust zones which increment from least trusted or least safe (guest vCPU threads) to most trusted or safest (host). These trusted zones are separated by barriers that enforce aspects of Firecracker security. For example, all outbound network traffic data is copied by the Firecracker I/O thread from the emulated network interface to the backing host TAP device, and I/O rate limiting is applied at this point.
As to the question about mechanisms to secure code running in Firecracker, the serverless environment AWS Lambda and its security best practices are a good place to start. A few resources on this are here, here, here and here. AWS API Gateway supports input validation, as described here. However, while serverless reduces the attack surface, web threats such as those catalogued by OWASP still apply and must be taken into account during design and testing.
For the last question – uVMs and serverless appear to offer a promising model to build a service incrementally from small secure building blocks – and this is something to explore further.
Instead of the “inside” and “outside” notion of traditional firewalls and perimeter defense technologies, the Zero Trust Network notion has its origin in the Cloud+Mobile first world where a person carrying a mobile device can be anywhere in the world (inside/outside the enterprise) and needs to be seamlessly and securely connected to online services.
The essential idea appears to be device authentication coupled with a second factor in the form of an easy-to-remember password, with backend security smarts to identify the accessing device. More importantly, every service that is accessed externally needs to be authenticated, instead of some services being treated as internal and being less protected.
Some properties of zero trust networks:
- Network locality based access control is insufficient
- Every device, user and service is authenticated
- Policies are dynamic – they gather and utilize data inputs for making access control decisions
- Attacks from trusted insiders are mitigated
This is a big change from many networks which have network-based defense at the core (for good reason, as it was cost effective). To create a zero-trust network, a starting point is to identify, enumerate and sequence all network flows.
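The properties above can be illustrated with a toy policy decision function. This is a sketch under my own assumptions rather than any product's policy engine: the `AccessRequest` fields, the risk and sensitivity scores and the `decide` function are all hypothetical. The point is that network location is recorded but never used to grant access, and that the decision depends on dynamic inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user_authenticated: bool
    device_authenticated: bool
    service_authenticated: bool
    on_corporate_network: bool   # recorded, but deliberately not trusted
    risk_score: float            # 0.0 (benign) to 1.0 (hostile), from dynamic signals
    resource_sensitivity: float  # 0.0 (public) to 1.0 (critical)


def decide(req: AccessRequest) -> str:
    """Return "allow", "step_up" (ask for another factor) or "deny"."""
    # Every user, device and service must be authenticated, wherever they are.
    if not (req.user_authenticated
            and req.device_authenticated
            and req.service_authenticated):
        return "deny"
    # Dynamic policy: the more sensitive the resource, the less risk is tolerated.
    # The same rule applies to insiders, since locality earns no extra trust.
    allowed_risk = 1.0 - req.resource_sensitivity
    if req.risk_score > allowed_risk:
        return "step_up"
    return "allow"
```

A request from inside the corporate network with an unauthenticated device is denied, while a fully authenticated, low-risk request from anywhere is allowed.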
I attended a talk by Centrify on this topic, which resonated with experiences in cloud, mobile and fog systems.
Related effort in Kubernetes – Progress Toward Zero Trust Kubernetes Networks, Istio Service Mesh, API Gateway to Service Mesh. One can contrast the API gateway, which is present only at the ingress point of a cloud, with a Zero-trust/Service-mesh/Sidecar approach, where every microservice building block has its own external proxy and ‘API’ for management added to it. The latter would add to latency concerns for real-time applications, as the new sidecar proxies are in the data path. One benefit of the service mesh is a mechanism to put in place service-to-service security in a uniform manner.
The key original motivation behind Istio, in the second presentation by Lyft above, was greater observability and reliability across a complex cluster of microservices. This strikes me as a stronger motivating use case for this technology than added security. From the security point of view, there is a parallel between the Istio approach and the SDN problem statement of a horizontal and ubiquitous security layer. Greater visibility is also a motivation behind the P4 programming language, presented in a disaggregated storage talk on protocol independent switch architecture or PISA here – one of the things it enables is inband telemetry.
SCRAM is an interesting proposal (RFC 5802) that aims to avoid sending passwords across the wire. It does not appear to create additional requirements for certificates or shared secrets, so let’s see how it works.
The server needs to know the username in advance but does not store the password. Instead it stores keys derived from a salted, iterated hash of the password, along with the per-user salt and the iteration count, which it uses to create a challenge.
The client sends the username and a nonce. The server retrieves the salt and the iteration count for that user and sends them back, together with its own nonce, as a challenge. The client hashes the password with the agreed-upon hash function, using the salt and the iteration count in the calculation, and sends back a proof derived from the result rather than the hash itself. The server is able to validate the proof with the information it has stored. The server then sends back a signature which the client can check to validate the server.
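The exchange can be sketched with the Python standard library. This is a simplified illustration of the key derivations in RFC 5802 for SCRAM-SHA-1, not a complete implementation: it skips SASLprep normalization of the password, channel binding and message parsing, and the function names (`server_register`, `build_auth_message`, `client_proof`, `server_verify`) are my own.

```python
import base64
import hashlib
import hmac


def _hmac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha1).digest()


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def server_register(password: str, salt: bytes, iterations: int) -> dict:
    """What the server stores: derived keys, never the password itself."""
    # Hi() in the RFC is PBKDF2 with HMAC as the PRF.
    salted = hashlib.pbkdf2_hmac("sha1", password.encode(), salt, iterations)
    client_key = _hmac(salted, b"Client Key")
    return {
        "salt": salt,
        "iterations": iterations,
        "stored_key": hashlib.sha1(client_key).digest(),
        "server_key": _hmac(salted, b"Server Key"),
    }


def build_auth_message(username, client_nonce, server_nonce, salt, iterations) -> bytes:
    """Concatenate the three messages that both sides sign."""
    combined = client_nonce + server_nonce
    client_first_bare = f"n={username},r={client_nonce}"
    server_first = f"r={combined},s={base64.b64encode(salt).decode()},i={iterations}"
    client_final_without_proof = f"c=biws,r={combined}"  # "biws" is base64 of "n,,"
    return ",".join([client_first_bare, server_first, client_final_without_proof]).encode()


def client_proof(password: str, salt: bytes, iterations: int, auth_message: bytes):
    """Client side: return (proof to send, server signature to expect)."""
    salted = hashlib.pbkdf2_hmac("sha1", password.encode(), salt, iterations)
    client_key = _hmac(salted, b"Client Key")
    stored_key = hashlib.sha1(client_key).digest()
    client_signature = _hmac(stored_key, auth_message)
    expected_server_signature = _hmac(_hmac(salted, b"Server Key"), auth_message)
    return _xor(client_key, client_signature), expected_server_signature


def server_verify(record: dict, auth_message: bytes, proof: bytes):
    """Server side: return the server signature if the proof is valid, else None."""
    client_signature = _hmac(record["stored_key"], auth_message)
    recovered_client_key = _xor(proof, client_signature)
    if not hmac.compare_digest(hashlib.sha1(recovered_client_key).digest(),
                               record["stored_key"]):
        return None
    return _hmac(record["server_key"], auth_message)
```

The server recovers the client key from the proof only for the duration of the check, and the client compares the returned signature with the one it computed itself, which gives mutual authentication.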
There are several issues with it: the initial registration flow is left out; the requirements on the client and server to issue good nonces and maintain unique salts and iteration counts are high; and the server database itself must be kept secure, since an exfiltration could enable offline brute-force attacks. RFC 5802 also specifies SHA-1 (as SCRAM-SHA-1), which is considered weak; RFC 7677 later added SCRAM-SHA-256. The password is fixed, and an update method would need to be designed for a full system.
Still it is interesting as a way to remove passwords being sent over the wire.
The protocol is used in XMPP as a standard mechanism for authentication.
A MAC or Message Authentication Code protects the integrity and authenticity of a message by allowing verifiers to detect changes to the message content. It requires a key generation algorithm that produces a random secret key K, a signing algorithm which takes K and message M as input and produces a signature (tag) S, and a verifying algorithm which takes K, M and S as input and produces a binary decision to accept or reject the message. Unlike a digital signature, a MAC typically does not provide non-repudiation, because anyone who can verify a tag can also create one. It is also called a protected checksum. The sender and the recipient of M share the secret key K.
HMAC: So called Hashed-MAC because it uses a cryptographic hash function, such as one from the MD or SHA families, to create a MAC. The computed value is something only someone with the secret key can compute (sign) and check (verify). HMAC uses an inner key and an outer key to protect against length extension and collision attacks on simple MAC signature implementations. It is specified in RFC 2104. It is a type of Nested MAC (NMAC) where both inner and outer keys are derived from the same key, in a way that keeps the derived keys independent.
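To make the inner and outer keys concrete, here is a small sketch that builds HMAC-SHA256 by hand following the RFC 2104 construction. The function names are mine, and in real code one would simply call the standard `hmac` module.

```python
import hashlib
import hmac


def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    """HMAC as H((K xor opad) || H((K xor ipad) || M)), per RFC 2104."""
    block_size = 64  # SHA-256 processes input in 64-byte blocks
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()  # long keys are hashed first
    key = key.ljust(block_size, b"\x00")    # then padded to the block size
    inner_key = bytes(b ^ 0x36 for b in key)  # K xor ipad
    outer_key = bytes(b ^ 0x5C for b in key)  # K xor opad
    inner_hash = hashlib.sha256(inner_key + message).digest()
    return hashlib.sha256(outer_key + inner_hash).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(hmac_sha256_manual(key, message), tag)
```

The outer hash over a fixed-length inner digest is what blocks the length extension attack that affects a naive H(K || M) construction.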
HOTP: HMAC-based One Time Password. HOTP is based on an incrementing counter. The incrementing counter serves as the message M, and when run through the HMAC it produces pseudorandom bytes, which are truncated to a short decimal code that the receiving party can verify. The receiving party keeps a synchronized counter, so the message M=C does not need to be sent on the wire. RFC 4226.
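A short sketch of the HOTP computation from RFC 4226, using HMAC-SHA1 and dynamic truncation. The `verify_hotp` helper and its `window` parameter are my own naming for the look-ahead resynchronization the RFC describes.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Compute the HOTP code for a given counter value."""
    # The counter is the message: an 8-byte big-endian integer.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select an offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_hotp(secret: bytes, server_counter: int, candidate: str, window: int = 3):
    """Return the next counter value if the code matches within the window, else None."""
    for c in range(server_counter, server_counter + window + 1):
        if hmac.compare_digest(hotp(secret, c, len(candidate)), candidate):
            return c + 1  # resynchronize past the accepted counter
    return None
```

With the RFC 4226 test secret `b"12345678901234567890"`, counter 0 yields `755224` and counter 1 yields `287082`.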
TOTP: Time-based One-time Password Algorithm. TOTP combines a secret key with the current timestamp using a cryptographic hash function to generate a one-time password. Because network latency and out-of-sync clocks can result in the password recipient having to try a range of possible times to authenticate against, the timestamp typically increases in 30-second intervals. Here the requirement to keep the counter synchronized is replaced with time synchronization. RFC 6238.
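TOTP is HOTP with the counter replaced by the number of time steps since the Unix epoch. The sketch below repeats the HOTP function so that it is self-contained. The `drift_steps` parameter is my name for the range of adjacent time steps the verifier is willing to try.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP per RFC 4226 (HMAC-SHA1 with dynamic truncation)."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """TOTP: the counter is the number of `step`-second intervals since the epoch."""
    if at is None:
        at = time.time()
    return hotp(secret, int(at) // step, digits)


def verify_totp(secret: bytes, candidate: str, at=None, step: int = 30,
                drift_steps: int = 1) -> bool:
    """Accept codes from the current time step and a few neighbouring ones."""
    if at is None:
        at = time.time()
    current = int(at) // step
    for delta in range(-drift_steps, drift_steps + 1):
        counter = current + delta
        if counter < 0:
            continue
        if hmac.compare_digest(hotp(secret, counter, len(candidate)), candidate):
            return True
    return False
```

With the RFC 6238 SHA-1 test secret `b"12345678901234567890"`, `totp(secret, at=59, digits=8)` gives `94287082`.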
The primality testing idea for big numbers is to repeatedly check a condition satisfied by all primes, such as a^p mod p = a mod p, for different values of a, until a probability bound is met.
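A minimal sketch of this idea using exactly the condition above (the Fermat test). The function names are mine. One caveat, which is an addition of mine rather than part of the original note: Carmichael numbers such as 561 satisfy the condition for every a, so practical libraries use the stronger Miller-Rabin test, which has a per-round error bound.

```python
import random


def is_fermat_witness(a: int, n: int) -> bool:
    """True if a proves n composite, i.e. a^n mod n != a mod n."""
    return pow(a, n, n) != a % n


def is_probable_prime(n: int, rounds: int = 20, rng=None) -> bool:
    """Fermat test: primes always pass; most composites fail quickly."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    rng = rng or random.Random()
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)  # random base in [2, n - 2]
        if is_fermat_witness(a, n):
            return False  # definitely composite
    return True  # probably prime (or a Carmichael number)
```

For example, base 2 is a witness for 91 = 7 × 13, while no base is a witness for 561 = 3 × 11 × 17.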
When implementing TouchID for an enterprise authentication solution, there are some interesting, non-obvious attack vectors to consider.
There are differences in requirements between COPE (corporate-owned, personally enabled) and BYOD (bring your own device) deployments, for instance.
Depending on the type of deployment and the type of data accessed, the security required may call for (a) a simple TouchID-based “user presence check”, without a password being stored or retrieved, or (b) a password to be stored in the enclave and retrieved, or (c) TouchID to be combined with another factor for a multi-factor authentication solution.
Some drawbacks of the initial TouchID implementation for enterprise use cases were discussed here. There is now a developer API available which allows more flexibility in implementing a solution for the enterprise.