Category: security

Kubernetes security

Kubernetes is a Platform-as-a-Service (PaaS) built on top of (Docker) containers, with an additional unit of abstraction called a pod, which a) is its smallest unit of execution, b) has a single IP address, c) is a group of one or more containers, where d) the containers in the group share a network namespace and e) each pod is isolated from other pods by its network namespace. Within a pod, containers can reach each other on different ports over the loopback interface; within an instance, different pods see each other as different IP addresses. The control plane is built on top of etcd, a consistent, distributed, highly available key-value store, which is an independent open-source CNCF project.

It is conceptually similar to Cloud Foundry, Mesos, OpenStack, Mirantis and similar abstraction layers.

A threat matrix for Kubernetes by MS – https://www.microsoft.com/security/blog/2020/04/02/attack-matrix-kubernetes/

From RSA’20, here’s a talk on ‘The future of Kubernetes attacks’ – https://youtu.be/CH7S5rE3j8w

From Coinbase, a blog on ‘Why Kubernetes is not part of our stack’ – https://blog.coinbase.com/container-technologies-at-coinbase-d4ae118dcb6c – makes these points:

  • it needs a full-time compute team to maintain
  • securing it is neither trivial nor well understood. SPIFFE, SPIRE, Envoy, Istio, OPA and service mesh are a few of the technologies involved.

This blog links to – https://k8s.af/

Another viewpoint – https://pythonspeed.com/articles/dont-need-kubernetes/

A counterpoint to the Coinbase blog – https://blog.kumina.nl/2020/07/in-response-to-container-technologies-at-coinbase/

Scratch notes:

K8S is based on a Controller pattern:

  • Resources capture the desired state
  • Current state is kept centralized in etcd, a distributed key-value store, similar to Consul
  • Controllers reconcile current state with desired state
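
The reconcile loop at the heart of a controller can be sketched as a toy in Python – plain variables stand in for the API server and etcd here, this is not any client library's actual interface:

import time

desired_replicas = 3   # desired state, as captured in the resource spec
current_replicas = 2   # pretend observed state

def reconcile():
    global current_replicas
    if current_replicas != desired_replicas:
        # converge: create or delete pods to close the gap
        current_replicas = desired_replicas
        print("converged to", desired_replicas)

while True:
    reconcile()
    time.sleep(5)  # real controllers use watches/informers rather than polling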

Pod is a top-level resource, the smallest deployment unit: a group of one or more containers described by a yaml file, similar to docker-compose.yml.

A K8S Operator is a kind of resource manager for Custom Resources.

https://blog.frankel.ch/your-own-kubernetes-controller/1/

https://pushbuildtestdeploy.com/when-do-kubernetes-operators-make-sense

Spinnaker is a Continuous Delivery platform that itself runs on k8s as a set of pods, which can be scaled up.

A kubectl cheat sheet:

https://kubernetes.io/docs/reference/kubectl/cheatsheet

An article on cloud security – https://medium.com/xm-cyber/having-fun-with-cloud-services-e281f8a7fe60 – which I think makes the point of why things are relatively complex to begin with.

One comes across the terms helm and helm charts. Helm is a way to package a complex k8s application. This adds a layer of indirection to an app – https://stepan.wtf/to-helm-or-not/.

A repo to list failing pods – https://github.com/edrevo/suspicious-pods

Exploring networking in k8s – https://dustinspecker.com/posts/how-do-kubernetes-and-docker-create-ip-addresses/

Plugin for Pod networking on EKS using ENIs – https://github.com/aws/amazon-vpc-cni-k8s

Hardening EKS with IAM, RBAC – https://snyk.io/blog/hardening-aws-eks-security-rbac-secure-imds-audit-logging/

EKS Authentication with IAM – how does it work?

IAM is only used for authentication of valid IAM entities. All permissions for interacting with the EKS Kubernetes API are managed through the native Kubernetes RBAC system.

AWS IAM Authenticator for EKS is a component that enables access to EKS via IAM, for provisioning, managing, updating the cluster. It runs on the EKS Control Plane – https://github.com/kubernetes-sigs/aws-iam-authenticator#aws-iam-authenticator-for-kubernetes

A k8s ConfigMap is used to store non-confidential data in key-value pairs.

The above authenticator gets its configuration from the aws-auth ConfigMap. This ConfigMap can be edited via eksctl (recommended) or edited directly.

A Kubernetes service account provides an identity for processes that run in a pod. For more information see Managing Service Accounts in the Kubernetes documentation. If your pod needs access to AWS services, you can map the service account to an AWS Identity and Access Management identity to grant that access.

RSA World 2020 – QRNG based crypto keys

Creating my first quantum crypto keys: the QC hardware provides the entropy source for generation of a key, which is encapsulated using one of several available mechanisms. The key encapsulation mechanism chosen was Classic McEliece. Keys were emailed to me after generation at the Cambridge Quantum Computing booth at RSA World 2020.

Quantum-Proof Cryptography with IronBridge, TKET and Amazon Braket

Cambridge Quantum Computing has developed what it describes as the first provable QRNG, known as IronBridge, which uses quantum computers to generate unbiased private entropy. IronBridge generates cryptographic keys using this entropy, resulting in quantum-proof keys (for both classical and post-quantum algorithms).

Their paper on “Practical randomness and privacy amplification” – https://arxiv.org/pdf/2009.06551.pdf

Links on McEliece, Goppa codes, FrodoKEM and PQC –

https://en.wikipedia.org/wiki/McEliece_cryptosystem

https://en.wikipedia.org/wiki/Binary_Goppa_code

FrodoKEM: https://eprint.iacr.org/2018/686.pdf – In 2016, Bos et al. proposed the key exchange scheme FrodoCCS, which was also submitted to the NIST post-quantum standardization process, modified into a key encapsulation mechanism (FrodoKEM). The security of the scheme is based on standard lattices and the learning with errors problem.

PQ Crypto Catalog: https://github.com/kriskwiatkowski/pqc has implementations of quantum-safe signature and KEM schemes submitted to the NIST PQC standardization process.

Accuracy vs Recall vs Precision vs F1 in Machine Learning

We want to walk through some common metrics in classification problems – such as accuracy, precision and recall – to get a feel for when to use which metric. Say we are looking for a needle in a haystack. There are very few needles in a large haystack full of straws. An automated machine is sifting through the objects in the haystack and predicting for each object whether it is a straw or a needle. A reasonable predictor will predict a small number of objects as needles and a large number as straws. A prediction has two attributes – positive/negative and accurate/inaccurate.

Positive Prediction: the object at hand is predicted to be the needle. A small number.

Negative Prediction: the object at hand is predicted not to be a needle. A large number.

True_Positive: of the total number of predictions, the number of predictions that were positive and correct. Correctly predicted Positives (needles). A small number.

True_Negative: of the total number of predictions, the number of predictions that were negative and correct. Correctly predicted Negatives (straws). A large number.

False_Positive: of the total number of predictions, the number of predictions that were positive but incorrect. Incorrectly predicted Positives (straw predicted as needle). This could in principle be large, as the number of straws is large, but it is bounded above by the total number of predicted needles, which we assumed is small – hence small.

False_Negative: of the total number of predictions, the number of predictions that were negative but incorrect. Incorrectly predicted Negatives (needle predicted as straw). Is this a large number? It is unknown – this class is not large just because the class of negatives is large; it depends on the predictor, and a “reasonable” predictor which predicts most objects as straws could also predict many needles as straws. Still, it is bounded above by the total number of needles, hence small.

Predicted_Positives = True_Positives + False_Positives = Total number of objects predicted as needles.

Actual_Positives = Actual number of needles, which is independent of the number of predictions either way; however, Actual_Positives = True_Positives + False_Negatives.

Accuracy = nCorrect_Predictions / nTotal_Predictions = (nTrue_Positives + nTrue_Negatives) / (nPredicted_Positives + nPredicted_Negatives).   # The “reasonable predictor” assumption above is equivalent to high accuracy: most predictions will be straw, and will be correct simply because of the skewed distribution. This sheds no light on FP or FN.

Precision = nTrue_Positives / nPredicted_Positives    # correctly_identified_needles / predicted_needles; this sheds light on FP. Precision = 1 => FP = 0 => all predictions of needles are in fact needles; a precision less than 1 means we got some straw along with the needles – which gives hope that with further sifting the straw can be removed. Precision is also called Positive Predictive Value; it quantifies the absence of False Positives, i.e. incorrect diagnoses.

Recall = nTrue_Positives / nActual_Positives = TP/(TP+FN)    # correctly_identified_needles / all_needles; this sheds light on FN. Recall = 1 => FN = 0; a recall less than 1 is bad, as some needles are left out in the sifting process. Recall is also called Sensitivity.

Precision > Recall => FN is higher than FP

Precision < Recall => FN is lower than FP

If at least one needle is correctly identified as a needle, both precision and recall will be positive; if zero needles are correctly identified, both precision and recall are zero.

F1 Score is the harmonic mean of Precision and Recall: 1/F1 = (1/2)(1/P + 1/R), i.e. F1 = 2PR/(P+R). F1 = 0 if P = 0 or R = 0; F1 = 1 if P = 1 and R = 1.

ROC/AUC rely on Recall (= TP/(TP+FN)) and another metric, the False Positive Rate, defined as FP/(FP+TN) = straw_falsely_identified_as_needles / total_straw. Since TN >> FP, this stays close to zero regardless, and so does not appear to be a useful metric in the context of needles in a haystack – nor, therefore, do ROC/AUC. Note that the denominators differ between Recall and FPR: total needles and total straw, respectively.
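
Putting numbers on all of the above, a small worked example in Python (made-up counts: 1,000 objects, 10 of them needles):

tp, fp, fn = 6, 2, 4      # 8 objects predicted as needles, 6 correctly; 4 needles missed
tn = 1000 - tp - fp - fn  # 988 straws correctly left alone

accuracy  = (tp + tn) / (tp + tn + fp + fn)                # 0.994 despite missing 4 of 10 needles
precision = tp / (tp + fp)                                 # 6/8  = 0.75
recall    = tp / (tp + fn)                                 # 6/10 = 0.60
f1        = 2 * precision * recall / (precision + recall)  # ~0.667
fpr       = fp / (fp + tn)                                 # 2/990 ~ 0.002, near zero as argued above
print(accuracy, precision, recall, f1, fpr)

The high accuracy and near-zero FPR both come from the huge TN count, while precision, recall and F1 expose the actual quality of the needle detection.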

There’s a bit of semantic confusion in saying True Positive or False Positive. These shorthands can be read as: it was known that an instance was a Positive, and a label of True or False was applied to that instance. But what we actually mean is that it was not known whether the instance was a Positive; a determination was made that it was a Positive, and this determination was later found to be correct (True) or incorrect (False). Mentally replace True/False with ‘Correctly/Incorrectly identified as’ to remove this confusion.

Normalization: scale of 0-1, or unit norm; useful for dot products when calculating similarity.

Standardization: zero mean, divided by standard deviation; useful in neural network/classifier inputs
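
A quick numpy sketch of the two (illustrative values):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 10.0])

x_normalized   = x / np.linalg.norm(x)     # unit norm, for dot-product similarity
x_standardized = (x - x.mean()) / x.std()  # zero mean, unit variance, for classifier inputs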

Regularization: penalizes large weights, reducing sensitivity to individual features and overfitting. With an L1 penalty: Lasso regression; with an L2 penalty: Ridge regression.

Confusion matrix: holds the counts of predicted values vs known truth. A square matrix with size n equal to the number of categories.
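
For the binary needle/straw case it is the familiar 2x2 layout (rows = actual, columns = predicted):

                  predicted needle    predicted straw
actual needle           TP                  FN
actual straw            FP                  TN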

Bias, variance and their tradeoff: we want both to be low. When going from a simple model to a complex one, one often goes from a high-bias to a high-variance scenario. https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

Lacework Intrusion Detection System – Cloud IDS

Lacework Polygraph is a Host based IDS for cloud workloads. It provides a graphical view of who did what on which system, reducing the time for root cause analysis for anomalies in system behaviors. It can analyze workloads on AWS, Azure and GCP.

It installs a lightweight agent on each target system, which aggregates information from processes running on the system into a centralized, customer-specific (MT) data warehouse (Snowflake on AWS), then analyzes the information using machine learning to generate behavioral profiles and looks for anomalies from the baseline profile. The design allows automating the analysis of common attack scenarios involving ssh, privilege changes, and unauthorized access to files.

The host-based model gives detailed process information, such as which process talked to which other process and over what API. This information is not available to a network IDS. The behavior profiles reduce false positive rates, and the graphical view is useful to drill down into incidents.

OSQuery is a tool for gathering data from hosts, and this is a source of data aggregated for threat detection. https://www.rapid7.com/blog/post/2016/05/09/introduction-to-osquery-for-threat-detection-dfir/

Here’s an agent for libpcap https://github.com/lacework/pcap

It does not have intrusion prevention (IPS) functionality. False positives in an IPS could block network/host access and negatively affect the system being protected, so it is a harder problem.

Cloud based network isolation tools like Aviatrix might make IPS scenarios feasible by limiting the effect of an IPS.

Software Integrity Tools

There are a number of tools used to detect security issues in a software application codebase. A simple and free one is flawfinder. A sophisticated commercial one is Veracode. There are also lint, pylint, findbugs for Java, and the Xcode Clang static analyzer.

Synopsys has acquired tools like Coverity and Black Duck for various static checks on code and binaries. Black Duck can do binary analysis and scores issues with CVSS. A common use of Black Duck is license checking, to check for conformance to open-source licenses.

A more comprehensive list of static code analysis tools is here.

Dynamic analysis tools inspect the running process and find memory and execution errors. Well known examples are valgrind and Purify. More dynamic tools are listed here.

For web application security there are protocol testing and fuzzing tools like Burp Suite and Tenable Nessus.

A common issue with these tools is false positives. It helps to limit the testing to certain defect types or attack scenarios and identify the most critical issues first, then expand the scope to more types of defects.

Code obfuscation and anti-tamper are another line of tools, offered for example by Arxan, Klocwork, Irdeto and ProGuard.

A great talk on Adventures in fuzzing. My takeaway has been that better ways of developing secure software are really important.

SGX vs TPM vs SE – security in silicon

Hardware root of trust, Hardware security primitives

PUF https://en.m.wikipedia.org/wiki/Physical_unclonable_function

Specialized crypto ops vs small open/generic TCB.

VHDL libraries

Chain of trust, up to firmware and software layer. Level of integration vs modularity.

TouchID with SE. Samsung. YubiKey, Gemalto.

Electronic logging devices. ELD tamper proofing

Secure provisioning. Intel EPID

Rust, memory safety, cargo, toml, policy engine, Fortanix, Azure IoT

Key management with enclaves

Trusted VMs and cloud security

Platform security 

Isolated memory on shared infra

Zero Trust Networks

Instead of the “inside” and “outside” notion of traditional firewalls and perimeter defense technologies, the Zero Trust Network notion has its origin in the cloud-first, mobile-first world, where a person carrying a mobile device can be anywhere in the world (inside or outside the enterprise) and needs to be seamlessly and securely connected to online services.

The essential idea appears to be device authentication coupled with a second factor in the shape of an easy-to-remember password, with backend security smarts to identify the accessing device. More importantly, every service that is accessed externally needs to be authenticated, instead of some services being treated as internal services and being less protected.

Some properties of zero trust networks:

  • Network locality based access control is insufficient
  • Every device, user and service is authenticated
  • Policies are dynamic – they gather and utilize data inputs for making access control decisions
  • Attacks from trusted insiders are mitigated against
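
In code terms, each request is evaluated afresh against identity and context rather than network location. A minimal sketch in Python – the helpers below are hypothetical stand-ins for a device CA check, an SSO check and a policy engine, not any specific product's API:

# hypothetical helpers, trivially stubbed so the sketch runs
def device_authenticated(cert): return cert == "valid-device-cert"
def user_authenticated(token): return token == "valid-user-token"
def policy_allows(user, service, context): return context.get("mfa", False)

def allow(request):
    # no trusted subnet: every request must prove device identity and
    # user identity, then pass a dynamic, context-aware policy check
    return (device_authenticated(request["device_cert"])
            and user_authenticated(request["user_token"])
            and policy_allows(request["user"], request["service"], request["context"]))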

This is a big change from many networks, which have network-based defense at the core (for good reason, as it was cost-effective). To create a zero-trust network, a starting point is to identify, enumerate and sequence all network flows.

I attended a talk by Centrify on this topic, which resonated with experiences in cloud, mobile and fog systems.

Related efforts in Kubernetes – Progress Toward Zero Trust Kubernetes Networks, Istio Service Mesh, API Gateway to Service Mesh. One can contrast the API gateway as being present only at the ingress point of a cloud, whereas with a Zero-trust/Service-mesh/Sidecar approach every microservice building block has its own external proxy and management ‘API’ added to it. The latter adds latency concerns for real-time applications, as the new sidecar proxies sit in the data path. One benefit of the service mesh is a mechanism to apply service-to-service security in a uniform manner.

The key original motivation behind Istio, per the second presentation by Lyft above, was greater observability and reliability across a complex cluster of microservices. That strikes me as a greater motivating use case for this technology than added security. From the security point of view, there is a parallel between the Istio approach and the SDN problem statement of a horizontal, ubiquitous security layer. Greater visibility is also a motivation behind the P4 programming language, presented in the disaggregated storage talk on protocol-independent switch architecture (PISA) here – one of the things it enables is in-band telemetry.

Hatman, Triton ICS Malware Analysis

A Triconex industrial controller provides triple modular redundancy and 2-out-of-3 consensus-vote-based control. The design has its origins in the 1980s industrial need for safety in industrial controllers. The product was acquired by Schneider via Invensys in 2014. The Hatman/Triton malware framework targeting this specific controller came to light in late 2017. The Triconex is programmed with TriStation, a Windows application which integrates with the Windows directory and allows programming in FBD, LD, ST and CEM.

From the Schneider Electric, Accenture and Mandiant analyses of the malware, more technical details appeared recently. A previous paper appeared in IEEE, Jan 2017. A brief summary is below.

Access to the controller network is necessary, and the Triconex controller needs to be in Program mode. A malware agent, TriLogger, running on Windows in the same network, talks over the Tricon protocol to program the Triconex controller and install/deploy the payload program. The malware payload then runs like a regular program on the controller, on every scan cycle – running in parallel in three versions.

Once on the controller, the malware looks for a way to elevate its privilege level. It starts observing the runtime, including memory inspections. A memory backdoor is attempted, but a probable error-handling mistake prevents it from working. Then, to access the firmware, it takes advantage of a zero-day vulnerability in the firmware. It is able to install itself in the firmware, overwriting a network function call. In the end it installs a Remote Access Trojan (RAT) to allow remote access to the controller. This could have been a vector to download further payloads, but no evidence was found that the RAT was actually used. It attempts to remove traces of itself after installation.

Source code of the program is at  https://github.com/ICSrepo/TRISIS-TRITON-HATMAN .

Zero-day attacks are a continuing challenge since, by definition, they are not widely known before they are used in an attack. However, a secure-by-design approach reduces the attack surface for exploits. There were opportunities to detect the malware on the network and on the Windows host.

Update: a CERT advisory for Triton appears at https://ics-cert.us-cert.gov/advisories/ICSA-18-107-02 and “Targeted Cyber Intrusion Detection and Mitigation Strategies” at https://ics-cert.us-cert.gov/tips/ICS-TIP-12-146-01B

Javascript Timing and Meltdown

In response to the Meltdown/Spectre side-channel vulnerabilities, which are based on fine-grained observation of the CPU to infer the cache state of an adjacent process or VM, one mitigation response by browsers was to reduce the time resolution of various time APIs, especially in javascript.

The authors responded with alternative sources of fine-grained timing available to browsers. An interpolation method obtains a fine resolution of 15 μs from a timer that is rounded down to multiples of 100 ms.
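
The interpolation idea can be illustrated with a toy Python sketch (coarse_clock is an illustrative clock that only ticks every 100 ms): busy-count while the coarse clock stands still, and the count recovers resolution far below the tick size:

import time

def coarse_clock():
    return int(time.time() * 10)  # a clock that only ticks every 100 ms

def fraction_to_next_tick():
    # count iterations until the next tick; the count is proportional
    # to the time remaining in the current 100 ms window
    start, n = coarse_clock(), 0
    while coarse_clock() == start:
        n += 1
    return n

print(fraction_to_next_tick())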

The javascript high-resolution time API is still widely available and is described at https://www.w3.org/TR/hr-time/ , with a reference to previous work on cache attacks in Practical cache attacks in JS.

A Meltdown PoC is at https://github.com/gkaindl/meltdown-poc , to test the timing attack in its own process. The RDTSC instruction returns the Time Stamp Counter (TSC), a 64-bit register that counts the number of cycles since reset, and so has a resolution of 0.5 ns on a 2 GHz CPU.

#include <stdio.h>
#include <x86intrin.h>  /* __rdtsc(), x86 only */

int main(void) {
    unsigned long long t = __rdtsc();  /* cycles since reset */
    printf("%llu\n", t);               /* format matches unsigned long long */
    return 0;
}

One Time Passwords for Authentication

A MAC, or Message Authentication Code, protects the integrity and authenticity of a message by allowing verifiers to detect changes to the message content. It requires a random key-generation algorithm that produces a per-message random key K, a signing algorithm which takes K and message M as input and produces signature S, and a verifying algorithm which takes K, M and S as input and produces a binary decision to accept or reject the message. Unlike a digital signature, a MAC typically does not provide non-repudiation. It is also called a protected checksum. Sender and recipient share the secret key K.

HMAC: so-called Hashed MAC because it uses a cryptographic hash function, such as MD5 or SHA, to create a MAC. The computed value is something only someone with the secret key can compute (sign) and check (verify). HMAC uses an inner key and an outer key to protect against length-extension and collision attacks on simple MAC implementations (RFC 2104). It is a type of Nested MAC (NMAC), where both inner and outer keys are derived from the same key in a way that keeps the derived keys independent.

HOTP: HMAC-based One-Time Password. HOTP is based on an incrementing counter, which serves as the message M; when run through the HMAC it produces a random set of bytes, which can be verified by the receiving party. The receiving party keeps a synchronized counter, so the message M = C does not need to be sent on the wire (RFC 4226).

TOTP: Time-based One-Time Password. TOTP combines a secret key with the current timestamp using a cryptographic hash function to generate a one-time password. Because network latency and out-of-sync clocks can result in the password recipient having to try a range of possible times to authenticate against, the timestamp typically increases in 30-second intervals. The requirement to keep the counter synchronized is replaced with time synchronization (RFC 6238).
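
A compact sketch of both in Python, following the RFC 4226 dynamic truncation (HMAC-SHA1, 6 digits; the shared secret below is illustrative):

import hmac, hashlib, struct, time

def hotp(key: bytes, counter: int, digits: int = 6) -> str:
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                   # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)

def totp(key: bytes, step: int = 30, digits: int = 6) -> str:
    return hotp(key, int(time.time()) // step, digits)        # counter = time window (RFC 6238)

print(totp(b"illustrative-shared-secret"))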

OCSP validation and OCSP stapling with letsencrypt

Online Certificate Status Protocol (OCSP) is a mechanism for browsers to check the validity of certificates presented by HTTPS websites, guarding against revoked certificates. This has been an issue for big websites, which have had bad certs issued (for their domain, to entities other than the owning company) that had to be revoked upon complaints to the cert issuer. Google has stated its intent to begin distrusting Symantec certs in 2018. A counterpoint to Google appears in this interesting article, which notes deficiencies in Chrome’s implementation of OCSP, and a privacy issue for website visitors with OCSP checks.

Let’s look at what Mozilla is doing about this, as they have attempted to implement OCSP correctly.

https://bugzilla.mozilla.org/show_bug.cgi?id=1366100

Telemetry indicates that fetching OCSP results is an important cause of slowness in the first TLS handshake. Firefox is, today, the only major browser still fetching OCSP by default for DV certificates.

In Bug 1361201 we tried reducing the OCSP timeout to 1 second (based on CERT_VALIDATION_HTTP_REQUEST_SUCCEEDED_TIME), but that seems to have caused only a 2% improvement in SSL_TIME_UNTIL_HANDSHAKE_FINISHED.

This bug is to disable OCSP fetching for DV certificates. OCSP fetching should remain enabled for EV certificates.

OCSP stapling will remain fully functional. We encourage everyone to use OCSP stapling.

[DV is basic Domain Validation, EV is Extended Validation with more ownership checks]

So they are moving away from OCSP to OCSP stapling. From Wikipedia: “OCSP stapling, formally known as the TLS Certificate Status Request extension, is an alternative approach to the Online Certificate Status Protocol (OCSP) for checking the revocation status of X.509 digital certificates. It allows the presenter of a certificate to bear the resource cost involved in providing OCSP responses by appending (‘stapling’) a time-stamped OCSP response signed by the CA to the initial TLS handshake, eliminating the need for clients to contact the CA”.

From the OCSP RFC

The Online Certificate Status Protocol (OCSP) enables applications to determine the (revocation) state of identified certificates.  OCSP may be used to satisfy some of the operational requirements of providing more timely revocation information than is possible with CRLs and may also be used to obtain additional status information. An OCSP client issues a status request to an OCSP responder and suspends acceptance of the certificates in question until the responder provides a response.

How to setup OCSP stapling with letsencrypt:

The CSR can request an OCSP Must-Staple option (PARAM_OCSP_MUST_STAPLE="yes"). The issued cert will then have the parameter with OID 1.3.6.1.5.5.7.1.24 and an OCSP URI, which are shown with:

openssl x509 -in cert.pem -text | grep 1.24

1.3.6.1.5.5.7.1.24:

openssl x509 -noout -ocsp_uri -in cert.pem

http://ocsp.int-x3.letsencrypt.org

This changes the issued cert itself, so that independent of a webserver/haproxy/nginx configuration that may disable OCSP, the browser can check the cert upon download and show the status of the OCSP check. This reduces the reliance on the website operator, who may be rogue. OK, but what happens if a cert gets revoked after initial issuance? The check should be done periodically – but by whom? Instead of disabling OCSP, the webserver should enable periodic rechecking and update the stapled response.

Also, this parameter is not shown in Safari/Firefox on cert inspection, but it shows up on the command line as param 1.3.6.1.5.5.7.1.24 being present in the downloaded cert.

https://esham.io/2016/01/ocsp-stapling

  1. Figure out which of the Let’s Encrypt certificates was used to sign your certificate.
  2. Download that certificate in PEM format.
  3. Point nginx to this file as the “trusted certificate”.
  4. In your nginx.conf file, add these directives to the same block that contains your other ssl directives:
    ssl_stapling on;
    ssl_stapling_verify on;
    ssl_trusted_certificate ssl/chain.pem;
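
Once configured, stapling can be verified from a client by requesting the certificate status during the TLS handshake; look for “OCSP Response Status: successful” in the output (example.com stands in for your host):

openssl s_client -connect example.com:443 -status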

Weeping Angel Remote Camera Monitor

Sales of hardware camera blockers and similar devices should increase with the Weeping Angel disclosure. Wikileaks detailed how the CIA and MI5 hacked Samsung TVs to silently monitor remote communications. Interesting to read for the level of technical detail: https://wikileaks.org/vault7/document/EXTENDING_User_Guide/EXTENDING_User_Guide.pdf . The attack was called ‘Weeping Angel’, a term borrowed from Doctor Who.

Other such schemes are described at https://wikileaks.org/vault7/releases/#Weeping%20Angel , including an iPhone implant to get data from your phone – https://wikileaks.org/vault7/document/NightSkies_v1_2_User_Guide/NightSkies_v1_2_User_Guide.pdf .

IOT security vulnerabilities and hacks

Mirror of https://github.com/nebgnahz/awesome-iot-hacks :

A curated list of hacks in the IoT space, so that researchers and industrial products can address the security vulnerabilities (hopefully).

The categories covered: Thingbots, RFID, Home Automation, Connected Doorbell, Hub, Smart Coffee, Wearable, Smart Plug, Cameras, Traffic Lights, Automobiles, Airplanes, Light Bulbs, Locks, Smart Scale, Smart Meters, Pacemaker, Thermostats, Fridge, Media Player & TV, Toilet, Toys.