Interesting position paper from CNCF on securing the software supply chain. However, I think software development is much more collaborative than manufacturing, and the manufacturing supply chain itself has not yet reached this level of security maturity.
Distributed training aims to reduce the training time of a machine learning model by splitting the training workload across multiple nodes. As both data and model sizes have grown, distributed training has become an area of focus in ML. Training consists of iteratively minimizing an objective function by running the data through the model and determining a) the error and the gradients with which to adjust the model parameters (forward and backward passes) and b) the updated model parameters from the calculated gradients (update step). The latter step always requires synchronization between the nodes; in some cases the former requires communication as well.
There are three approaches to distributed training – data parallelism, model parallelism, and data-model parallelism. Data parallelism is the most common and is preferred when the model fits in GPU memory.
In data parallelism, we partition the data across the different GPUs and run the same model on each data partition. The same model is present on all GPU nodes, and no communication between nodes is needed on the forward pass. The calculated gradients are sent to a parameter server, which averages them; the updated parameters are then retrieved by all the nodes, so that each node holds the same incrementally updated model.
In model parallelism, we partition the model itself into parts and run these on different GPUs.
To communicate intermediate results between nodes, MPI primitives such as AllReduce are leveraged, as sketched below.
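As a concrete illustration, here is a minimal sketch of data-parallel gradient averaging using mpi4py; the gradient values are placeholders, and the script name in the launch command is hypothetical.

```python
# Minimal sketch of data-parallel gradient averaging with mpi4py.
# Launch with: mpirun -np 4 python train_step.py (hypothetical script name)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()                # number of worker processes

# Placeholder local gradients; in real training these come from backprop
# on this worker's partition of the data.
local_grad = np.random.rand(4)

# AllReduce sums the gradients across all workers and distributes the
# result back to every worker; dividing by size yields the average.
avg_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
avg_grad /= size

# Every worker now applies the same averaged gradient, keeping models in sync.
```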
The amount of training data for BERT is ~600GB. The BERT-Tiny model is 17MB; the BERT-Base model is ~400MB. When training BERT-Base, a GPU with 16GB of memory can hit an OOM error.
Some links to resources –
https://mccormickml.com/2019/11/05/GLUE/ The origin of the General Language Understanding Evaluation (GLUE) benchmark.
Horovod core principles are based on the MPI concepts size, rank, local rank, allreduce, allgather, and broadcast. These are best explained by example. Say we launched a training script on 4 servers, each having 4 GPUs. If we launched one copy of the script per GPU:
- Size would be the number of processes, in this case, 16.
- Rank would be the unique process ID from 0 to 15 (size – 1).
- Local rank would be the unique process ID within the server from 0 to 3.
- Allreduce is an operation that aggregates data among multiple processes and distributes the results back to them. Allreduce is used to average dense tensors (see the MPI Tutorial for an illustration).
- Allgather is an operation that gathers data from all processes in a group and then sends the gathered data back to every process. Allgather is used to collect values of sparse tensors (see the MPI Tutorial for an illustration).
- Broadcast is an operation that broadcasts data from one process, identified by the root rank, to every other process (see the MPI Tutorial for an illustration). A code sketch of these concepts follows.
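To make the concepts concrete, here is a hedged sketch using mpi4py. Plain MPI has no built-in notion of local rank, so the sketch approximates Horovod's local rank by splitting the communicator per host; the script name is hypothetical.

```python
# Sketch of MPI size/rank/local rank and the collective primitives above.
# Launch with: mpirun -np 16 python mpi_concepts.py (hypothetical script name)
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()                 # total number of processes, e.g. 16
rank = comm.Get_rank()                 # unique process ID in [0, size-1]

# Approximate local rank: split the communicator per host, then take the
# rank within the per-host communicator (e.g. 0-3 on a 4-GPU server).
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
local_rank = node_comm.Get_rank()

# allgather: every process contributes a value; all receive the full list.
all_ranks = comm.allgather(rank)       # [0, 1, ..., size-1] on every process

# broadcast: the root rank sends its value to every other process.
greeting = "hello from root" if rank == 0 else None
greeting = comm.bcast(greeting, root=0)
```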
ML training on images and text together leads to certain neurons holding information about both images and text – multimodal neurons.
When the type of the detected object can be changed by tricking the model into recognizing a textual description instead of a visual one, that can be called a typographic attack.
Intriguing concepts indicating that a fluid crossover from text to images and back is almost here.
Bitcoin reached a $1T market cap last month. https://www.msn.com/en-us/news/technology/bitcoin-reaches-dollar1-trillion-valuation-twice-as-fast-as-amazon/ar-BB1fF3Bl
A Bitcoin halving event takes place every 210,000 blocks. This cuts the reward for securing a block in half. Three Bitcoin halvings have taken place so far, in 2012, 2016, and 2020. The next halving is predicted to occur in 2024. The corresponding block reward went from 50 BTC in 2009 to 25 in ‘12, 12.5 in ‘16, 6.25 in ‘20, and will drop to 3.125 in ‘24. https://www.coinwarz.com/mining/bitcoin/halving
The rate of production of bitcoin over time follows this halving schedule; mining will continue until 21 million BTC are created, as the sketch below illustrates.
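A small sketch to reproduce the schedule; the era arithmetic follows the 210,000-block halving interval described above, tracking rewards in satoshis as the consensus code does.

```python
# Sketch of Bitcoin's block reward schedule and the ~21M BTC supply cap.
# Rewards are in satoshis (1 BTC = 100_000_000 sat) and halve by integer
# right-shift every 210,000 blocks.
HALVING_INTERVAL = 210_000
INITIAL_REWARD_SAT = 50 * 100_000_000

def block_reward_sat(height: int) -> int:
    era = height // HALVING_INTERVAL
    return INITIAL_REWARD_SAT >> era if era < 64 else 0

for era in range(5):                  # 50, 25, 12.5, 6.25, 3.125 BTC
    print(era, block_reward_sat(era * HALVING_INTERVAL) / 1e8)

# Total supply: sum the rewards of all eras until the reward shifts to zero.
total_sat = sum(HALVING_INTERVAL * (INITIAL_REWARD_SAT >> era) for era in range(64))
print(total_sat / 1e8)                # ~20,999,999.9769 BTC, just under 21M
```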
VeChain is a blockchain proposal/implementation for supply chain tracking.
EdgeChain is an architecture for placement of applications on the edge, amongst multiple resources from multiple providers. It is built on VeChain for Mobile and Edge Computing use cases.
The NVIDIA Volta V100 GPU released in Dec 2017 was the first microprocessor with dedicated cores purely for matrix computations, called Tensor Cores. The Ampere A100 GPU released in May 2020 is its successor. A comparison is provided here. Tensor Cores reduce the cycle time for matrix multiplications, operating on 4×4 matrices of 16-bit floating point numbers. These GPUs are aimed at Deep Learning use cases, which consist of a pipeline of matrix operations.
In math, we have Scalars and Vectors. Scalars are used for magnitude, and Vectors encode magnitude and direction. To transform Vectors, one applies Linear Transformations in the form of Matrices. Matrices for Linear Transformations have eigenvectors and eigenvalues, which describe the invariants of the transformation. A Tensor in math and physics is a concept that exhibits certain types of invariance under transformations. In 3 dimensions, a Stress Tensor has 9 components, which can be represented as a 3×3 matrix.
In Deep Learning applications a Tensor is basically a Matrix. The Generalized Matrix Multiplication (GEMM) operation, D = A×B + C, is at the heart of Deep Learning, and Tensor Cores are designed to speed it up, as sketched below.
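A minimal numpy sketch of the GEMM operation; the float16 inputs mirror the 16-bit operands Tensor Cores consume, while accumulation happens at float32, mimicking mixed-precision hardware behavior.

```python
# Sketch of the GEMM operation D = A x B + C that Tensor Cores accelerate.
import numpy as np

M, K, N = 4, 4, 4                             # Tensor Cores work on 4x4 tiles
A = np.random.rand(M, K).astype(np.float16)   # 16-bit input operands
B = np.random.rand(K, N).astype(np.float16)
C = np.random.rand(M, N).astype(np.float32)   # accumulator at higher precision

# Upcast the products and accumulate in float32; Tensor Cores do the fp16
# multiply with fp32 accumulate natively in a single hardware operation.
D = A.astype(np.float32) @ B.astype(np.float32) + C
```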
Tesla Dojo is an advertised attempt to build a processor/computer dedicated to Deep Learning, to process vast amounts of video data.
AWS Inferentia is a chip for deep learning inference, with four NeuronCores per chip.
Generally speaking, the desire in the deep learning community is to have simpler processing units in larger numbers.
SAP Transportation Management or SAP TM is a module used for Supply Chain Optimization.
SAP TM has four different optimizer engines –
VSR Optimizer: Plan shipments in the best possible way on available Vehicles via available routes. The TVSR (Vehicle Scheduling and Routing), TVSS, and TVRG applications come under this.
Load Optimizer: Arrange pallets or packages on the vehicle considering rules like Stackability, etc. (TVSO Application)
Carrier Selection: Rank carriers for each shipment considering costs, Business Shares, Allocations. (TSPS Application)
Strategic Freight Management: Rank bids by carriers for long-term contracts based on Cost, Capacity & Risk. (TSFM Application)
The need for Transportation Management as a service is justified by several use cases.
Judging by many recent announcements, leading car manufacturers and other companies whose business models are susceptible to disruption are adopting TaaS platforms (through in-house development efforts, partnerships, or acquisitions) to provide services:
- Toyota launches a new Mobility Ecosystem and Concept Vehicle powered by its Mobility Services Platform
- BMW introduces their ReachNow ride hailing service powered by RideCell’s platform
- Ford acquires Autonomic as it evolves its mobility business
- AAA announces Gig Car Share service
- Telecom Giants fear missing the money as cars go online
The role of APIs in modernizing supply chain systems from legacy EDI based designs – https://www.coupa.com/blog/supply-chain/tech-forward-apis-emerging-player-supply-chain
A comparison of API vs EDI systems – https://arcb.com/blog/edi-vs-api-which-is-right-for-my-business
Some definitions from Wikipedia to clarify concepts-
Logistics is generally the detailed organization and implementation of a complex operation. In a general business sense, logistics is the management of the flow of things between the point of origin and the point of consumption to meet the requirements of customers or corporations.
The resources managed in logistics may include tangible goods such as materials, equipment, and supplies, as well as food and other consumable items.
Logistics management is the part of supply chain management and supply chain engineering that plans, implements, and controls the efficient, effective forward and reverse flow and storage of goods, services, and related information between the point of origin and point of consumption to meet customers’ requirements. The complexity of logistics can be modeled, analyzed, visualized, and optimized by dedicated simulation software.
The minimization of the use of resources is a common motivation in all logistics fields.
A supply chain is the connected network of individuals, organizations, resources, activities, and technologies involved in the manufacture and sale of a product or service.
How can we be better prepared for a future crisis relative to supply chains?
Private companies have playbooks for supply chain disruptions in their network. In supply chain management, it is crucial to diversify your sources of supply so that when one supplier is impacted, you can turn to another.
Kubernetes is a Platform-as-a-Service (PaaS) similar to Cloud Foundry. It has a more centralized control plane compared to Cloud Foundry.
A threat matrix for Kubernetes – https://www.microsoft.com/security/blog/2020/04/02/attack-matrix-kubernetes/
From RSA’20, a talk on The future of Kubernetes attacks – https://youtu.be/CH7S5rE3j8w
Coinbase: Why Kubernetes is not part of our stack – https://blog.coinbase.com/container-technologies-at-coinbase-d4ae118dcb6c makes these points:
- it needs a full-time compute team to maintain
- securing it is neither trivial nor well understood. SPIFFE, SPIRE, Envoy, Istio, OPA, service mesh are a few of the technologies.
This blog links to – https://k8s.af/
Another similar viewpoint – https://pythonspeed.com/articles/dont-need-kubernetes/
A counterpoint to the Coinbase blog – https://blog.kumina.nl/2020/07/in-response-to-container-technologies-at-coinbase/
K8S is based on a Controller pattern:
- Resources capture the desired state.
- Current state is kept centralized in etcd, a distributed key-value store (similar to Consul).
- Controllers reconcile current state with desired state (sketched below).
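A schematic, hedged sketch of the reconcile loop at the heart of this pattern; the store layout and function names are illustrative, not the actual Kubernetes API.

```python
# Schematic controller loop: drive current state toward desired state.
import time

def reconcile(store: dict) -> None:
    # In Kubernetes the desired state lives in etcd and is read via the
    # API server; here a plain dict stands in for that store.
    desired = store["desired_replicas"]
    current = store["current_replicas"]
    if current < desired:
        store["current_replicas"] += 1    # e.g. start a pod
    elif current > desired:
        store["current_replicas"] -= 1    # e.g. stop a pod

store = {"desired_replicas": 3, "current_replicas": 0}
while store["current_replicas"] != store["desired_replicas"]:
    reconcile(store)                      # real controllers watch for changes
    time.sleep(0.01)                      # and reconcile continuously
```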
Pod is a top-level resource, is the smallest deployment unit, and is a group of one or more containers described by a yaml file, similar to docker-compose.yml.
A K8S Operator is a kind of resource manager for Custom Resources.
Spinnaker – a Continuous Delivery platform that itself runs on k8s as a set of pods, which can be scaled up.
kubectl cheat sheet:
An article on cloud security https://medium.com/xm-cyber/having-fun-with-cloud-services-e281f8a7fe60
Creating my first Quantum Crypto keys – QC hardware generated an Encapsulation Key using Classic McEliece. The keys were emailed to me after generation at the Cambridge Quantum Computing booth at RSA World 2020.
Collection of interesting talks on AWS security at re:Invent and re:Inforce 2019.
We want to walk through some common metrics in classification problems, such as accuracy, precision, and recall, and get a feel for when to use which metric. Say we are looking for a needle in a haystack. There are very few needles in a large haystack full of straws. An automated machine is sifting through the objects in the haystack and predicting for each object whether it is a straw or a needle. A reasonable predictor will predict a small number of objects as needles and a large number as straws.
Positive Prediction: the object at hand is predicted to be the needle. A small number.
Negative Prediction: the object at hand is predicted not to be a needle. A large number.
True_Positive: of the total number of predictions, the number of predictions that were positive and correct. Correctly predicted Positives (needles). A small number.
True_Negative: of the total number of predictions, the number of predictions that were negative and correct. Correctly predicted Negatives (straws). A large number.
False_Positive: of the total number of predictions, the number of predictions that are positive but the prediction is incorrect. Incorrectly predicted Positives (straw predicted as needle). Could be large as the number of straws is large, but assuming the total number of predicted needles is small, this is less than or equal to predicted needles, hence small.
False_Negative: of the total number of predictions, the number of predictions that are negative but incorrect. Incorrectly predicted Negatives (needle predicted as straw). Is this a large number? It is not immediately obvious – this class is not large just because the class of negatives is large; it depends on the predictor, and a “reasonable” predictor which predicts most objects as straws could also predict many needles as straws. It is, however, less than or equal to the total number of needles, hence small.
Predicted_Positives = True_Positives + False_Positives = Total number of objects predicted as needles.
Actual Positives = Actual number of needles, which is independent of the number of predictions either way; note, however, that Actual Positives = True Positives + False Negatives.
Accuracy = nCorrect_Predictions / nTotal_Predictions = (nTrue_Positives + nTrue_Negatives) / (nPredicted_Positives + nPredicted_Negatives). # the reasonable assumption above is equivalent to a high accuracy; most predictions will be hay, and correct simply because of the skewed distribution. This does not shed light on FP or FN.
Precision = nTrue_Positives / nPredicted_Positives # correctly_identified_needles/predicted_needles; this sheds light on FP; Precision = 1 => FP = 0 => all predictions of needles are in fact needles; a precision less than 1 means we got a bunch of hay with the needles – which gives hope that with further sifting the hay can be removed. Precision is also called the Positive Predictive Value and quantifies the absence of False Positives, or incorrect diagnoses.
Recall = nTrue_Positives / nActual_Positives = TP/(TP+FN) # correctly_identified_needles/all_needles; this sheds light on FN; Recall = 1 => FN = 0; a recall less than 1 is awful, as some needles are left out in the sifting process. Recall is also called Sensitivity.
Precision > Recall => FN is higher than FP
Precision < Recall => FN is lower than FP
If at least one needle is correctly identified as a needle, both precision and recall will be positive; if zero needles are correctly identified, both precision and recall are zero.
F1 Score is the harmonic mean of Precision and Recall: 1/F1 = (1/2)(1/P + 1/R), i.e. F1 = 2PR/(P+R). F1 = 0 if P = 0 or R = 0; F1 = 1 if P = 1 and R = 1.
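A small sketch with made-up counts shows how these metrics behave on a skewed haystack; the numbers are illustrative only.

```python
# Needle/haystack metrics on an illustrative, skewed confusion matrix.
tp, fp = 8, 2        # 10 objects predicted as needles; 8 truly are needles
fn, tn = 12, 99_978  # 20 actual needles among 100,000 objects

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # dominated by the straw class
precision = tp / (tp + fp)                    # quality of positive predictions
recall    = tp / (tp + fn)                    # coverage of the actual needles
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.5f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy ~0.99986 looks excellent, yet recall=0.40 shows most needles were missed.
```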
ROC/AUC rely on Recall (= TP/(TP+FN)) and another metric, the False Positive Rate, defined as FP/(FP+TN) = hay_falsely_identified_as_needles/total_hay. As TN >> FP, this will be close to zero, so it does not appear to be a useful metric in the context of needles in a haystack; nor do ROC/AUC. Note that the denominators differ between Recall and FPR: total needles and total hay, respectively.
There’s a bit of semantic confusion when saying True Positive or False Positive. These shorthands can be interpreted as: it was known that an instance was a Positive, and a label of True or False was applied to that instance. But what we actually mean is that it was not known whether the instance was a Positive; a determination was made that it was a Positive, and this determination was later found to be correct (True) or incorrect (False). Mentally replace True/False with ‘correctly/incorrectly identified as’ to remove this confusion.
Normalization: scale of 0-1, or unit norm; useful for dot products when calculating similarity.
Standardization: zero mean, divided by the standard deviation; useful for neural network/classifier inputs. (Both are sketched below, after these definitions.)
Regularization: used to reduce sensitivity to certain features, commonly in regression. L1: Lasso regression; L2: Ridge regression.
Confusion matrix: holds the number of predicted values vs the known truth; a square matrix with size n equal to the number of categories.
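A quick numpy sketch contrasting normalization and standardization on a toy feature column; the values are arbitrary.

```python
# Normalization vs standardization on a toy feature column.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization to the [0, 1] range (min-max scaling).
x_minmax = (x - x.min()) / (x.max() - x.min())

# Normalization to unit L2 norm, handy before dot-product similarity.
x_unit = x / np.linalg.norm(x)

# Standardization: zero mean, unit standard deviation.
x_std = (x - x.mean()) / x.std()

print(x_minmax, x_unit, x_std, sep="\n")
```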
I wanted to get a better understanding of firecracker microVM security, from the bottom up. A few questions –
a) how does the firecracker design achieve a smaller threat surface than a typical vm/container?
b) what mechanisms are available to secure code running in a microvm?
c) and lastly, how can microvms change security considerations when deploying code for web services?
The following design elements contribute to a smaller threat surface:
- minimal design, implemented in Rust, a memory-safe language, with compact and readable code
- minimal guest virtual device model: a network device, a block I/O device, a timer, a KVM clock, a serial console, and a partial keyboard
- minimal networking; from docs/vsock.md : “The Firecracker vsock device aims to provide full virtio-vsock support to software running inside the guest VM, while bypassing vhost kernel code on the host. To that end, Firecracker implements the virtio-vsock device model, and mediates communication between AF_UNIX sockets (on the host end) and AF_VSOCK sockets (on the guest end).”
- static linking of the firecracker process limits dependencies
- seccomp BPF limits the system calls to 35 allowed calls, 30 with simple filtering, 5 with advanced filtering that limits the call based on parameters (SeccompFilter::new call in vmm/src/default_syscalls/filters.rs, seccomp/src/lib.rs)
The production security setup recommends using the jailer to apply isolation based on cgroups, namespaces, and seccomp. These techniques are typical of container isolation and act in addition to the KVM-based isolation.
The Firecracker Host Security Configuration recommends a series of checks to mitigate side-channel issues for a multi-tenant system:
- Disable Simultaneous Multithreading (SMT)
- Check Kernel Page-Table Isolation (KPTI) support
- Disable Kernel Same-page Merging (KSM)
- Check for speculative branch prediction issue mitigation
- Apply L1 Terminal Fault (L1TF) mitigation
- Apply Speculative Store Bypass (SSBD) mitigation
- Use memory with Rowhammer mitigation support
- Disable swapping to disk or enable secure swap
How is the firecracker process organized? The docs/design.md has the following descriptions:
Internal Design: Each Firecracker process encapsulates one and only one microVM. The process runs the following threads: API, VMM and vCPU(s). The API thread is responsible for Firecracker’s API server and associated control plane. It’s never in the fast path of the virtual machine. The VMM thread exposes the machine model, minimal legacy device model, microVM metadata service (MMDS), and VirtIO device emulated Net and Block devices, complete with I/O rate limiting. In addition, there are one or more vCPU threads (one per guest CPU core). They are created via KVM and run the `KVM_RUN` main loop. They execute synchronous I/O and memory-mapped I/O operations on device models.
Threat Containment: From a security perspective, all vCPU threads are considered to be running malicious code as soon as they have been started; these malicious threads need to be contained. Containment is achieved by nesting several trust zones which increment from least trusted or least safe (guest vCPU threads) to most trusted or safest (host). These trusted zones are separated by barriers that enforce aspects of Firecracker security. For example, all outbound network traffic data is copied by the Firecracker I/O thread from the emulated network interface to the backing host TAP device, and I/O rate limiting is applied at this point.
What about mechanisms to secure the code running inside firecracker? The serverless environment, AWS Lambda, and its security best practices are a place to start. Resources on these are here, here, here and here. AWS API Gateway supports input validation, as described here. While serverless reduces the attack surface, web threats such as those in the OWASP Top 10 still apply and must be taken into account during design and testing.
For the last question – uVMs and serverless appear to offer a promising model to build a service incrementally from small secure building blocks – and this is something to explore further.
These are some notes from a talk by Aviatrix last week. Many customers get started with the Aviatrix orchestration system for deploying AWS Transit Gateway (TGW) and Direct Connect. The transit gateway is the hub gateway that connects multiple VPCs with an on-premise link, possibly over Direct Connect. The Aviatrix product can then deploy and manage multiple VPCs and the communication between them, directing which VPC can talk to which other VPC. It controls the communication by simply deleting routes.
The advanced transit controller solution is useful across multiple regions, to manage the communication between regions. Another aspect is that there are high-speed interconnects between the cloud providers, and Aviatrix builds an overlay that bridges the public clouds. Multi-account communication, and secure communication between the networks using segmentation, can also be enabled.
According to Aviatrix, AWS’s motto is “go build” and do it yourself; it is designed for the builders. But when you go beyond 3 VPCs to 3000 VPCs, one needs a solution to manage the routes in an automated manner. This is the situation for many larger customers. The product also finds use with smaller customers that have Production, Development, and Edge/On-premise network components to manage.
Remote user VPN is another use case. Not only can one VPN in and get to all the VPCs, but one can also specify which CIDRs each user can reach, along with other restrictions.
“The attention mechanism allows the model to create the context vector as a weighted sum of the hidden states of the encoder RNN at each previous timestamp.”
“Transformer is a type of model based entirely on attention, and does not require recurrent or convolutional layers”
The context vector is the output of the Encoder in an Encoder-Decoder network (EDN). EDNs struggle to retain all the information the decoder requires to decode accurately. Attention is a mechanism to solve this problem.
“Attention mechanisms let a model directly look at, and draw from, the state at any earlier point in the sentence. The attention layer can access all previous states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens.”
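As a concrete, hedged sketch of the core computation, here is scaled dot-product attention in numpy; the shapes and values are illustrative, and real Transformer layers add learned projections and multiple heads.

```python
# Scaled dot-product attention: weigh all earlier states by their learned
# relevancy to the current token, then return their weighted sum (the context).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # relevancy of each state
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the states
    return weights @ V                               # weighted sum: the context

# One query attending over four hidden states of dimension 8.
Q = np.random.rand(1, 8)
K = np.random.rand(4, 8)
V = np.random.rand(4, 8)
context = scaled_dot_product_attention(Q, K, V)      # shape (1, 8)
```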
GPT: Generative Pre-trained Transformer. Unlike BERT, it is generative, geared not to comprehension/translation/summarization tasks but to writing/generative tasks. BERT is a response to GPT, and GPT-2 is in turn a response to BERT. GPT-2 was released in Feb 2019 and is trained on 40GB of text.
This attention concept looks akin to a Fourier or Laplace transform, which encodes the entire input signal in a lossless manner – just my observation. Although implemented differently, it is a way to keep track of and refer to global state.
AutoML and Transformer – http://ai.googleblog.com/2019/06/applying-automl-to-transformer.html
BERT and GPT are both based on the Transformer ideas. BERT is bidirectional and better at comprehending meaning from the whole sentence/phrase, whereas GPT is better at generating text.
Bahdanau, 2014 https://arxiv.org/abs/1409.0473
“The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.”
OmniSci is a columnar database that reads a column into GPU memory, in compressed form, allowing for interactive queries on the data. A single GPU can load 10 to 50 million rows of data and allows interactive querying without indexing. A demo was shown at the GTC keynote this year by Aaron Williams, who gave a talk on vehicle analytics that I attended last month.
In the vehicle telemetry demo, they obtain vehicle telemetry from an F1 game that outputs data as UDP, tens of thousands of packets a second – take the binary data off of UDP, convert it to JSON, and use it as a proxy for real telemetry data. The webserver refreshes every 3-4 seconds. The use case is the analysis of increasing amounts of vehicle sensor data, as discussed in this video and described in the detailed OmniSci blog post here.
The vehicle analytics demo pipeline consisted of UDP to Kafka, Kafka to JSON, then JSON to OmniSci via pymapd (sketched below). Kafka serves as a message broker and also allows playback of data.
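A hedged sketch of the last hop of that pipeline; the host, credentials, table name, and columns are made up, and this assumes pymapd's connect/load_table client API.

```python
# Sketch: load parsed telemetry rows into OmniSci via pymapd.
# Host, credentials, table name, and columns below are hypothetical.
import pandas as pd
from pymapd import connect

con = connect(user="admin", password="secret",
              host="localhost", dbname="omnisci")

# Rows parsed from the Kafka JSON stream (illustrative columns).
telemetry = pd.DataFrame({
    "vehicle_id": [7, 7],
    "speed_kph": [301.5, 298.2],
    "ts": pd.to_datetime(["2019-04-01 10:00:00", "2019-04-01 10:00:01"]),
})

# Append the batch into a table; interactive SQL then runs on the GPU-resident data.
con.load_table("telemetry", telemetry)
```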
Based on the GPU-loaded data, the database allows queries and stats on the different vehicles that are running.
The entire system runs in the cloud on a VM supporting Nvidia GPUs, and can also be run on a local GPU box.