Secure Machinery

On the evolution of security and intelligent machinery

EC2 P5 UltraClusters

Written by Ruchir Tewari

Each EC2 P5 instance has:

eight NVIDIA H100 GPUs capable of 16 petaFLOPs of mixed-precision performance
640 GB of high-bandwidth memory (80 GB per GPU)
3,200 Gbps networking connectivity (8x more than the previous generation)
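
The figures in this list can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the numbers listed above; the constant and function names are illustrative, and the previous-generation bandwidth is simply what the "8x" claim implies rather than a separately sourced figure.

```python
# Figures taken from the list above (per P5 instance).
GPUS_PER_INSTANCE = 8
HBM_PER_GPU_GB = 80
MIXED_PRECISION_PFLOPS = 16      # aggregate across all eight GPUs
NETWORK_GBPS = 3200
NETWORK_SPEEDUP_VS_PREVIOUS = 8  # "8x more than the previous generation"


def p5_summary() -> dict:
    """Derive per-GPU and aggregate numbers from the listed specs."""
    return {
        "total_hbm_gb": GPUS_PER_INSTANCE * HBM_PER_GPU_GB,
        "pflops_per_gpu": MIXED_PRECISION_PFLOPS / GPUS_PER_INSTANCE,
        "network_gbps_per_gpu": NETWORK_GBPS / GPUS_PER_INSTANCE,
        "network_gbytes_per_s": NETWORK_GBPS / 8,  # 8 bits per byte
        "implied_previous_gen_gbps": NETWORK_GBPS / NETWORK_SPEEDUP_VS_PREVIOUS,
    }


if __name__ == "__main__":
    for key, value in p5_summary().items():
        print(f"{key}: {value}")
```

The total of 640 GB agrees with the memory figure in the list, and the networking works out to 400 Gbps per GPU.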

The increased performance of P5 instances cuts the time to train machine learning (ML) models by up to 6x (from days to hours), and the additional GPU memory helps customers train larger, more complex models.

P5 instances are expected to lower the cost to train ML models by up to 40% over the previous generation, providing customers greater efficiency over less flexible cloud offerings or expensive on-premises systems.

Source: AWS and NVIDIA announcement – https://nvidianews.nvidia.com/news/aws-and-nvidia-collaborate-on-next-generation-infrastructure-for-training-large-machine-learning-models-and-building-generative-ai-applications

Nvidia H100 GPU overview and data sheet – https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

Diagram of P4d UltraClusters

A P4d instance has eight A100 GPUs with 40 GB of GPU memory each.

A P4de instance has eight A100 GPUs with 80 GB of GPU memory each.
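
To put the P4d, P4de, and P5 memory figures side by side, here is a small sketch that computes total GPU memory per instance from the per-GPU numbers quoted in this post. The `InstanceSpec` class and `memory_table` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    name: str
    gpu: str
    gpu_count: int
    gpu_memory_gb: int

    @property
    def total_gpu_memory_gb(self) -> int:
        return self.gpu_count * self.gpu_memory_gb


# Per-GPU figures as quoted in this post.
SPECS = [
    InstanceSpec("P4d", "A100", 8, 40),
    InstanceSpec("P4de", "A100", 8, 80),
    InstanceSpec("P5", "H100", 8, 80),
]


def memory_table(specs) -> dict:
    """Map instance name to total GPU memory in GB."""
    return {s.name: s.total_gpu_memory_gb for s in specs}


if __name__ == "__main__":
    for name, total in memory_table(SPECS).items():
        print(f"{name}: {total} GB")
```

By this count a P4de and a P5 instance carry the same total GPU memory, twice that of a P4d.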

Nvidia blog on HGX baseboard supporting 8 A100 GPUs – https://developer.nvidia.com/blog/introducing-hgx-a100-most-powerful-accelerated-server-platform-for-ai-hpc/

A100 80GB data sheet – https://www.nvidia.com/en-us/data-center/a100/

MIG support in A100 – https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/ and MIG user guide – https://docs.nvidia.com/datacenter/tesla/mig-user-guide

MIG support in AWS EC2 instance type P4d and in AWS EKS – https://developer.nvidia.com/blog/amazon-elastic-kubernetes-services-now-offers-native-support-for-nvidia-a100-multi-instance-gpus/

GCP A2 VMs offer up to 16 A100 GPUs per node – https://cloud.google.com/blog/products/compute/announcing-google-cloud-a2-vm-family-based-on-nvidia-a100-gpu

MIG support in GKE – https://cloud.google.com/blog/products/containers-kubernetes/gke-now-supports-multi-instance-gpus

Running more pods per GPU on EKS with MIG – https://medium.com/itnext/run-more-pods-per-gpu-with-nvidia-multi-instance-gpu-d4f7fb07c9b5
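
As a rough illustration of why MIG allows more pods per GPU, the sketch below estimates an upper bound on pods per P4d node when every pod requests a single MIG device. The per-profile instance counts are an assumption taken from the A100 40GB table in the NVIDIA MIG user guide and should be checked against the guide for your driver version. The function name is hypothetical, and the estimate assumes every GPU on the node is partitioned uniformly with the same profile.

```python
# Maximum MIG instances of each profile on one A100 40GB GPU.
# Assumed from the NVIDIA MIG user guide; verify before relying on it.
MAX_INSTANCES_PER_A100_40GB = {
    "1g.5gb": 7,
    "2g.10gb": 3,
    "3g.20gb": 2,
    "4g.20gb": 1,
    "7g.40gb": 1,
}

GPUS_PER_P4D = 8  # eight A100 GPUs per P4d instance, as stated in this post


def max_single_device_pods(profile: str, gpus: int = GPUS_PER_P4D) -> int:
    """Upper bound on pods per node when each pod requests one MIG device
    and all GPUs use the same MIG profile."""
    return MAX_INSTANCES_PER_A100_40GB[profile] * gpus


if __name__ == "__main__":
    for profile in MAX_INSTANCES_PER_A100_40GB:
        print(f"{profile}: up to {max_single_device_pods(profile)} pods per P4d node")
```

Under these assumptions, the smallest profile allows up to 56 single-device pods on one P4d node, compared with 8 when each pod takes a whole GPU.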

Nvidia Embraces The CPU World With “Grace” Arm Server Chip


March 24, 2023 (updated April 21, 2024) · Posted in AWS, deep learning, gpu, ml · Tagged ml
