Secure Machinery

On the evolution of security and intelligent machinery

EC2 Trainium UltraClusters

Written by Ruchir Tewari

Each EC2 Trn1 instance has:

up to 16 AWS Trainium accelerators, purpose-built to accelerate deep learning (DL) training, delivering up to 3.4 petaflops of FP16/BF16 compute. Each accelerator includes two second-generation NeuronCores.
512 GB of shared accelerator memory (HBM) with 9.8 TB/s of total memory bandwidth
up to 1600 Gbps of Elastic Fabric Adapter (EFAv2) network bandwidth
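
To put the instance-level numbers in per-chip terms, here is a small back-of-the-envelope sketch. It assumes a fully populated instance (16 accelerators) and that compute, HBM capacity, and HBM bandwidth split evenly across accelerators; the constant and function names are my own, not anything from AWS.

```python
# Instance-level figures quoted above (fully populated Trn1 instance).
ACCELERATORS_PER_INSTANCE = 16
INSTANCE_PFLOPS = 3.4            # peak FP16/BF16 compute, petaflops
INSTANCE_HBM_GB = 512            # shared accelerator memory, GB
INSTANCE_HBM_TBPS = 9.8          # total memory bandwidth, TB/s
NEURONCORES_PER_ACCELERATOR = 2


def per_accelerator_figures():
    """Divide the instance totals evenly across accelerators.

    Assumption: resources are split uniformly per accelerator.
    These are peak numbers, not measured throughput.
    """
    n = ACCELERATORS_PER_INSTANCE
    return {
        "tflops": INSTANCE_PFLOPS * 1000 / n,          # petaflops -> teraflops
        "hbm_gb": INSTANCE_HBM_GB / n,
        "hbm_gb_per_s": INSTANCE_HBM_TBPS * 1000 / n,  # TB/s -> GB/s
        "neuroncores_per_instance": n * NEURONCORES_PER_ACCELERATOR,
    }


if __name__ == "__main__":
    for name, value in per_accelerator_figures().items():
        print(f"{name}: {value:g}")
```

Under those assumptions each accelerator works out to roughly 212.5 teraflops of peak FP16/BF16 compute and 32 GB of HBM, with 32 NeuronCores per instance.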

An EC2 Trn1 UltraCluster consists of densely packed, co-located racks of Trn1 compute instances interconnected by non-blocking, petabit-scale networking. According to AWS, it is their largest UltraCluster to date, offering 6 exaflops of compute power on demand with up to 30,000 Trainium chips.

https://aws.amazon.com/blogs/machine-learning/scaling-large-language-model-llm-training-with-amazon-ec2-trn1-ultraclusters/
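
As a sanity check, the cluster-level number can be rebuilt from the instance-level one. This is only a sketch: it assumes every instance carries the full 16 chips and that peak compute adds up linearly across instances, and the names are illustrative.

```python
# Figures quoted in the post.
CLUSTER_CHIPS = 30_000       # "up to 30,000 Trainium chips"
CHIPS_PER_INSTANCE = 16      # fully populated Trn1 instance (assumption)
INSTANCE_PFLOPS = 3.4        # peak FP16/BF16 petaflops per instance


def cluster_estimate(chips=CLUSTER_CHIPS):
    """Return (instance_count, peak_exaflops) for a cluster of `chips` chips.

    Assumption: peak compute scales linearly with instance count,
    ignoring any networking or utilization overhead.
    """
    instances = chips / CHIPS_PER_INSTANCE
    exaflops = instances * INSTANCE_PFLOPS / 1000   # petaflops -> exaflops
    return instances, exaflops


if __name__ == "__main__":
    instances, exaflops = cluster_estimate()
    print(f"instances: {instances:g}, peak exaflops: {exaflops:.3f}")
```

That gives about 1,875 instances and roughly 6.4 peak exaflops, which lines up with the quoted "6 exaflops" once rounded down.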

February 26, 2023 (updated April 21, 2024) · Posted in AWS, deep learning, gpu, ml · Tagged ml
