Starting a thread to collect notes on running AI models locally …
NVIDIA didn’t want me to do this
The video walks through building and benchmarking an 8‑node NVIDIA DGX Spark cluster with 1 TB of effective “VRAM” for running very large LLMs, including new 400B‑class models, and what actually scales well vs. what doesn’t.
Hardware and networking setup
- He clusters four DGX Sparks first, each with 128 GB GPU memory, for a total of 512 GB, using NVIDIA ConnectX‑7 NICs and 400 Gbps‑class QSFP56 cabling through a MikroTik 400 GbE switch that supports RDMA/RoCE.
- He discovers his switch port is hard‑coded to 50 Gbps and, with help from Claude and the switch CLI, reconfigures ports to 100 Gbps per interface, reaching 200 Gbps per Spark via dual virtual interfaces per physical port.
- He validates low latency with InfiniBand-style RDMA tests: ~3 µs through the switch and ~2 µs when directly connecting two Sparks.
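Those microsecond numbers matter because tensor-parallel decode does on the order of two all-reduce collectives per transformer layer per generated token. A back-of-envelope sketch (layer count and the TCP comparison figure are assumptions for illustration, not from the video):

```python
# Pure network-latency cost per generated token, under the assumption of
# ~2 all-reduces per layer per token and a 64-layer model.
layers = 64
collectives_per_token = 2 * layers

for lat_us in (3, 50):  # ~3 us RoCE (as measured) vs ~50 us plain-TCP hop (assumed)
    overhead_ms = collectives_per_token * lat_us / 1000
    print(f"{lat_us} us latency -> {overhead_ms:.3f} ms/token of network latency")
```

At ~50 tokens/s a token budget is 20 ms, so a 3 µs fabric costs ~2 % of it while a 50 µs one would eat roughly a third, which is why RoCE/RDMA is non-negotiable here.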
Software, tooling, and clustering stack
- He uses a community GitHub project by “eugr” (spark‑vllm‑docker and llama‑beni) that provides a Docker‑based vLLM cluster setup and realistic LLM benchmarking tools.
- NCCL (“nickel”) handles multi‑node GPU collectives over RoCE, and he confirms it is actually using RoCE and both 100 Gbps links per node for ~200 Gbps effective bandwidth.
- Cluster orchestration tasks (SSH mesh, network config, copying models, running commands on all nodes) are largely automated via Claude acting over SSH on each machine and the managed switch.
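The SSH-mesh part of that automation is easy to sketch: every node needs a keyed connection to every other node. A minimal enumeration (hostnames are hypothetical):

```python
# Enumerate the full SSH mesh for an 8-node cluster: every ordered
# (source, target) pair except self-links.
from itertools import permutations

nodes = [f"spark{i}" for i in range(1, 9)]
mesh = list(permutations(nodes, 2))  # ordered pairs, self-links excluded

print(len(mesh))  # 8 * 8 - 8 = 56 directed connections to key-exchange and test
```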
Performance with 1–4 DGX Sparks
- On a “small” dense model (Qwen‑3 34B BF16, ~8 GB), four‑node clustering shows:
  - Prompt prefill (PP 2048) around 6300–8000 tokens/s depending on node count; generation around 50 tokens/s.
- Scaling from 1→2→4 nodes improves generation speed significantly vs. a single Spark (e.g., 23 t/s on 1 Spark vs. ~35 on 2 vs. ~50+ on 4), but network speed changes from 50→100 Gbps mainly impact generation latency rather than prefill.
- He notes counter-intuitive behavior: moving from 50 Gbps to 100 Gbps reduces prefill performance by ~19 % but improves generation by ~7 %; he attributes this to generation being communication-bound, while prefill is compute-bound and more sensitive to other factors.
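A simple cost model shows why decode responds to interconnect changes while prefill mostly doesn't: each generated token streams the (sharded) weights from memory once, plus a fixed communication cost. The bandwidth, weight size, and 2 ms comm figure below are illustrative assumptions, not measurements from the video:

```python
# Back-of-envelope decode rate: time per token = sharded-weight memory read
# + per-token all-reduce overhead (zero on a single node).
def decode_tok_s(weights_gb, mem_bw_gbs, n_nodes, comm_s):
    t = (weights_gb / n_nodes) / mem_bw_gbs + (comm_s if n_nodes > 1 else 0.0)
    return 1.0 / t

single = decode_tok_s(68, 273, 1, 0.0)   # one node streams all weights
quad = decode_tok_s(68, 273, 4, 2e-3)    # 4-way shard, assumed 2 ms/token comm
print(round(single, 1), round(quad, 1))
```

Shrinking `comm_s` (a faster, lower-latency fabric) directly raises the multi-node decode rate, whereas prefill batches thousands of tokens per collective and is dominated by compute instead.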
Scaling to 8 nodes and large models
- He upgrades to a MikroTik switch with four 400 Gbps ports, wiring eight total nodes (4 Sparks plus 4 similar boxes: Dell GB10, MSI Edge Expert, ASUS Ascent GX10) and redoes the IP/MTU settings (jumbo frames), the full SSH mesh (8 × 8 host pairs minus the 8 self-links = 56 connections), and model distribution.
- For Qwen‑3 34B BF16 on all 8 nodes, token generation rises only to ~60–64 tokens/s from the ~50+ seen on 4 nodes: prefill improves substantially, but generation gains are marginal because the model is too small to benefit from that many shards.
- On Qwen‑VL 32B BF16 (dense, ~66 GB on disk), scaling is more meaningful:
  - ~3.6 t/s on 1 node, ~6.1 t/s on 2 nodes, ~11.4 t/s on 4 nodes.
  - Each node uses ~63 GB VRAM due to model plus KV cache and context; scaling is “pretty damn good” for both prefill and generation across 4 nodes.
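The gap between the ~66 GB on disk and ~63 GB used per node is mostly KV cache and runtime buffers. A rough sizing formula (the layer/head counts are generic 32B-class assumptions, not Qwen-VL's real configuration):

```python
# Rough KV-cache size: K and V tensors for every layer, every KV head,
# every position in the context, at BF16 (2 bytes per element).
def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch=1, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per / 1e9

print(round(kv_cache_gb(64, 8, 128, 32_768), 2))  # GB for one 32k-token context
</n```

Even with grouped-query attention keeping `kv_heads` small, several gigabytes per long context adds up quickly on a 128 GB node.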
Extremely large models (hundreds of GB)
- He targets cluster-only, cutting-edge models that cannot fit even in the four-node cluster’s 512 GB of memory:
  - Qwen‑3.5 397B (Mixture‑of‑Experts, “Active 17”), ~800 GB on disk.
    - Takes ~7 min to shard across 8 nodes plus ~3 min to build CUDA graphs.
    - Runs at ~24 tokens/s generation with ~112 GB used per node (out of ~119 GB), which he frames as a strong result for a huge, SOTA model that literally cannot run on fewer than 8 GPUs in this setup.
  - “Kimi‑2” (another large VLM‑style model, ~600 GB), loads in ~15 min, uses ~115 GB per node and reaches ~13.3 tokens/s generation, which is slower but still only feasible on the full 8‑node cluster.
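The per-node memory figures check out with simple sharding arithmetic (the overhead split is an inference from the reported numbers, not stated in the video):

```python
# Sharding arithmetic for the ~800 GB MoE model across 8 nodes of ~119 GB each.
weights_gb = 800
nodes = 8
per_node_weights = weights_gb / nodes        # 100 GB of weights per node
reported_used = 112                          # GB per node reported in the video
overhead = reported_used - per_node_weights  # ~12 GB for KV cache, CUDA graphs, buffers
print(per_node_weights, overhead)
```

That ~7 GB of remaining headroom per node is why these models shard onto 8 nodes but nothing smaller.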
Key takeaways
- Three critical factors for LLMs on clusters are: GPU compute (prefill speed), memory bandwidth (decode/generation speed), and especially latency once you go multi‑node; RoCE and RDMA are essential to keep latency in the low microseconds.
- For small and mid‑size models, adding many nodes gives diminishing returns in generation speed; the cluster really shines for truly massive dense or MoE models that exceed single‑node memory or where you need huge context/concurrency.
- With 8 DGX‑class nodes and a correctly tuned 400 GbE RoCE fabric, he can practically run ~800 GB‑class frontier models at 10–25 tokens/s, which he considers a successful proof that 1 TB “VRAM” hobbyist‑level clusters are viable—albeit very expensive and fiddly to wire and configure.
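As a plausibility check on the headline number: a MoE model only streams its *active* parameters per token, so ~24 t/s for a 397B model with ~17B active is in the right ballpark. Per-node memory bandwidth and perfect expert balancing are assumptions here:

```python
# Memory-bandwidth-only decode ceiling for a MoE with ~17B active parameters,
# sharded across 8 nodes. Real throughput lands well below this ceiling once
# all-reduce latency and uneven expert placement are paid for.
active_b = 17                  # billions of active parameters per token
gb_per_token = active_b * 2    # BF16: ~34 GB streamed from memory per token
mem_bw_gbs = 273               # assumed per-node memory bandwidth (GB/s)
nodes = 8
t_mem = (gb_per_token / nodes) / mem_bw_gbs
print(round(1 / t_mem))        # upper bound in tokens/s
```

The measured ~24 t/s sitting at roughly a third of this idealized ceiling is consistent with the communication and sharding overheads discussed above.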