Twitter (X) is working on an AI project:
A few details about their implementation:
At the frontier of deep learning research, reliable infrastructure must be built with the same care as datasets and learning algorithms. To create Grok, we built a custom training and inference stack based on Kubernetes, Rust, and JAX.
LLM training runs like a freight train thundering ahead; if one car derails, the entire train is dragged off the tracks, making it difficult to set upright again. GPUs fail in myriad ways: manufacturing defects, loose connections, incorrect configuration, degraded memory chips, the occasional random bit flip, and more. When training, we synchronize computations across tens of thousands of GPUs for months on end, and at that scale all of these failure modes become frequent. To overcome these challenges, we employ a set of custom distributed systems that ensure that every type of failure is immediately identified and automatically handled. At xAI, we have made maximizing useful compute per watt the key focus of our efforts. Over the past few months, our infrastructure has enabled us to minimize downtime and maintain a high Model FLOPs Utilization (MFU) even in the presence of unreliable hardware.
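For readers unfamiliar with MFU, here is a minimal sketch of how it is commonly estimated: the FLOP/s the model's arithmetic actually requires, divided by the aggregate peak FLOP/s of the hardware. The text above does not say how xAI computes it. The function names, the parameters, and the `6 * params` FLOPs-per-token approximation (a widely used rule of thumb for dense transformers that ignores attention terms) are all assumptions made for illustration.

```rust
/// Approximate training FLOPs per token (forward + backward pass) for a
/// dense transformer with `params` parameters. Assumption: the common
/// 6 * N rule of thumb, which ignores attention-specific terms.
fn flops_per_token(params: f64) -> f64 {
    6.0 * params
}

/// Model FLOPs Utilization: achieved model FLOP/s divided by the
/// aggregate theoretical peak FLOP/s of all GPUs in the job.
/// All parameter names here are hypothetical.
fn mfu(params: f64, tokens_per_sec: f64, num_gpus: u32, peak_flops_per_gpu: f64) -> f64 {
    let achieved = flops_per_token(params) * tokens_per_sec;
    let peak = f64::from(num_gpus) * peak_flops_per_gpu;
    achieved / peak
}
```

Under this definition, time lost to a failed GPU or a restart lowers `tokens_per_sec` while the denominator stays fixed, which is why downtime shows up directly as lower MFU.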
Rust has proven to be an ideal choice for building scalable, reliable, and maintainable infrastructure. It offers high performance and a rich ecosystem, and it prevents the majority of bugs one would typically find in a distributed system. Given our small team size, infrastructure reliability is crucial; otherwise, maintenance starves innovation. Rust provides us with confidence that any code modification or refactor is likely to produce working programs that will run for months with minimal supervision.
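One concrete way Rust can provide this kind of confidence is exhaustive pattern matching: if a new failure category is added, every `match` that lacks a case for it stops compiling until it is handled. The sketch below is purely illustrative. The `GpuFailure` and `Recovery` types, the `plan_recovery` function, and the mapping between them are hypothetical and are not taken from xAI's actual stack; only the failure categories themselves come from the list in the text above.

```rust
/// Hypothetical failure categories, named after the list in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GpuFailure {
    ManufacturingDefect,
    LooseConnection,
    IncorrectConfiguration,
    DegradedMemory,
    BitFlip,
}

/// Hypothetical recovery actions, invented for this example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Recovery {
    ReplaceNode,
    ReseatAndRetest,
    ReapplyConfig,
    RestoreFromCheckpoint,
}

/// Maps each failure to a recovery action. There is deliberately no
/// wildcard (`_`) arm, so adding a variant to `GpuFailure` produces a
/// compile error here until the new case is handled explicitly.
fn plan_recovery(failure: GpuFailure) -> Recovery {
    match failure {
        GpuFailure::ManufacturingDefect | GpuFailure::DegradedMemory => Recovery::ReplaceNode,
        GpuFailure::LooseConnection => Recovery::ReseatAndRetest,
        GpuFailure::IncorrectConfiguration => Recovery::ReapplyConfig,
        GpuFailure::BitFlip => Recovery::RestoreFromCheckpoint,
    }
}
```

The design choice worth noting is the missing catch-all arm: a refactor that extends the failure taxonomy cannot silently fall through to a default behavior.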