Google has revealed details about the supercomputers it uses to train artificial intelligence (AI) models, claiming they are faster and more power-efficient than comparable Nvidia-based systems.
The Tensor Processing Unit (TPU) is a custom chip designed by Alphabet Inc's Google and used for more than 90 percent of the company's AI training work.
In a scientific paper released on Tuesday, Google explained how it has connected more than 4,000 of these chips with custom-developed optical switches to form a supercomputer. Because large language models are far too big to fit on a single chip, improving the connections between chips has become a key point of competition among AI supercomputer builders.
Google's PaLM model was trained over 50 days by splitting it across two of the 4,000-chip supercomputers. The supercomputers also allow the interconnect topology to be reconfigured, enabling performance gains and the ability to route around failed components.
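For readers curious about what "splitting a model across chips" looks like in practice, here is a minimal sketch in JAX, Google's own machine learning framework, that shards a toy weight matrix and input batch across a small logical mesh of devices. The 2x4 mesh, the axis names, and the CPU-simulated devices are illustrative assumptions, not PaLM's actual training configuration.

```python
# Minimal sketch of sharding computation across a device mesh in JAX.
# The 8 simulated CPU devices stand in for real accelerator chips.
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the devices into a 2x4 logical mesh: "data" for batch
# parallelism, "model" for splitting weights across chips.
mesh = Mesh(np.array(jax.devices()).reshape(2, 4),
            axis_names=("data", "model"))

# Shard a toy weight matrix along the "model" axis and a batch along "data".
w = jax.device_put(jnp.ones((512, 512)), NamedSharding(mesh, P(None, "model")))
x = jax.device_put(jnp.ones((16, 512)), NamedSharding(mesh, P("data", None)))

@jax.jit
def forward(x, w):
    return x @ w  # XLA inserts the cross-chip communication automatically

y = forward(x, w)
print(y.sharding)  # the result stays distributed across the mesh
```

At real scale, the speed of that automatically inserted cross-chip communication is exactly what the interconnect competition is about.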
According to the paper, the fourth-generation TPU is up to 1.7 times faster and 1.9 times more power-efficient than a comparable system built on Nvidia's A100 chip. The system has been online since 2020 in a data center in Oklahoma, but Google did not compare its fourth-generation TPU with Nvidia's current flagship H100 chip, because the H100 came to market later and is made with newer technology.
Google hinted that it might be working on a new TPU to compete with the Nvidia H100, but provided no further details.
“Circuit switching makes it easy to route around failed components,” Google Fellow Norm Jouppi and Google Distinguished Engineer David Patterson wrote in a blog post about the system. “This flexibility allows us to change the topology of the supercomputer interconnect to accelerate the performance of a machine learning model.”
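To make the quoted idea concrete, the toy sketch below models an interconnect as a small wraparound grid and finds a path that avoids a failed node. The grid size, the chosen failure, and the use of the networkx graph library are illustrative assumptions, not Google's optical circuit-switching implementation.

```python
# Toy illustration of "routing around failed components": model the
# interconnect as a small wraparound (torus-like) grid, mark one node
# as failed, and find a path that avoids it.
import networkx as nx

# 4x4 grid with wraparound links, loosely echoing a torus interconnect.
torus = nx.grid_2d_graph(4, 4, periodic=True)

failed = (1, 1)  # assume this chip has failed (illustrative choice)
healthy = torus.copy()
healthy.remove_node(failed)

# A shortest path between two chips that skips the failed one.
path = nx.shortest_path(healthy, source=(0, 0), target=(2, 2))
print(path)
```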