Amazon.com has announced Trainium2, an AI chip tailored for training AI systems within its cloud computing service, in a bid to intensify competition with Microsoft and fortify its position in the artificial intelligence (AI) market.
Amazon Web Services (AWS) CEO Adam Selipsky introduced Trainium2 at AWS re:Invent, the company's conference in Las Vegas, saying the chip delivers a four-fold increase in training speed over its predecessor and twice the energy efficiency.
Customers including Anthropic, Databricks, Datadog, Epic, Honeycomb, and SAP are adopting the new AWS-designed chips.
AWS says Graviton4 processors deliver up to 30 percent better compute performance, 50 percent more cores, and 75 percent more memory bandwidth than Graviton3. Graviton4 raises the bar on security by encrypting all high-speed physical hardware interfaces.
Graviton4 will be available in memory-optimized Amazon EC2 R8g instances, enabling customers to improve the execution of their high-performance databases, in-memory caches, and big data analytics workloads. R8g instances offer larger instance sizes with up to 3x more vCPUs and 3x more memory than current generation R7g instances. This allows customers to process larger amounts of data, scale their workloads, improve time-to-results, and lower their total cost of ownership.
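A quick sketch of what the "3x" claims would imply for the top-end R8g size, assuming the multiplier is relative to r7g.16xlarge (64 vCPUs and 512 GiB of memory, the largest current R7g size); the actual R8g lineup is whatever AWS publishes at general availability:

```python
# Implied top-end R8g size from the quoted "up to 3x" figures.
# Assumption: the baseline is r7g.16xlarge (64 vCPUs / 512 GiB).
r7g_max_vcpus, r7g_max_mem_gib = 64, 512
scale = 3  # quoted vCPU and memory multiplier vs R7g

print(f"Implied R8g max: {r7g_max_vcpus * scale} vCPUs, "
      f"{r7g_max_mem_gib * scale} GiB memory")
# → Implied R8g max: 192 vCPUs, 1536 GiB memory
```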
Customers including Datadog, DirecTV, Discovery, Formula 1 (F1), NextRoll, Nielsen, Pinterest, SAP, Snowflake, Sprinklr, Stripe, and Zendesk use Graviton-based instances to run a range of workloads, such as databases, analytics, web servers, batch processing, ad serving, application servers, and microservices.
Customers including Databricks, Helixon, Money Forward, and the Amazon Search team use Trainium to train large-scale deep learning models.
AWS says Trainium2 is designed to deliver up to 4x faster training performance and 3x more memory capacity compared to first generation Trainium chips, while improving energy efficiency (performance/watt) up to 2x.
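Taken together, those "up to" ratios have a back-of-envelope consequence for power draw, assuming the peak figures hold simultaneously (which AWS does not state): quadrupled performance at doubled performance per watt implies roughly doubled power consumption per chip.

```python
# Sketch: power-draw implication of the quoted Trainium2 ratios.
# Assumption: the "up to" figures apply at the same time.
perf_ratio = 4.0           # training performance vs first-gen Trainium
perf_per_watt_ratio = 2.0  # quoted energy-efficiency improvement

power_ratio = perf_ratio / perf_per_watt_ratio
print(f"Implied power draw vs first-gen Trainium: {power_ratio:.1f}x")
# → Implied power draw vs first-gen Trainium: 2.0x
```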
Trainium2 will be available in Amazon EC2 Trn2 instances, each containing 16 Trainium2 chips. Trn2 instances are intended to let customers scale up to 100,000 Trainium2 chips in next-generation EC2 UltraClusters, interconnected with AWS Elastic Fabric Adapter (EFA) petabit-scale networking. At that scale the clusters deliver up to 65 exaflops of compute, giving customers on-demand access to supercomputer-class performance and allowing them to train a 300-billion-parameter LLM in weeks rather than months.
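The quoted cluster figures also imply a per-chip throughput, assuming peak numbers and ideal scaling (a simplification; real utilization is lower):

```python
# Back-of-envelope check of the UltraCluster figures quoted above.
EXAFLOP, TERAFLOP = 1e18, 1e12

cluster_flops = 65 * EXAFLOP  # quoted peak for a full UltraCluster
num_chips = 100_000           # quoted maximum Trainium2 chips per cluster

per_chip_tflops = cluster_flops / num_chips / TERAFLOP
print(f"Implied peak per Trainium2 chip: {per_chip_tflops:.0f} TFLOPS")
# → Implied peak per Trainium2 chip: 650 TFLOPS
```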
The unveiling of Trainium2 follows Microsoft's recent announcement of its own AI chip, Maia, fueling the rivalry between the tech giants in AI hardware. Amazon's Trainium2 will also compete with AI chips from Alphabet's Google, which has offered its Tensor Processing Unit (TPU) to cloud computing customers since 2018, according to a Reuters report.
Selipsky said AWS plans to roll out the new training chips next year. The surge in custom chip development is a direct response to escalating demand for computing power, especially for training the large language models that underpin services such as ChatGPT.
The cloud computing giants are positioning their chips as complements to those of Nvidia, the predominant AI chip maker, whose products have been in short supply over the past year. AWS also disclosed plans to offer Nvidia's latest chips on its cloud service, expanding its array of AI offerings.
Additionally, Selipsky introduced Graviton4, AWS's fourth-generation custom central processing unit (CPU), which is 30 percent faster than its predecessor. The announcement comes shortly after Microsoft revealed its own custom CPU, Cobalt, aimed at challenging Amazon's Graviton series.
Both AWS and Microsoft are leveraging technology from Arm Ltd. in their chip designs, continuing a shift away from Intel and Advanced Micro Devices (AMD) chips in cloud computing infrastructure.
Oracle, for its part, is turning to chips from startup Ampere Computing for its cloud service, underscoring the diversification of chip architectures and the race for dominance in the burgeoning AI market.
Anthropic, an AI safety and research company, recently launched Claude, an AI assistant. "Since launching on Amazon Bedrock, Claude has seen rapid adoption from AWS customers," said Tom Brown, co-founder of Anthropic.
“We are working with AWS to develop our future foundation models using Trainium chips. Trainium2 will help us build and train models at a very large scale, and we expect it to be at least 4x faster than first generation Trainium chips for some of our key workloads.”
More than 10,000 organizations — including Comcast, Condé Nast, and over 50 percent of the Fortune 500 — rely on Databricks to unify their data, analytics, and AI.
“Thousands of customers have implemented Databricks on AWS, giving them the ability to use MosaicML to pre-train, finetune, and serve FMs for a variety of use cases,” said Naveen Rao, vice president of Generative AI at Databricks.
“AWS Trainium gives us the scale and high performance needed to train our Mosaic MPT models, and at a low cost. As we train our next generation Mosaic MPT models, Trainium2 will make it possible to build models even faster, allowing us to provide our customers unprecedented scale and performance so they can bring their own generative AI applications to market more rapidly.”
Datadog, an observability and security platform, already runs half of its Amazon EC2 fleet on Graviton.
“Integrating Graviton4-based instances into our environment was seamless, and gave us an immediate performance boost out of the box, and we’re looking forward to using Graviton4 when it becomes generally available,” said Laurent Bernaille, principal engineer at Datadog.
Epic, a leading entertainment company and provider of 3D engine technology, operates Fortnite, one of the world’s largest games with over 350 million accounts and 2.5 billion friend connections.
“AWS Graviton4 instances are the fastest EC2 instances we’ve ever tested, and they are delivering outstanding performance across our most competitive and latency sensitive workloads,” said Roman Visintine, lead cloud engineer at Epic. “We look forward to using Graviton4 to improve player experience and expand what is possible within Fortnite.”
Honeycomb, an observability platform, has evaluated AWS Graviton4-based R8g instances.
“In tests, our Go-based OpenTelemetry data ingestion workload required 25 percent fewer replicas on the Graviton4-based R8g instances compared to Graviton3-based C7g/M7g/R7g instances—and additionally achieved a 20 percent improvement in median latency and 10 percent improvement in 99th percentile latency,” said Liz Fong-Jones, Field CTO at Honeycomb.
SAP HANA Cloud, SAP’s cloud-native in-memory database, is the data management foundation of SAP Business Technology Platform (SAP BTP).
“As part of the migration process of SAP HANA Cloud to AWS Graviton-based Amazon EC2 instances, we have already seen up to 35 percent better price performance for analytical workloads. In the coming months, we look forward to validating Graviton4, and the benefits it can bring to our joint customers,” said Juergen Mueller, CTO of SAP.