Microsoft has taken a decisive step to reshape the economics of large-scale AI with the introduction of Maia 200, its most powerful in-house inference accelerator to date. Rather than positioning the chip as a simple performance upgrade, Microsoft is framing Maia 200 as a strategic infrastructure shift aimed at one core challenge facing AI adoption today: the soaring cost of generating tokens at scale.

At a time when enterprises are racing to deploy larger and more capable AI models, inference efficiency has become just as critical as raw compute. Maia 200 is designed squarely for this phase of AI, where serving models reliably, quickly and cost-effectively matters more than headline benchmark wins alone, Microsoft said in a blog post.
Designed for the inference era
Maia 200 is fabricated on TSMC’s advanced 3-nanometer process and packs more than 140 billion transistors into a single accelerator. The chip is optimized for low-precision compute, reflecting the industry’s shift toward FP8 and FP4 formats for inference-heavy workloads. Each Maia 200 delivers more than 10 petaFLOPS of FP4 performance and over 5 petaFLOPS at FP8, operating within a 750-watt system-on-chip power envelope.
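For a rough sense of scale, those headline figures work out to roughly 13 TFLOPS per watt at FP4 and about 6.7 TFLOPS per watt at FP8. The short sketch below makes that back-of-the-envelope calculation explicit; it uses only the vendor-quoted numbers above, not measured data.

```python
# Back-of-the-envelope efficiency from the figures quoted in the article.
# These are vendor headline numbers (lower bounds), not measured results.
fp4_pflops = 10.0        # "more than 10 petaFLOPS" at FP4
fp8_pflops = 5.0         # "over 5 petaFLOPS" at FP8
soc_power_watts = 750    # stated system-on-chip power envelope

fp4_tflops_per_watt = fp4_pflops * 1000 / soc_power_watts
fp8_tflops_per_watt = fp8_pflops * 1000 / soc_power_watts

print(f"FP4: ~{fp4_tflops_per_watt:.1f} TFLOPS/W")   # ~13.3 TFLOPS/W
print(f"FP8: ~{fp8_tflops_per_watt:.1f} TFLOPS/W")   # ~6.7 TFLOPS/W
```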
Microsoft claims this makes Maia 200 its most efficient inference system so far, offering around 30 percent better performance per dollar than the latest hardware already deployed across its data centers. In practical terms, the company says the accelerator can comfortably handle today’s largest production models while leaving room for the next generation of even more demanding AI systems.
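As a hedged illustration of what a 30 percent performance-per-dollar gain means for serving costs, the snippet below uses a purely hypothetical baseline price per million tokens; only the 30 percent figure comes from Microsoft's claim.

```python
# Illustration only: the 1.3x performance-per-dollar factor is Microsoft's claim;
# the baseline cost per million tokens is a hypothetical placeholder.
baseline_cost_per_million_tokens = 1.00   # hypothetical baseline, in dollars
perf_per_dollar_gain = 1.30               # ~30% better performance per dollar

new_cost = baseline_cost_per_million_tokens / perf_per_dollar_gain
print(f"Implied serving cost: ${new_cost:.2f} per million tokens (~23% lower)")
```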
Memory and data movement take center stage
Instead of focusing only on compute, Maia 200 tackles one of the biggest bottlenecks in modern AI systems: feeding data to massive models fast enough to keep them fully utilized. The accelerator integrates 216 GB of HBM3e memory delivering up to 7 TB per second of bandwidth, complemented by 272 MB of on-chip SRAM.
This memory subsystem is tightly coupled with specialized data movement engines, a custom network-on-chip fabric and narrow-precision data paths. The result is higher token throughput and more consistent performance, especially for large language models that are sensitive to latency and memory stalls.
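To see why bandwidth matters so much for serving, consider a simple bandwidth-bound estimate: during autoregressive decoding, each generated token typically requires reading the model weights (plus KV cache) from memory, so the single-stream token rate per chip is roughly bandwidth divided by bytes touched per token. The sketch below uses the stated 7 TB per second figure; the model size and precision are illustrative assumptions, not Microsoft's numbers.

```python
# Rough memory-bandwidth roofline for single-batch decode.
# Assumption: each token reads all weights once; KV-cache traffic is ignored.
hbm_bandwidth_bytes = 7e12          # 7 TB/s HBM3e bandwidth (stated)
params = 200e9                      # hypothetical 200B-parameter model
bytes_per_param = 1                 # FP8 weights: 1 byte per parameter

bytes_per_token = params * bytes_per_param
max_tokens_per_sec = hbm_bandwidth_bytes / bytes_per_token
print(f"Bandwidth-bound ceiling: ~{max_tokens_per_sec:.0f} tokens/s per chip")  # ~35
```

Under those assumptions, a 200-billion-parameter FP8 model also occupies roughly 200 GB of weights, which fits within the 216 GB of HBM3e on a single accelerator; batching requests raises effective throughput well above this single-stream ceiling.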
Industry-leading AI accelerator capabilities
Azure Maia 200 stands out as a performance-focused inference accelerator that outpaces rival hyperscaler chips in several key areas. Built on a 3-nanometer process like AWS Trainium3 and Google TPU v7, Maia 200 distinguishes itself with significantly higher low-precision compute, delivering over 10,000 FP4 TFLOPS and more than 5,000 FP8 TFLOPS, well ahead of Trainium3 and competitive with TPU v7 at FP8.
Maia 200 also offers a larger memory footprint, with 216 GB of HBM3e, compared with 144 GB on Trainium3 and 192 GB on TPU v7. Its memory bandwidth of 7 TB per second is close to TPU v7's and clearly higher than Trainium3's, supporting faster data movement for large models.
In scale-up networking, Maia 200 leads with 2.8 TB per second of bidirectional bandwidth, more than double TPU v7 and at the top end of Trainium3's range. Taken together, these figures position Maia 200 as a strong inference-optimized platform, combining superior low-precision compute, large high-bandwidth memory and robust interconnects to improve performance per dollar for large-scale AI workloads.
Built into Microsoft’s AI stack
Maia 200 is not a standalone experiment. It is a core component of Microsoft’s heterogeneous AI infrastructure and will support a wide range of workloads across Azure. The accelerator is already being used to run advanced models, including the latest GPT-5.2 systems from OpenAI, powering services such as Microsoft Foundry and Microsoft 365 Copilot.
Internally, Microsoft’s Superintelligence team is also using Maia 200 for synthetic data generation and reinforcement learning. The accelerator’s architecture is particularly well suited for creating and filtering large volumes of high-quality, domain-specific synthetic data, which can then be fed back into training pipelines to improve future models.
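The general pattern behind such a pipeline is a generate-then-filter loop: sample many candidate responses, score them for quality, and keep only the samples worth feeding back into training. The sketch below illustrates that loop in generic form; the generator, scoring function and threshold are placeholders, not details of Microsoft's pipeline.

```python
# Generic generate-then-filter loop for synthetic training data.
# generate_candidates() and quality_score() are hypothetical placeholders;
# nothing here reflects Microsoft's actual pipeline.
def build_synthetic_dataset(prompts, generate_candidates, quality_score,
                            threshold=0.8, samples_per_prompt=4):
    kept = []
    for prompt in prompts:
        for candidate in generate_candidates(prompt, n=samples_per_prompt):
            # Keep only high-quality, on-domain samples for the training pipeline.
            if quality_score(prompt, candidate) >= threshold:
                kept.append({"prompt": prompt, "response": candidate})
    return kept
```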
Scaling with standard networking
At the system level, Maia 200 introduces a two-tier scale-up network built on standard Ethernet rather than proprietary interconnects. Each accelerator provides 2.8 TB per second of bidirectional scale-up bandwidth, enabling predictable collective operations across clusters of up to 6,144 accelerators.
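As a rough illustration of what that scale-up bandwidth implies for collective operations, the sketch below estimates the ideal time for a ring all-reduce across a group of accelerators, using the textbook 2(N-1)/N traffic factor. The group size and message size are illustrative assumptions; only the 2.8 TB per second figure comes from the announcement.

```python
# Lower-bound ring all-reduce time from the stated scale-up bandwidth.
# Ignores latency, protocol overhead and congestion; group/message sizes are illustrative.
bidir_bandwidth = 2.8e12            # 2.8 TB/s bidirectional (stated)
per_direction_bw = bidir_bandwidth / 2

n_devices = 64                      # hypothetical collective group
tensor_bytes = 2 * 1024**3          # hypothetical 2 GiB of activations/gradients

traffic_factor = 2 * (n_devices - 1) / n_devices
time_s = traffic_factor * tensor_bytes / per_direction_bw
print(f"Ideal all-reduce time: ~{time_s * 1e3:.2f} ms")   # ~3 ms under these assumptions
```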
Within a server tray, four Maia accelerators are directly connected through non-switched links, keeping high-bandwidth traffic local and reducing latency. The same transport protocol is used within racks and across racks, simplifying scaling while lowering power consumption and total cost of ownership across Microsoft’s global cloud footprint.
Cloud-native from day one
Microsoft emphasizes that Maia 200 was designed with the data center in mind from its earliest stages. A pre-silicon development environment modeled the real computation and communication patterns of large language models, allowing engineers to co-design silicon, networking and system software long before the first chips were manufactured.
This approach paid off in deployment speed. According to Microsoft, AI models were running on Maia 200 within days of the first packaged parts arriving, and the time from first silicon to full data center rack deployment was less than half that of comparable AI infrastructure programs. Native integration with Azure’s control plane also brings built-in security, telemetry and management at both chip and rack levels.
Developer access through the Maia SDK
To encourage adoption, Microsoft is previewing the Maia software development kit, which provides tools to build and optimize models specifically for Maia 200. The SDK includes PyTorch support, a Triton compiler, optimized kernel libraries, a low-level programming language and simulation tools to estimate performance and cost early in the development cycle.
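The Triton support suggests that kernels written in the open-source Triton language can be compiled for Maia 200. As an illustration of the kind of kernel that path targets, here is a standard vector-add written against the upstream triton Python package; how it maps onto Maia-specific code paths is not detailed in the announcement, so treat this as a generic Triton example rather than Maia SDK code.

```python
import torch
import triton
import triton.language as tl

# A standard Triton kernel (upstream triton package); the Maia SDK's Triton
# compiler would be the component responsible for targeting Maia 200 hardware.
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```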
Maia 200 is currently deployed in Microsoft’s US Central data center region near Des Moines, with expansion planned for the US West 3 region near Phoenix and additional locations to follow.
With Maia 200, Microsoft is signaling that the next phase of AI innovation will be defined not just by bigger models, but by smarter infrastructure. By focusing on inference efficiency, system-level design and tight cloud integration, the company is positioning its custom silicon as a long-term foundation for AI at global scale.
RAJANI BABURAJAN

