infotechlead

Strategies for optimizing network deployments to support AI workloads

As artificial intelligence (AI) becomes core to enterprise transformation strategies, CIOs are increasingly focused on ensuring that their data center networks can efficiently support GPU-intensive AI workloads.

Naresh Singh, Sr Director Analyst at Gartner

According to a Gartner report, most current enterprise switch deployments are not optimized for AI and generative AI (GenAI) traffic patterns — leading to up to 30 percent loss in processing efficiency through 2028. To unlock the full potential of AI investments, infrastructure and operations (I&O) leaders must rethink their network design and deployment strategies.

# Adopt AI-Specific Network Architectures

Traditional data center topologies are often ill-suited for the unique traffic characteristics of GPUs. AI workloads demand high bandwidth, low latency, and lossless data transfer across tightly coupled compute nodes. Organizations should:
- Build dedicated, high-performance switching layers specifically for AI traffic.
- Flatten network topologies to minimize physical tiers, reducing latency and congestion.
- Consider rack- or row-optimized architectures that align with AI infrastructure footprints.
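
To make the trade-off behind flattening concrete, the sketch below estimates how many endpoints a non-blocking folded-Clos (leaf-spine) fabric can serve at a given number of tiers, and how many switches a worst-case path crosses. It is a minimal illustration, not a design tool: the 64-port radix is an assumed value, and the function names are hypothetical.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking folded-Clos fabric.

    Assumes identical switches with `radix` ports each, with half the
    ports facing down and half facing up at every tier except the top.
    """
    if radix < 2 or radix % 2 != 0 or tiers < 1:
        raise ValueError("radix must be even and >= 2; tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


def worst_case_switch_hops(tiers: int) -> int:
    """Switches crossed when traffic climbs to the top tier and back down."""
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return 2 * tiers - 1


if __name__ == "__main__":
    # Assumed radix of 64 ports per switch, for illustration only.
    for tiers in (2, 3):
        print(tiers, max_endpoints(64, tiers), worst_case_switch_hops(tiers))
```

Under these assumptions a two-tier fabric keeps every path within three switches, while a third tier adds capacity at the cost of two more switch hops on the longest path.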

# Select the Right Interconnect Technologies

Choosing the appropriate networking fabric is critical. Each option (Ethernet, InfiniBand, or custom rack-scale interconnects) has trade-offs in performance, supportability, and scalability.
- InfiniBand offers ultra-low latency and is often preferred for AI training workloads.
- Ethernet fabrics based on RDMA over Converged Ethernet (RoCE), designed for high-throughput, lossless performance, are gaining traction in Ethernet-based AI environments.
- Rack-optimized links can help achieve dense, efficient interconnects where space and power are at a premium.

# Enhance Ethernet Networks for AI Readiness

For enterprises committed to Ethernet, it is essential to evolve standard configurations:
- Build lossless Ethernet fabrics with technologies like RDMA over Converged Ethernet (RoCE).
- Select vendors that support Ultra Ethernet Consortium standards for better end-to-end optimization.
- Deploy switches that support deep buffer architectures and AI-specific traffic scheduling.
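
One rough way to reason about buffer depth is the bandwidth-delay product: the amount of data that can be in flight on a link during one round trip. The sketch below is a back-of-the-envelope calculation only, not vendor sizing guidance. The 400 Gb/s link rate and 10 microsecond round-trip time are assumed values, and `in_flight_bytes` is a hypothetical helper.

```python
def in_flight_bytes(link_gbps: int, rtt_ns: int) -> int:
    """Bandwidth-delay product in bytes.

    Gb/s multiplied by nanoseconds gives bits directly
    (1e9 bits/s * 1e-9 s), so integer arithmetic stays exact.
    """
    if link_gbps <= 0 or rtt_ns < 0:
        raise ValueError("link_gbps must be positive and rtt_ns non-negative")
    return link_gbps * rtt_ns // 8


if __name__ == "__main__":
    # Assumed values: a 400 Gb/s port and a 10 microsecond round trip.
    print(in_flight_bytes(400, 10_000))
```

Real buffer requirements also depend on traffic burstiness, the number of congested ports, and how flow control is configured, so treat the result as a lower-bound intuition rather than a specification.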

# Prioritize Vendor Interoperability and Co-Certification

AI infrastructure is often composed of best-of-breed components. To reduce deployment risk and improve performance:
- Work with vendors that offer co-certified solutions, validating that GPUs, switches, and software stacks work seamlessly together.
- Consider “full-stack” AI platforms that integrate compute, storage, and networking into pre-tested configurations.

# Plan for Edge and On-Prem AI Growth

Gartner projects that by 2028, over 20 percent of enterprises will run AI training and inference workloads locally — up from fewer than 2 percent in early 2025. This shift demands more strategic investment in on-prem data center networks:
- Anticipate growth in AI edge deployments where latency sensitivity and data sovereignty require local processing.
- Ensure that networking infrastructure in these environments can scale dynamically and support AI-ready interconnects.

# Budget for Performance, Not Just Cost

While networking may represent a smaller portion of AI infrastructure budgets, its impact on performance is disproportionately high:
- A poorly optimized network can reduce GPU utilization and elongate training times.
- Investing in high-performance networking can improve return on AI investments by accelerating outcomes.
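
The budget effect can be sketched with simple arithmetic. The example below reads the "up to 30 percent loss in processing efficiency" figure cited earlier as an efficiency factor of 0.70, which is an interpretation rather than Gartner's own model. The GPU-hour count, the hourly price, and the function names are hypothetical values chosen for illustration.

```python
def actual_gpu_hours(ideal_gpu_hours: float, efficiency: float) -> float:
    """GPU-hours consumed when only `efficiency` of each hour is useful."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return ideal_gpu_hours / efficiency


def wasted_spend(ideal_gpu_hours: float, cost_per_gpu_hour: float,
                 efficiency: float) -> float:
    """Extra spend caused by the efficiency loss, in the same currency."""
    extra_hours = actual_gpu_hours(ideal_gpu_hours, efficiency) - ideal_gpu_hours
    return extra_hours * cost_per_gpu_hour


if __name__ == "__main__":
    # Hypothetical job: 10,000 ideal GPU-hours at 2.00 per GPU-hour,
    # running at 70 percent efficiency because of network bottlenecks.
    print(round(actual_gpu_hours(10_000, 0.7)))
    print(round(wasted_spend(10_000, 2.0, 0.7), 2))
```

Comparing the wasted spend against the price difference between two network designs gives a first-order view of whether the faster fabric pays for itself.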


# Tips

Many infrastructure leaders tasked with on-prem deployments get caught up in the GPU decision and pay too little attention to other important aspects of the infrastructure, such as the network and the software stack. I&O leaders must avoid this mistake. If you have the best GPUs but cannot use them properly because of ineffective network or software capabilities, the result is wasted costly resources and significant project delays, which can lead to operational failures and even business loss.

I&O leaders must design and architect their infrastructure based on the key objectives and scope of the AI team. This requires close coordination with key stakeholders such as senior executives, business unit leaders, data science and machine learning teams, and the security group. Only through this coordination can one arrive at a sound selection and sizing of the infrastructure. It is best to start with modest investments, then iterate and scale from there, learning and building good practice with relevant KPIs and good governance.

# Insights on Networking Practices

Some organizations with highly skilled engineering teams push the boundaries of innovation by investing in their own customized solutions, or by working with vendors and communities on standards and open-source technologies that drive efficiencies and economies of scale. Examples: Google’s TPUs, AWS Trainium, UEC, UAL. The objective here is to differentiate and to extract significant value from the entire value chain, including the infrastructure.

Others depend on leading infrastructure vendors with a tried and tested technology stack, including ML and data-science tools and libraries. These are customers that want to focus more on their core business and the application layers, Naresh Singh, Sr Director Analyst at Gartner, said.

As enterprises accelerate AI adoption, their networks must evolve to keep pace with the demands of GPU-centric workloads. I&O leaders must adopt new design principles, embrace emerging standards, and work closely with vendors to ensure that AI infrastructure is fully supported by high-performance, scalable, and efficient network architectures. Optimizing data center networks for AI is no longer optional — it’s a prerequisite for realizing the business value of AI at scale.

Baburajan Kizhakedath
