Job Location: Hyderabad/Bangalore
Remote/Hybrid
Role
The InfiniBand Architect will lead the end-to-end architecture, design, deployment, and optimization of high-performance InfiniBand fabrics for large-scale GPU clusters supporting AI/ML and HPC workloads. This role owns topology design (Fat-Tree/Clos), routing and congestion control strategy, SHARP integration, hardware evaluation for next-generation bandwidth (NDR/XDR), and operational excellence through automation, telemetry, and mentoring of engineering teams.
Core Responsibilities
- Lead the design of high-performance topologies, specifically non-blocking Fat-Tree (Clos) fabrics optimized for massive GPU clusters.
- Define the strategy for Subnet Manager placement, redundancy, and failover to ensure stable fabric initialization and continuous operation.
- Select and tune routing engines (e.g., Up/Down, Fat-Tree) to keep the fabric free of credit loops, and configure InfiniBand Congestion Control to mitigate head-of-line blocking.
- Architect the integration of NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) to offload data reduction tasks from GPUs to the network.
- Establish baseline metrics for fabric latency and throughput, using perftest tools such as ib_send_lat and ib_write_bw to validate the environment against AI workload requirements.
- Drive the technical evaluation of NDR (400G) and XDR (800G) hardware, including switches, HCAs (Host Channel Adapters), and optical interconnects.
- Work with GPU systems teams and data center facility managers to plan power, cooling, and cabling for high-density racks.
- Lead and mentor mid-level network engineers, establishing best practices for fabric operations and documentation.
Mandatory Skillset
- Deep understanding of the InfiniBand architecture layers, including LID/GUID management, Partition Keys (P_Keys), and Virtual Lanes (VLs).
- Expert-level knowledge of GPUDirect RDMA and the implementation of high-speed storage over InfiniBand.
- Proficiency in advanced troubleshooting using ibdiagnet, ibqueryerrors, and fabric-wide telemetry to identify flapping links or symbol errors.
- Expertise in host- and kernel-level tuning, including PCIe settings and OFED driver optimization.
- Strong hands-on experience designing and operating large-scale InfiniBand fabrics for AI/HPC environments.
- Experience with routing and congestion control configuration in InfiniBand, and performance validation using benchmarking tools.
Optional Skillset
- Experience with NVIDIA SHARP deployment, tuning, and validation for collective offload in GPU clusters.
- Familiarity with automation and configuration management (e.g., Ansible, Python) for fabric provisioning and compliance.
- Knowledge of Kubernetes/Slurm integration patterns for RDMA-capable workloads and multi-tenant fabrics.
- Experience with optical planning (OSFP/QSFP, AOCs, transceivers), link budget analysis, and structured cabling for high-density environments.
- Exposure to multi-site or multi-fabric architectures, gateways (e.g., InfiniBand to Ethernet), and interconnect resiliency strategies.
Qualifications
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field (Master’s preferred), or equivalent practical experience.
- 8+ years of experience in high-performance networking, with 4+ years focused on InfiniBand in AI/HPC environments.
- Demonstrated experience leading architecture initiatives, influencing cross-functional stakeholders, and mentoring engineers.
- Strong documentation and communication skills, with the ability to translate performance requirements into scalable designs.


