Job Location: Hyderabad/Bangalore

Remote/Hybrid

Role

The InfiniBand Architect will lead the end-to-end architecture, design, deployment, and optimization of high-performance InfiniBand fabrics for large-scale GPU clusters supporting AI/ML and HPC workloads. This role owns topology design (Fat-Tree/Clos), routing and congestion control strategy, SHARP integration, hardware evaluation for next-generation bandwidth (NDR/XDR), and operational excellence through automation, telemetry, and mentoring of engineering teams.

Core Responsibilities

  • Lead the design of high-performance topologies, specifically Fat-Tree (Clos) optimized for massive GPU clusters and non-blocking communication.
  • Define the strategy for Subnet Manager placement, redundancy, and failover to ensure stable fabric initialization and continuous operation.
  • Select and tune routing engines (e.g., Up/Down, Fat-Tree) to avoid credit loops, and configure InfiniBand Congestion Control to mitigate head-of-line blocking.
  • Architect the integration of NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) to offload data reduction tasks from GPUs to the network.
  • Establish baseline metrics for fabric latency and throughput, using perftest tools such as ib_send_lat and ib_write_bw to validate the environment against AI workload requirements.
  • Drive the technical evaluation of NDR (400G) and XDR (800G) hardware, including switches, HCAs (Host Channel Adapters), and optical interconnects.
  • Align with GPU systems teams and data center facility managers to plan power, cooling, and cabling for high-density racks.
  • Lead and mentor mid-level network engineers, establishing best practices for fabric operations and documentation.

Mandatory Skillset

  • Deep understanding of the InfiniBand architecture layers, including LID/GUID management, Partition Keys (P_Keys), and Virtual Lanes (VLs).
  • Expert-level knowledge of GPUDirect RDMA and the implementation of high-speed storage over InfiniBand.
  • Proficiency in advanced troubleshooting using ibdiagnet, ibqueryerrors, and fabric-wide telemetry to identify flapping links or symbol errors.
  • Expertise in kernel-level tuning, including PCIe settings and OFED driver optimization.
  • Strong hands-on experience designing and operating large-scale InfiniBand fabrics for AI/HPC environments.
  • Experience with routing and congestion control configuration in InfiniBand, and performance validation using benchmarking tools.

Optional Skillset

  • Experience with NVIDIA SHARP deployment, tuning, and validation for collective offload in GPU clusters.
  • Familiarity with automation and configuration management (e.g., Ansible, Python) for fabric provisioning and compliance.
  • Knowledge of Kubernetes/Slurm integration patterns for RDMA-capable workloads and multi-tenant fabrics.
  • Experience with optical planning (OSFP/QSFP, AOCs, transceivers), link budget analysis, and structured cabling for high-density environments.
  • Exposure to multi-site or multi-fabric architectures, gatewaying (e.g., InfiniBand to Ethernet), and interconnect resiliency strategies.

Qualifications

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field (Master’s preferred), or equivalent practical experience.
  • 8+ years of experience in high-performance networking, with 4+ years focused on InfiniBand in AI/HPC environments.
  • Demonstrated experience leading architecture initiatives, influencing cross-functional stakeholders, and mentoring engineers.
  • Strong documentation and communication skills, with the ability to translate performance requirements into scalable designs.
