Job Location: Remote

Top Skills

GPU hardware, Azure Administrator, HPC

Job Summary

We are seeking a highly skilled GPU Infrastructure Engineer to join our team. This role focuses on the design, implementation, and management of enterprise network and cloud-based infrastructure to support evolving Azure cloud needs. The ideal candidate will have a strong background in software, network, or systems engineering, along with hands-on experience in managing large-scale cloud and data center operations.

Responsibilities

  • Respond to incidents during regular on-call rotations and resolve issues efficiently to minimize downtime.
  • Design and plan scalable GPU infrastructure solutions to meet organizational capacity and performance needs.
  • Collaborate with cross-functional teams to define and implement GPU infrastructure architecture that aligns with business objectives.
  • Evaluate GPU technologies and recommend the best hardware and software configurations.
  • Configure and deploy GPU servers, including installation and setup of hardware, software, and networking components.
  • Coordinate with vendors for procurement and installation of GPUs and related infrastructure.
  • Implement and manage GPU clustering setups for compute-intensive tasks.
  • Utilize monitoring tools to assess GPU performance metrics and system health.
  • Conduct benchmarking tests and analyze the results to identify performance bottlenecks.
  • Optimize workload distribution across GPU resources to ensure maximum efficiency.
  • Provide expert troubleshooting support, helping team members report and resolve GPU-related issues.
  • Maintain incident response protocols to address hardware and software failures swiftly and effectively.
  • Develop FAQs and knowledge base articles to streamline support processes for internal users.
  • Schedule and perform routine infrastructure maintenance, including updates to software, firmware, and drivers related to GPU systems.
  • Plan and execute capacity upgrades and expansions as needed, ensuring minimal disruption to services.
  • Conduct post-mortem analyses on significant incidents to improve overall system reliability.
  • Write scripts for automation of deployment, configuration management, and system monitoring tasks (e.g., Python, Bash).
  • Develop tools that increase productivity for engineering and data science teams using GPUs.
  • Implement Infrastructure as Code (IaC) practices for efficient and repeatable deployments.

Requirements

•   Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.

Technical Experience:

•   Proven expertise in software engineering, network engineering, or systems administration.

•   Hands-on experience with managing and debugging cloud backend server and networking infrastructure and services.

•   Strong understanding of enterprise network and cloud-based architectures, including experience working with Cisco and Azure.

•   Experience with cloud platforms providing GPU services (e.g., AWS, Google Cloud, Azure).

•   Understanding of virtualization and container technologies (e.g., Docker, Kubernetes) and server orchestration tools.

•   Knowledge of network configurations and storage solutions used in GPU environments.

•   Strong understanding of GPU architectures (NVIDIA CUDA, AMD ROCm, etc.).

•   Experience with AI/ML workloads, HPC, or rendering applications.

•   Familiarity with PCIe, memory subsystems (DDR, HBM), and high-speed I/O.

•   Understanding of Azure Pipelines and Azure DevOps.

•   Demonstrated knowledge of deploying servers and network infrastructure equipment at scale.

Specialized Skills:

•   Experience working with GPU hardware or related system engineering.

•   Experience with:

    o   Data center architecture and cloud infrastructure.

    o   Network infrastructure design and management in hybrid environments.

•   Certifications in relevant technologies, such as:

    o   Cisco (e.g., CCNA/CCNP).

    o   AZ-900 (Mandatory), AZ-104 (Optional).

    o   OCI Foundations Associate (Optional).

    o   ITIL or equivalent certifications (Optional).

Other comments

GPU knowledge is mandatory.
