Job Location: Remote
Please find the job description below for your reference.
Top Skills
GPU hardware, Azure Administrator, HPC
Job Summary
We are seeking a highly skilled GPU Infrastructure Engineer to join our team. This role focuses on the design, implementation, and management of enterprise network and cloud-based infrastructure to support evolving Azure cloud needs. The ideal candidate will have a strong background in software, network, or systems engineering, along with hands-on experience in managing large-scale cloud and data center operations.
Responsibilities
- Respond to incidents during regular on-call rotations and resolve issues efficiently to minimize downtime.
- Design and plan scalable GPU infrastructure solutions to meet organizational capacity and performance needs.
- Collaborate with cross-functional teams to define and implement GPU infrastructure architecture that aligns with business objectives.
- Evaluate GPU technologies and recommend the best hardware and software configurations.
- Configure and deploy GPU servers, including installation and setup of hardware, software, and networking components.
- Coordinate with vendors for procurement and installation of GPUs and related infrastructure.
- Implement and manage GPU clustering setups for compute-intensive tasks.
- Utilize monitoring tools to assess GPU performance metrics and system health.
- Conduct benchmarking tests and analyze the results to identify performance bottlenecks.
- Optimize workload distribution across GPU resources to ensure maximum efficiency.
- Provide expert troubleshooting support to diagnose and resolve GPU-related issues reported by team members.
- Maintain incident response protocols to address hardware and software failures swiftly and effectively.
- Develop FAQs and knowledge base articles to streamline support processes for internal users.
Infrastructure Maintenance:
- Schedule and perform routine maintenance, including updates to software, firmware, and drivers related to GPU systems.
- Plan and execute capacity upgrades and expansions as needed, ensuring minimal disruption to services.
- Conduct post-mortem analyses on significant incidents to improve overall system reliability.
- Write scripts for automation of deployment, configuration management, and system monitoring tasks (e.g., Python, Bash).
- Develop tools that increase productivity for engineering and data science teams using GPUs.
- Implement Infrastructure as Code (IaC) practices for efficient and repeatable deployments.
Requirements
• Bachelor’s or Master’s Degree in Computer Science, Information Technology, or a related field.
Technical Experience:
• Proven expertise in software engineering, network engineering, or systems administration.
• Hands-on experience managing and debugging cloud backend servers, networking infrastructure, and services.
• Strong understanding of enterprise network and cloud-based architectures, including experience working with Cisco and Azure.
• Experience with cloud platforms providing GPU services (e.g., AWS, Google Cloud, Azure).
• Understanding of virtualization and container technologies (e.g., Docker, Kubernetes) and server orchestration tools.
• Knowledge of network configurations and storage solutions used in GPU environments.
• Strong understanding of GPU architectures and compute platforms (NVIDIA CUDA, AMD ROCm, etc.).
• Experience with AI/ML workloads, HPC, or rendering applications.
• Familiarity with PCIe, memory subsystems (DDR, HBM), and high-speed I/O.
• Understanding of Azure Pipelines and Azure DevOps.
• Demonstrated knowledge of deploying servers and network infrastructure equipment at scale.
Specialized Skills:
• Experience working with GPU hardware or related system engineering.
• Experience with:
o Data center architecture and cloud infrastructure.
o Network infrastructure design and management in hybrid environments.
• Certifications in relevant technologies such as:
o Cisco (e.g., CCNA/CCNP).
o AZ-900 (Mandatory), AZ-104 (Optional).
o OCI Foundations Associate (Optional).
o ITIL or equivalent certifications (Optional).
Other comments
GPU knowledge is mandatory.


