Posted 12 days ago

Senior Infrastructure Support Engineer (GPUs)

Work format
onsite
Employment type
full-time
Grade
senior
English
B2
Country
Singapore
Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Infrastructure Support Engineer (GPUs): Maintaining and optimizing high-performance GPU cloud infrastructure for AI workloads, with an emphasis on Linux systems engineering, Kubernetes, and high-speed networking. The role focuses on resolving complex technical incidents, automating operational tasks, and improving system observability.

Location: Singapore (Onsite)

Company

Nscale is a GPU cloud engineered specifically for AI, providing cost-effective, high-performance infrastructure for AI start-ups and large enterprises.

What you will do

  • Participate in the Support duty rotation, collaborating with Infrastructure, SRE, and Product Engineering on incidents and changes.
  • Proactively improve dashboards, alerts, and runbooks to prevent repeat incidents.
  • Manage and resolve technical tickets while keeping internal and external stakeholders informed.
  • Design and implement automation scripts and tools to optimize operational processes.
  • Conduct root cause analysis (RCA) for major incidents and recommend long-term architectural fixes.
  • Respond to critical incidents outside business hours as part of an on-call rotation.

Requirements

  • Location: Must be based in Singapore, with the ability to provide onsite technical expertise.
  • Expertise in Linux systems engineering at scale, including kernel modules and networking stack troubleshooting.
  • Experience operating and troubleshooting Kubernetes (K8s) clusters.
  • Practical experience with GPU platforms (NVIDIA/AMD), including drivers, nvidia-smi, and NCCL diagnostics.
  • Strong networking fundamentals: L2/L3, BGP, VLANs, VXLAN, and high-performance fabrics (RDMA/NVLink).
  • Proficiency in Bash, Python, or JavaScript, and infrastructure automation tools (Ansible, Terraform, Puppet, or Chef).

Nice to have

  • Experience with automated network deployment and configuration in critical environments.
  • Knowledge of GPU HPC concepts, including InfiniBand, MPI, and Pyxis/Enroot.
  • Experience building CI/CD pipelines using GitOps tooling and GitHub Actions.

Culture & Benefits

  • Culture of relentless innovation, ownership, and high accountability.
  • Environment based on openness, transparency, and candid communication.
  • Customer-centric focus with a commitment to delivering impactful AI solutions.
  • Strong emphasis on sustainability and long-term environmental responsibility.
  • Inclusive workplace with an equal opportunities policy covering candidates of diverse backgrounds.
