Posted 2 months ago

Principal Deployment Engineer (AI Infrastructure)

Work format
onsite
Employment type
fulltime
Grade
senior
English
b2
Country
US
Listing from Hirify.Global, a list of international tech companies

Job description

TL;DR

Principal Deployment Engineer (AI Infrastructure): Lead hands-on bringup of GPU clusters in data center environments, with an emphasis on hardware integration, high-speed networking, and performance validation. The role focuses on building repeatable deployment processes, troubleshooting complex distributed systems, and ensuring production readiness for large-scale AI workloads.

Location: Must be based in the United States (Travel Required)

Company

Nscale is a startup building next-generation AI infrastructure, delivering performant and scalable GPU clusters for frontier AI training and inference.

What you will do

  • Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.
  • Validate BIOS, BMC, firmware configurations, and GPU health.
  • Configure and validate high-speed network fabrics including InfiniBand and RoCE.
  • Perform cluster-wide burn-in, stress testing, and performance validation using NCCL and RDMA.
  • Contribute to automation for provisioning and improve deployment playbooks.
  • Coordinate with hardware vendors and cross-functional teams to resolve bringup issues.

Requirements

  • Must be based in the United States and comfortable with travel.
  • 7–8+ years in infrastructure engineering, hardware deployment, or data center operations.
  • Hands-on experience deploying GPU servers such as HGX or DGX platforms.
  • Strong knowledge of high-speed networking fabrics (such as InfiniBand and RoCE) and Linux systems.
  • Experience troubleshooting distributed systems performance issues.

Nice to have

  • Experience in AI/ML infrastructure or HPC environments.
  • Familiarity with NCCL, CUDA, and RDMA.
  • Automation skills using Python, Ansible, Terraform, or Bash.
  • Experience in high-density power and cooling environments.

Culture & Benefits

  • Opportunity to build foundational AI infrastructure from zero to scale.
  • Fast-paced startup environment with a bias toward action and ownership.
  • Direct impact on the foundational technology powering frontier AI workloads.
