Назад
9 дней назад

Reliability Engineer (Supercomputing)

350 000 - 475 000$
Формат работы
onsite
Тип работы
fulltime
Грейд
senior
Английский
b2
Страна
US
Релокация
US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Reliability Engineer (Supercomputing): Ensuring the stability and performance of large-scale GPU supercomputing clusters with an accent on hardware, firmware, and OS-level diagnostics. Focus on investigating complex hardware failures, automating fleet reliability monitoring, and collaborating with vendors to resolve root causes for frontier AI research.

Location: Must be based in San Francisco, California

Compensation: $350,000 - $475,000 USD

Company

Thinking Machines Lab is an AI research organization dedicated to advancing collaborative general intelligence and building widely used AI tools.

What you will do

  • Investigate, reproduce, and remediate issues across large GPU clusters.
  • Manage the drivers, kernel surface, and diagnostics spanning hardware, firmware, and OS.
  • Automate fleet reliability monitoring and analyze error rates to validate fixes.
  • Drive the firmware lifecycle, including tracking, qualification, and regression analysis.
  • Engage directly with hardware and server vendors to secure technical fixes and manage RMA processes.
  • Write detailed postmortems and vendor cases to drive systemic improvements.

Requirements

  • Must be based in San Francisco, California
  • Bachelor’s degree or equivalent experience in computer science or engineering.
  • Proficiency in Python or Rust.
  • Experience operating large-scale clusters and container orchestration systems like Kubernetes or Slurm.
  • Strong ability to own projects end-to-end and work across cross-functional teams.
  • Visa sponsorship available for qualified candidates.

Nice to have

  • Fluency with Linux systems and kernel-level debugging.
  • Experience with out-of-band management (BMC, iDRAC, IPMI, Redfish).
  • Deep knowledge of GPU hardware health (Xid errors, NVLink, NVSwitch, DCGM).
  • Proven statistical rigor in reliability analysis.

Culture & Benefits

  • Generous health, dental, and vision insurance.
  • Unlimited PTO and paid parental leave.
  • Relocation support provided as needed.
  • Collaborative environment working on frontier AI research.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →